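I have several large data.tables saved to disk as .rds files, and I am looking for ways to reduce the time needed to import the data. I am looking at the feather package. Part of my pipeline checks for any changes to the input data sets based on digest::sha1(). As the example below shows, a data.table saved as rds can be read back in and its digest::sha1() matches the original. However, data saved as a .feather file, read in, and converted back to the same data.table yields a different sha1 hash. I'm confused because the all.equal and identical checks return TRUE, yet the hashes differ.
Why is this? Is it possible to get the same hash with this kind of workflow? If I can't rely on the hash, how can I easily check whether the data has changed? (The real data is several million rows by a few hundred columns.)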
library(data.table)
library(feather)
# build an example data set
set.seed(42)
original_data_table <-
  data.table(
    x = rnorm(100),
    y = factor(sample(1:3, size = 100, replace = TRUE), levels = 1:3, labels = c("lvl1", "lvl2", "lvl3")),
    id = paste0("subject_", 1:100)
  )
data.table::setkey(original_data_table, id)
# write data as rds
original_data_table_rds <- tempfile()
original_data_table_feather <- tempfile()
saveRDS(object = original_data_table, file = original_data_table_rds)
feather::write_feather(x = original_data_table, path = original_data_table_feather)
# read in the data objects
from_rds <- readRDS(original_data_table_rds)
from_rds_setted <- data.table::setDT(readRDS(original_data_table_rds))
from_feather <- feather::read_feather(original_data_table_feather)
# translate from_feather from tibble to data.table
data.table::setDT(from_feather)
data.table::setkey(from_feather, id)
# check that objects are equal, and even identical, to the original
all.equal(from_rds, original_data_table) # TRUE
#> [1] TRUE
all.equal(from_rds_setted, original_data_table) # TRUE
#> [1] TRUE
all.equal(from_feather, original_data_table) # TRUE
#> [1] TRUE
identical(from_rds, original_data_table) # FALSE
#> [1] FALSE
identical(from_rds_setted, original_data_table) # TRUE
#> [1] TRUE
identical(from_feather, original_data_table) # TRUE
#> [1] TRUE
digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Denver
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] feather_0.3.5 data.table_1.14.8
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.33 utf8_1.2.3 fastmap_1.1.1 xfun_0.39
#> [5] magrittr_2.0.3 glue_1.6.2 tibble_3.2.1 knitr_1.43
#> [9] pkgconfig_2.0.3 htmltools_0.5.5 rmarkdown_2.23 lifecycle_1.0.3
#> [13] cli_3.6.1 fansi_1.0.4 vctrs_0.6.3 reprex_2.0.2
#> [17] withr_2.5.0 compiler_4.3.1 tools_4.3.1 hms_1.1.3
#> [21] pillar_1.9.0 evaluate_0.21 Rcpp_1.0.11 yaml_2.3.7
#> [25] rlang_1.1.1 fs_1.6.2
Created on 2023-07-16 with reprex v2.0.2
1 Answer
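Thanks to @流浪汉's comment on the question, I found a solution. @流浪汉 is right that this has to do with attributes. The attributes of original_data_table and from_feather are the same as a set, but the order in which the list elements are stored differs. Setting attrib.as.set = FALSE in the identical call shows that there is a difference. Comparing the attribute names of original_data_table and from_feather, we can see that the first and third elements have swapped positions between the two objects. After setting the order of from_feather's attribute elements to match that of original_data_table, digest::sha1 returns the expected value.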
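The following is a minimal base R sketch of the mechanism described above, not the original reprex. The objects `a` and `b` are hypothetical stand-ins for original_data_table and from_feather, and the attribute names `first` and `second` are made up for illustration. It uses serialize() as a proxy for hashing, on the assumption that the hash is computed from the serialized object, which is how digest::digest() behaves by default.

```r
# Two toy objects with the same attributes, stored in a different order.
# (Hypothetical stand-ins for original_data_table and from_feather.)
a <- structure(c(1, 2, 3), first = "x", second = "y")
b <- structure(c(1, 2, 3), second = "y", first = "x")

# The attribute names show the differing order.
names(attributes(a))
names(attributes(b))

# By default identical() treats attributes as an unordered set ...
same_as_set <- identical(a, b)
# ... while attrib.as.set = FALSE also compares their order.
same_ordered <- identical(a, b, attrib.as.set = FALSE)

# Serialization writes attributes in their stored order, so the byte
# streams (and anything hashed from them) differ.
same_bytes_before <- identical(serialize(a, NULL), serialize(b, NULL))

# Reorder b's attributes to follow the order used by a.
attributes(b) <- attributes(b)[names(attributes(a))]

same_ordered_after <- identical(a, b, attrib.as.set = FALSE)
same_bytes_after <- identical(serialize(a, NULL), serialize(b, NULL))
```

Applied to the objects in the question, the same reordering would use names(attributes(original_data_table)) to index attributes(from_feather); treat that as an untested adaptation of the sketch rather than the exact code from this answer.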
Created on 2023-07-17 with reprex v2.0.2