identical(X1, X2) is TRUE, but digest::sha1(X1) != digest::sha1(X2)

r3i60tvu posted on 2023-07-31 in Other

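I have several large data.tables saved to disk as .rds files. I am looking for ways to reduce the time needed to import the data, and I am looking at the feather package. Part of my pipeline checks for any change in the input data sets based on digest::sha1(). As the example below shows, a data.table saved as rds can be read back in and the digest::sha1() values are equal. However, data saved as a .feather file, read in, and converted back to the same data.table results in a different sha1 hash. I am confused because the all.equal and identical checks return TRUE, yet the hashes differ.
Why does this happen? Is it possible to get the same hash with this kind of workflow? If I cannot rely on the hash, how can I easily check whether the data has changed? (The real data is several million rows by a few hundred columns.)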

library(data.table)
library(feather)

# build an example data set
set.seed(42)

original_data_table <-
  data.table(
    x  = rnorm(100),
    y  = factor(sample(1:3, size = 100, replace = TRUE), levels = 1:3, labels = c("lvl1", "lvl2", "lvl3")),
    id = paste0("subject_", 1:100)
  )

data.table::setkey(original_data_table, id)

# write data as rds
original_data_table_rds <- tempfile()
original_data_table_feather <- tempfile()
saveRDS(object = original_data_table, file = original_data_table_rds)
feather::write_feather(x = original_data_table, path = original_data_table_feather)

# read in the data objects
from_rds        <- readRDS(original_data_table_rds)
from_rds_setted <- data.table::setDT(readRDS(original_data_table_rds))
from_feather    <- feather::read_feather(original_data_table_feather)

# translate from_feather from tibble to data.table
data.table::setDT(from_feather)
data.table::setkey(from_feather, id)

# check that objects are equal, and even identical, to the original
all.equal(from_rds, original_data_table)         # TRUE
#> [1] TRUE
all.equal(from_rds_setted, original_data_table)  # TRUE
#> [1] TRUE
all.equal(from_feather, original_data_table)     # TRUE
#> [1] TRUE

identical(from_rds, original_data_table)         # FALSE
#> [1] FALSE
identical(from_rds_setted, original_data_table)  # TRUE
#> [1] TRUE
identical(from_feather, original_data_table)     # TRUE
#> [1] TRUE

digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Denver
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] feather_0.3.5     data.table_1.14.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.33   utf8_1.2.3      fastmap_1.1.1   xfun_0.39      
#>  [5] magrittr_2.0.3  glue_1.6.2      tibble_3.2.1    knitr_1.43     
#>  [9] pkgconfig_2.0.3 htmltools_0.5.5 rmarkdown_2.23  lifecycle_1.0.3
#> [13] cli_3.6.1       fansi_1.0.4     vctrs_0.6.3     reprex_2.0.2   
#> [17] withr_2.5.0     compiler_4.3.1  tools_4.3.1     hms_1.1.3      
#> [21] pillar_1.9.0    evaluate_0.21   Rcpp_1.0.11     yaml_2.3.7     
#> [25] rlang_1.1.1     fs_1.6.2

Created on 2023-07-16 with reprex v2.0.2

Answer 1 (fnvucqvd)

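Thanks to @流浪汉's comment on the question, I found a solution. @流浪汉 is right that this has to do with attributes. The attributes of original_data_table and from_feather are the same as a set, but the order in which the list elements are supplied differs.
Setting attrib.as.set = FALSE in the identical call shows that there is a difference. Looking at the attribute names of original_data_table and from_feather, we can see that the first and third elements are swapped between the two objects. After setting the order of from_feather's attributes to match that of original_data_table, digest::sha1 returns the expected value.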
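
The effect does not depend on feather or data.table. As a minimal base-R sketch (the objects `a` and `b` and the attribute names `foo` and `bar` are made up for illustration), two vectors with the same attributes attached in a different order compare as identical by default, but not once attribute order is taken into account:

```r
# Same values and the same attributes, attached in a different order
a <- structure(1:3, foo = "x", bar = "y")
b <- structure(1:3, bar = "y", foo = "x")

# Default comparison treats attributes as an unordered set
same_as_set <- identical(a, b)

# With attrib.as.set = FALSE the order of the attributes matters as well
same_in_order <- identical(a, b, attrib.as.set = FALSE)

print(names(attributes(a)))
print(names(attributes(b)))
print(c(as_set = same_as_set, in_order = same_in_order))
```
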

digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "cf8cefcc706cbfe343986aa366d44bf2bf965712"

identical(from_feather, original_data_table, attrib.as.set = FALSE)
#> [1] FALSE
attributes(original_data_table) |> names()
#> [1] "names"             "row.names"         "class"            
#> [4] ".internal.selfref" "sorted"
attributes(from_feather) |> names()
#> [1] "class"             "row.names"         "names"            
#> [4] ".internal.selfref" "sorted"

attributes(from_feather) <- attributes(from_feather)[c("names", "row.names", "class", ".internal.selfref", "sorted")]
identical(from_feather, original_data_table, attrib.as.set = FALSE)
#> [1] TRUE
digest::sha1(original_data_table)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_rds_setted)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"
digest::sha1(from_feather)
#> [1] "4036943c7692f18fc4be615eb1821c29fa17935a"

Created on 2023-07-17 with reprex v2.0.2
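
If the reordering is needed in several places, it could be wrapped in a small helper. The sketch below is only an illustration and not part of the fix shown above: `normalize_attr_order` is a hypothetical name, and sorting the attribute names with radix ordering is just one possible choice of canonical order. The demonstration uses a plain data.frame so that it runs with base R alone.

```r
# Hypothetical helper: put an object's attributes into a fixed (sorted) order
normalize_attr_order <- function(x) {
  a <- attributes(x)
  if (is.null(a)) return(x)
  # method = "radix" sorts as in the C locale, so the result is locale-independent
  attributes(x) <- a[order(names(a), method = "radix")]
  x
}

# Base-R illustration: the same data frame with attributes stored in another order
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- df1
attributes(df2) <- attributes(df2)[c("class", "row.names", "names")]

before <- identical(df1, df2, attrib.as.set = FALSE)
after  <- identical(normalize_attr_order(df1), normalize_attr_order(df2),
                    attrib.as.set = FALSE)
print(c(before = before, after = after))
```

Assuming the digest package is installed, the hash would then be taken on the normalized object, for example digest::sha1(normalize_attr_order(from_feather)), applied to every object being compared. Whether that gives matching hashes for a particular data set should be checked rather than assumed. Note also that the helper returns a modified copy, so for a data.table it may be worth calling data.table::setDT() on the result before modifying it by reference; that part is untested here.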
