R语言 使用精确值加上最接近的值执行左联接

eivgtgni  于 2023-04-09  发布在  其他
关注(0)|答案(3)|浏览(167)

我有两个数据集:

table1 <- data.frame(id=c(1000,1001,1002,1003), 
                    date=as.POSIXct(c("2012-05-13","2012-09-23","2011-04-09","2014-11-08"))) 

table2 <- data.frame(id2=c(1000,1000,1001,1002,1003,1003), 
                   date2=as.POSIXct(c("2012-05-13","2012-05-16","2012-09-24","2011-04-15","2014-11-09", "2014-11-10")))

我想根据匹配的ID和Date对table 1执行左连接,但并非所有日期都完全匹配,因此我想知道如何根据最近的一天连接日期?例如,对于id 1001,“2012-09-23”将匹配id 2 1001的“2012-09-24”,因为它是id 2的唯一日期,并且对于1003,“2014-11-08”将匹配“2014-11-09”1003,因为它是最接近的一天。
预期结果:

id       date      date2
1 1000 2012-05-13 2012-05-13
2 1001 2012-09-23 2012-09-24
3 1002 2011-04-09 2011-04-15
4 1003 2014-11-08 2014-11-09
6jjcrrmo

6jjcrrmo1#

我还建议您遵循非等data.table连接,但如果您出于任何原因想要坚持使用dplyr,并且您的数据不是很大或者您有足够的内存,您也可以尝试:

library(dplyr)

table1 %>%
  left_join(table2, by = c("id" = "id2")) %>%
  group_by(id) %>%
  slice(which.min(abs(date - date2)))

输出:

# A tibble: 4 x 3
# Groups:   id [4]
     id date                date2              
  <dbl> <dttm>              <dttm>             
1  1000 2012-05-13 00:00:00 2012-05-13 00:00:00
2  1001 2012-09-23 00:00:00 2012-09-24 00:00:00
3  1002 2011-04-09 00:00:00 2011-04-15 00:00:00
4  1003 2014-11-08 00:00:00 2014-11-09 00:00:00
dgenwo3n

dgenwo3n2#

使用data.table执行到最近值的滚动连接。DT 1是通过引用更新的,因此它应该非常快,即使在大型/大数据上也是如此

library(data.table)

样本数据

dt1 <- as.data.table(table1)
dt2 <- as.data.table(table2)

编码

dt1[, date2 := dt2[dt1, date2, on = c("id2 == id", "date2 == date"), roll = "nearest"]][]

输出

#      id       date      date2
# 1: 1000 2012-05-13 2012-05-13
# 2: 1001 2012-09-23 2012-09-23
# 3: 1002 2011-04-09 2011-04-09
# 4: 1003 2014-11-08 2014-11-08
a0zr77ik

a0zr77ik3#

随着dplyr 1.1.0的出现,它获得了两个辅助函数join_by()closest,现在它非常容易

table1 <- data.frame(id=c(1000,1001,1002,1003), 
                     date=as.POSIXct(c("2012-05-13","2012-09-23","2011-04-09","2014-11-08"))) 

table2 <- data.frame(id2=c(1000,1000,1001,1002,1003,1003), 
                     date2=as.POSIXct(c("2012-05-13","2012-05-16","2012-09-24","2011-04-15","2014-11-09", "2014-11-10"))) 
library(dplyr)
#> 

table1
#>     id       date
#> 1 1000 2012-05-13
#> 2 1001 2012-09-23
#> 3 1002 2011-04-09
#> 4 1003 2014-11-08
table2
#>    id2      date2
#> 1 1000 2012-05-13
#> 2 1000 2012-05-16
#> 3 1001 2012-09-24
#> 4 1002 2011-04-15
#> 5 1003 2014-11-09
#> 6 1003 2014-11-10

table1 %>% 
  left_join(table2, by = join_by(id == id2,
                                 closest(date <= date2)))
#>     id       date      date2
#> 1 1000 2012-05-13 2012-05-13
#> 2 1001 2012-09-23 2012-09-24
#> 3 1002 2011-04-09 2011-04-15
#> 4 1003 2014-11-08 2014-11-09

创建于2023-04-07带有reprex v2.0.2

相关问题