How do I perform a conditional join on a statement that contains both AND and OR operators?

k3bvogb1 posted on 2022-12-20 in Other

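I have a large dataset and want to form all pairs of rows that satisfy a certain condition, then compute some variables depending on which parts of the condition are met. The following MWE illustrates what I am trying to achieve: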

library(data.table)
set.seed(1234)

IDs <- data.table(id = letters[1:10],
                  c1 = sample(1:5, 10, replace = TRUE),
                  c2 = sample(1:5, 10, replace = TRUE),
                  c3 = sample(1:5, 10, replace = TRUE),
                  c = 1)

IDs.joined <- IDs[IDs, on = 'c', allow.cartesian = TRUE
                  ][c1 != i.c1 & (c2 == i.c2 | c3 == i.c3)  # condition defining which pairs are joined
                  ][, c('Ic2', 'Ic3') := .(c2 == i.c2, c3 == i.c3)
                  ][, overlap_id := fifelse(Ic2 == 1, 2, 3)
                  ][, overlap := Ic2 + Ic3
                  ][, -c('i.c1', 'i.c2', 'i.c3', 'Ic2', 'Ic3')]

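The problem is that the full dataset is far too large (about 5 million rows) to form a Cartesian join with itself. My question is: is there a way to use data.table's syntax to perform such a conditional join directly, without first going through a Cartesian join and then imposing the desired condition?
I have seen similar questions on SO, but those can usually be expressed as a rolling join, and I am not aware of a way to include an X | Y statement or an X != Y condition in the rolling-join syntax.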

bcs8qyzn #1

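So far, I have found that for relatively simple cases like this one, the best option is to bind together the results of multiple joins. It isn't pretty, but it is fast and memory-efficient.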
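
As a rough illustration of why binding several joins can reproduce the original filter (this check is an added sketch, not part of the original answer, and the vectors are made-up sample data): for values without NA, c1 != i.c1 is the same as c1 > i.c1 | c1 < i.c1, and the AND distributes over the OR, so the condition splits into four pieces that can each be written as a join condition.

```r
set.seed(42)
n <- 1000

# Hypothetical sample values standing in for the x.* and i.* columns of a row pair
c1 <- sample(5, n, replace = TRUE); i.c1 <- sample(5, n, replace = TRUE)
c2 <- sample(5, n, replace = TRUE); i.c2 <- sample(5, n, replace = TRUE)
c3 <- sample(5, n, replace = TRUE); i.c3 <- sample(5, n, replace = TRUE)

# The condition from the question
original <- c1 != i.c1 & (c2 == i.c2 | c3 == i.c3)

# The same condition written as a union of four join-friendly pieces
pieces <- (c1 > i.c1 & c2 == i.c2) |
          (c1 < i.c1 & c2 == i.c2) |
          (c1 > i.c1 & c3 == i.c3) |
          (c1 < i.c1 & c3 == i.c3)

identical(original, pieces)
```

The pieces overlap whenever both c2 and c3 match, which is why the bound results need a de-duplication step afterwards.
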
Data:

library(data.table)
set.seed(1234)

IDs <- data.table(id = 1:1e4,
                  c1 = sample(5e3, 1e4, replace = TRUE),
                  c2 = sample(5e3, 1e4, replace = TRUE),
                  c3 = sample(5e3, 1e4, replace = TRUE))

The original solution, with a single Cartesian join followed by the required filtering:

f1 <- function(dt) {
  on.exit(try(dt[,c := NULL], TRUE))
  dt[
    ,c := 0L
  ][
    dt, on = 'c', allow.cartesian = TRUE
  ][
    c1 != i.c1 & (c2 == i.c2 | c3 == i.c3)
  ][
    ,c := NULL
  ]
}

A solution using four non-equi joins, followed by filtering out duplicates:

f2 <- function(dt) {
  setorder(
    rbindlist(
      list(
        dt[dt, on = .(c1 > c1, c2 == c2), .(id = x.id, c1 = x.c1, c2 = x.c2, c3 = x.c3, i.id = i.id, i.c1 = i.c1, i.c2 = i.c2, i.c3 = i.c3), nomatch = 0],
        dt[dt, on = .(c1 < c1, c2 == c2), .(id = x.id, c1 = x.c1, c2 = x.c2, c3 = x.c3, i.id = i.id, i.c1 = i.c1, i.c2 = i.c2, i.c3 = i.c3), nomatch = 0],
        dt[dt, on = .(c1 > c1, c3 == c3), .(id = x.id, c1 = x.c1, c2 = x.c2, c3 = x.c3, i.id = i.id, i.c1 = i.c1, i.c2 = i.c2, i.c3 = i.c3), nomatch = 0],
        dt[dt, on = .(c1 < c1, c3 == c3), .(id = x.id, c1 = x.c1, c2 = x.c2, c3 = x.c3, i.id = i.id, i.c1 = i.c1, i.c2 = i.c2, i.c3 = i.c3), nomatch = 0]
      )
    )
  )[
    c2 != i.c2 | c3 != i.c3 | rep(!0:1, .N/2L)
  ]
}
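
The final filter in f2 appears to rely on the following: a pair that matches on both c2 and c3 is returned by two of the four joins, so after setorder() its two identical rows sit next to each other, and an alternating TRUE/FALSE mask keeps exactly one of them because adjacent positions always differ in parity. A toy sketch of that idea (the table and its column k are invented for illustration, and it assumes every duplicated row occurs exactly twice):

```r
library(data.table)

# Sorted toy data: the values 2 and 4 each appear exactly twice, side by side
dt <- data.table(k = c(1, 2, 2, 3, 4, 4))

# Rows that belong to a duplicated pair
is_dup <- dt$k %in% dt$k[duplicated(dt$k)]

# Alternating mask: TRUE, FALSE, TRUE, FALSE, ...
mask <- rep(c(TRUE, FALSE), length.out = nrow(dt))

# Keep every non-duplicated row, plus the one row of each pair that the mask selects
keep <- !is_dup | mask
dt[keep]$k  # 1 2 3 4
```

On this toy input the result matches unique(dt), but the mask avoids hashing or comparing whole rows.
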

A solution using two Cartesian joins (equi-joins with allow.cartesian = TRUE), each followed by additional filtering:

f3 <- function(dt) {
  rbindlist(
    list(
      dt[dt, on = "c3", allow.cartesian = TRUE][c1 != i.c1][, i.c3 := c3],
      dt[dt, on = "c2", allow.cartesian = TRUE][c1 != i.c1 & c3 != i.c3][, i.c2 := c2]
    ),
     use.names = TRUE
  )
}
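
Because an equi-join of a table with itself returns one row per ordered pair of rows sharing a key value (including each row paired with itself), the size of each intermediate join in f3 can be estimated up front as the sum of squared group sizes. A small sketch, where join_rows is a hypothetical helper name rather than a data.table function, and the key column is assumed to contain no NA:

```r
library(data.table)

# Hypothetical helper: number of rows dt[dt, on = col] would return.
# Assumes the key column has no NA values (table() drops them).
join_rows <- function(dt, col) {
  group_sizes <- as.numeric(table(dt[[col]]))  # rows per distinct key value
  sum(group_sizes^2)                           # a group of size N yields N^2 pairs
}

toy <- data.table(id = 1:6, c2 = c(1, 1, 2, 3, 3, 3))
join_rows(toy, "c2")  # 2^2 + 1^2 + 3^2 = 14
```

Running this on the real key columns before attempting the joins gives an idea of whether the intermediate tables will fit in memory.
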

Benchmark:

microbenchmark::microbenchmark(f1(IDs),
                               f2(IDs),
                               f3(IDs),
                               times = 10)
#> Unit: milliseconds
#>     expr       min        lq       mean     median        uq       max neval
#>  f1(IDs) 2292.6852 2400.9380 2398.13919 2406.86010 2418.6276 2459.2945    10
#>  f2(IDs)  366.6586  367.2055  373.34916  373.99310  378.5482  383.0736    10
#>  f3(IDs)    6.5007    6.5537    7.24775    6.88965    7.0150   10.6413    10

identical(setorder(f1(IDs)), f2(IDs))
#> [1] TRUE
identical(f2(IDs), setorder(f3(IDs)))
#> [1] TRUE
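
To get back to the overlap_id and overlap columns from the question, the two-join approach of f3 could be extended along the following lines. This is only a sketch: pairs_with_overlap is a hypothetical name, and it assumes the column definitions from the question (overlap_id is 2 when c2 matches and 3 otherwise; overlap counts how many of c2 and c3 match).

```r
library(data.table)

pairs_with_overlap <- function(dt) {
  # Pairs sharing c3; c2 may or may not match as well
  by_c3 <- dt[dt, on = "c3", allow.cartesian = TRUE][c1 != i.c1]
  by_c3[, `:=`(overlap_id = fifelse(c2 == i.c2, 2, 3),
               overlap    = 1L + (c2 == i.c2))]

  # Pairs sharing c2 but not c3, so only one column overlaps
  by_c2 <- dt[dt, on = "c2", allow.cartesian = TRUE][c1 != i.c1 & c3 != i.c3]
  by_c2[, `:=`(overlap_id = 2, overlap = 1L)]

  rbindlist(list(
    by_c3[, .(id, i.id, overlap_id, overlap)],
    by_c2[, .(id, i.id, overlap_id, overlap)]
  ))
}

# Usage on data shaped like the question's MWE
set.seed(1234)
IDs_small <- data.table(id = letters[1:10],
                        c1 = sample(1:5, 10, replace = TRUE),
                        c2 = sample(1:5, 10, replace = TRUE),
                        c3 = sample(1:5, 10, replace = TRUE))
res <- pairs_with_overlap(IDs_small)
```

Since the two joins cover disjoint sets of pairs (c3 matches versus c3 does not match), no de-duplication step is needed here.
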
