优化R中多个嵌套的for循环

yyyllmsg  于 2023-09-27  发布在  其他
关注(0)|答案(1)|浏览(73)

我正在准备上传到数据库的文件。数据由层次结构中的多个表组成,表通常在层次结构中更大(更多行)。表是不同数据类型的混合,并且具有不同的列数。
每个表的每一行都通过一个ID号链接到层次结构中它上面的表中的一行-本质上是行号(尽管匡威不一定是真的;不是表中的每一行都必须链接到层次结构中它下面的表中的行)。
上传文件的结构如下:最高表的第一行(例如A),然后是链接到它的下一个表(B)的行,然后是链接到B的C的行,然后是链接到C的D中的行,等等。如果没有找到更低的链接,它将停留在层次结构中的该级别,并转到该表中的下一行,然后从那里继续向下。一旦没有更多的链接到上表的相关行,它就会上升一个级别并转到 that 表的下一行。没有一行是重复的,所以最后一个表的nrow是所有表的nrow之和。希望这个(和下面的例子)是有意义的。
我使用以下嵌套的for循环来构建输出,但它非常慢。实际的数据是8或9个嵌套表,有数千行(最终的表是75 K行,在我的Apple Silicon MBP上运行大约需要10-15分钟)。
如何优化或做得更好?我尝试过使用apply函数,但真正挣扎的是如何将函数环境的输出提供给上面的迭代并生成最终的表。
或者有没有一种使用dplyr样式连接的方法?还是data.table语法?还是一种以何种顺序提取哪些行的主索引的方法?
我对数据库不是很有经验,所以也许这是一个常见的结构,还有其他的包可以做这样的事情?

# Create some tables
tableA <- data.frame(Table = "A",
                     A.id = 1:3,
                     A.link.up = NA, # because it's the top table
                     A.data1 = c(10,20,20),
                     A.data2 = c("text", "moretext", "othertext"))

tableB <- data.frame(Table = "B",
                     B.id = 1:6,
                     B.link.up = c(1,1,2,2,2,3),
                     B.data1 = c("text", "difftext", "wtext", "xtext", "ytext", "ztext"),
                     B.data2 = "2023-02-09",
                     B.data3 = 1001:1006,
                     B.data4 = 55:60)

tableC <- data.frame(Table = "C",
                     C.id = 1:10,
                     C.link.up = c(1,2,2,3,4,4,4,4,6,6),
                     C.data1 = 501:510,
                     C.data2 = "blah",
                     C.data3 = "zzz")

# Set up final output table
final_table <- data.frame(matrix(NA, nrow = 0, ncol = 7))
names(final_table) <- c("Originating.Table", "Row.id", "Link.upwards",
                        "various", "different", "data", "formats")

# A
for(a in 1:nrow(tableA)) { # for every row in A
  A.row <- tableA[a,]      # extract row
  final_table[nrow(final_table)+1,1:length(A.row)] <- A.row # Copy row to next row of final table
  A.id <- tableA$A.id[a]   # get that row id
  B.sub <- tableB[which(tableB$B.link.up == A.id),] # Subset rows in B that link to that row in A
  # B
  for(b in 1:nrow(B.sub)) { # for every row in this SUBSET of B
    B.row <- B.sub[b,]      # extract that row
    final_table[nrow(final_table)+1,1:length(B.row)] <- B.row # Copy to final table
    B.id <- B.sub$B.id[b]   # get that row id
    C.sub <- tableC[which(tableC$C.link.up == B.id),] # Subset rows in C that link to that row in B
    # C
    if(nrow(C.sub) > 0) { # because not every row in every table is linked to from lower tables
      for(c in 1:nrow(C.sub)) {
        C.row <- C.sub[c,]
        final_table[nrow(final_table)+1,1:length(C.row)] <- C.row
        # C.id <- tableC$C.id[c] # don't need - only required for subsetting further lower tables
      }
    }
  }
}

几个注意事项:

  • 是的,我知道最后一次迭代(对于tableC)是多余的--我可以直接将子集复制到最后一个表。我留下它是为了给予你知道兔子洞能有多深。
  • 正如我所说的,这些表有不同的列数,所以rbind不起作用(据我所知),因此设置了最终的表df,并在循环中笨拙地复制到它。
  • 是的,C中没有任何内容与B5相关联;每一行都向上连接,但不是每一行都向下连接
  • 链接ID号不一定如在该示例中那样按数字顺序一一它们可以链接到它们上方的表的 * 任何 * 行。但这并没有给我带来麻烦。
ubby3x7f

ubby3x7f1#

这两种方法都能更快地得到你想要的东西。

选项一

首先是一些预处理:

library(data.table)

# column names of final output
nms <- c("Originating.Table", "Row.id", "Link.upwards",
         "various", "different", "data", "formats")

# create a list of tables ordered hierarchically
tables <- list(tableA = tableA, tableB = tableB, tableC = tableC)
# possibly: tables <- do.call("list", mget(ls(pattern = "table")))
# create a list of tables containing the hierarchical positioning
dts <- vector("list", length(tables))
dts[[1]] <- as.data.table(
  cbind(
    1:nrow(tables[[1]]),
    matrix(0L, nrow(tables[[1]]), length(tables) - 1L)
  )
)
# rename the table according to the final output
setnames(tables[[1]], nms[1:ncol(tables[[1]])])

for (i in 2:length(dts)) {
  # assuming the `link.up` column is always third, copy the positioning of the
  # next table up in the hierarchy
  dts[[i]] <- dts[[i - 1L]][tables[[i]][,3]]
  # the positioning for this level in the hierarchy
  dts[[i]][[i]] <- 1:nrow(tables[[i]])
  # rename the table according to the final output
  setnames(tables[[i]], nms[1:ncol(tables[[i]])])
}

最终表格:

# bind the tables, filling with NA, then reorder based on the hierarchy
# ordering using `dts`
rbindlist(tables, fill = TRUE)[setorder(rbindlist(dts)[,r := .I])$r]
#>     Originating.Table Row.id Link.upwards  various  different data formats
#>  1:                 A      1           NA       10       text <NA>      NA
#>  2:                 B      1            1     text 2023-02-09 1001      55
#>  3:                 C      1            1      501       blah  zzz      NA
#>  4:                 B      2            1 difftext 2023-02-09 1002      56
#>  5:                 C      2            2      502       blah  zzz      NA
#>  6:                 C      3            2      503       blah  zzz      NA
#>  7:                 A      2           NA       20   moretext <NA>      NA
#>  8:                 B      3            2    wtext 2023-02-09 1003      57
#>  9:                 C      4            3      504       blah  zzz      NA
#> 10:                 B      4            2    xtext 2023-02-09 1004      58
#> 11:                 C      5            4      505       blah  zzz      NA
#> 12:                 C      6            4      506       blah  zzz      NA
#> 13:                 C      7            4      507       blah  zzz      NA
#> 14:                 C      8            4      508       blah  zzz      NA
#> 15:                 B      5            2    ytext 2023-02-09 1005      59
#> 16:                 A      3           NA       20  othertext <NA>      NA
#> 17:                 B      6            3    ztext 2023-02-09 1006      60
#> 18:                 C      9            6      509       blah  zzz      NA
#> 19:                 C     10            6      510       blah  zzz      NA

选项二

可能会更慢,但应该更有内存效率。
预处理:

# column names of final output
nms <- c("Originating.Table", "Row.id", "Link.upwards",
         "various", "different", "data", "formats")

# create a list of tables ordered hierarchically
tables <- lapply(
  mget(ls(pattern = "table")), \(x) setnames(setDT(x), nms[1:ncol(x)])
)

# temporarily change `Link.upwards` for the first table
tables[[1]][,Link.upwards := as.list(Row.id)]

i <- unique(
  unlist(
    setDT(
      Reduce(
        \(x, y) {
          y[
            x,
            .(
              Row.id = Row.id,
              Link.upwards = as.list(
                as.data.frame(rbind(as.data.frame(i.Link.upwards), Row.id))
              )
            ), on = .(Link.upwards = Row.id), nomatch = NA
          ]
        },
        tables
      )[[2]]
    ) + c(0L, cumsum(sapply(tables[-length(tables)], nrow))),
    1, 0
  )
)

# change `Link.upwards` back to NA for the first table
tables[[1]][[3]] <- NA

最终表格:

rbindlist(tables, fill = TRUE)[i[!is.na(i)]]
#>     Originating.Table Row.id Link.upwards  various  different data formats
#>  1:                 A      1           NA       10       text <NA>      NA
#>  2:                 B      1            1     text 2023-02-09 1001      55
#>  3:                 C      1            1      501       blah  zzz      NA
#>  4:                 B      2            1 difftext 2023-02-09 1002      56
#>  5:                 C      2            2      502       blah  zzz      NA
#>  6:                 C      3            2      503       blah  zzz      NA
#>  7:                 A      2           NA       20   moretext <NA>      NA
#>  8:                 B      3            2    wtext 2023-02-09 1003      57
#>  9:                 C      4            3      504       blah  zzz      NA
#> 10:                 B      4            2    xtext 2023-02-09 1004      58
#> 11:                 C      5            4      505       blah  zzz      NA
#> 12:                 C      6            4      506       blah  zzz      NA
#> 13:                 C      7            4      507       blah  zzz      NA
#> 14:                 C      8            4      508       blah  zzz      NA
#> 15:                 B      5            2    ytext 2023-02-09 1005      59
#> 16:                 A      3           NA       20  othertext <NA>      NA
#> 17:                 B      6            3    ztext 2023-02-09 1006      60
#> 18:                 C      9            6      509       blah  zzz      NA
#> 19:                 C     10            6      510       blah  zzz      NA

相关问题