Concatenate strings by group with dplyr for multiple columns [duplicate]

Posted by sg3maiej on 2023-05-11 in Other


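This question already has answers here:

Collapse all columns by an ID column [duplicate] (5 answers)
Closed 6 years ago.
Hi, I need to concatenate strings by group across multiple columns. I realize that versions of this question have been asked several times (see Aggregating by unique identifier and concatenating related values into a string), but they usually involve concatenating the values of a single column.
My dataset looks like this: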

Sample  group   Gene1   Gene2   Gene3
A       1       a       NA      NA
A       2       b       NA      NA
B       1       NA      c       NA
C       1       a       NA      d
C       2       b       NA      e
C       3       c       NA      NA
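
For anyone who wants to reproduce the problem, the table above could be built roughly as follows. This is only a sketch: the name `df` matches the code used later in the thread, and the column types (numeric `group`, character gene columns) are an assumption, since the post does not state them.

```r
# Rebuild the example table as a data frame.
# Assumption: `group` is numeric and the gene columns are character.
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

print(df)
```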

I would like to convert it into a format where each sample takes up only one row (the group column is optional):

Sample  group   Gene1   Gene2   Gene3
A       1,2     a,b     NA      NA
B       1       NA      c       NA
C       1,2,3   a,b,c   NA      d,e

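Since the number of genes can run into the thousands, I cannot simply list the columns I want to concatenate. I know that aggregate or dplyr can be used to work by group, but I cannot figure out how to do this for multiple columns.
Thanks in advance!

Edit

Because my dataset is very large, with thousands of genes, I found dplyr too slow. I have been experimenting with data.table, and the following code also gives me what I want: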

setDT(df)[, lapply(.SD, function(x) paste(na.omit(x), collapse = ",")), by = Sample]

The output is now:

Sample group Gene1 Gene2 Gene3
1:      A   1,2   a,b            
2:      B     1           c      
3:      C 1,2,3 a,b,c         d,e
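
Note that this output has empty strings where the desired result shows NA. If NA is preferred, one possible variation is sketched below. The helper `collapse_or_na` is a hypothetical name introduced here, not part of the original post, and the example data is rebuilt under the assumption that `group` is numeric.

```r
library(data.table)

# Example data from the question (column types are an assumption).
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse the non-missing values of a vector,
# returning NA (not "") when every value in the group is missing.
collapse_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
}

# as.data.table() copies, so `df` itself is left unchanged.
dt  <- as.data.table(df)
res <- dt[, lapply(.SD, collapse_or_na), by = Sample]

print(res)
```

The helper always returns a character value, so every group yields the same column type, which data.table requires when combining per-group results.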

Thanks for your help!

Source: https://stackoverflow.com/questions/42288757/concatenate-strings-by-group-with-dplyr-for-multiple-columns

2 Answers

wljmcqd81#

The summarise_all, summarise_at, and summarise_if functions exist for this purpose. Using summarise_all:

df %>%
  group_by(Sample) %>%
  summarise_all(funs(paste(na.omit(.), collapse = ",")))

# A tibble: 3 × 5
  Sample group Gene1 Gene2 Gene3
   <chr> <chr> <chr> <chr> <chr>
1      A   1,2   a,b            
2      B     1           c      
3      C 1,2,3 a,b,c         d,e

Update: in current versions of dplyr, the recommended approach is to combine summarise with across, like this:

df %>%
  group_by(Sample) %>%
  summarise(across(everything(), \(x) paste(na.omit(x), collapse = ",")))
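
Since the question says the group column is optional, a variation on the across approach is to select only the gene columns by name and return NA for groups with no values. The sketch below assumes the gene columns all start with "Gene", as in the example data. `collapse_or_na` is a hypothetical helper, not a dplyr function.

```r
library(dplyr)

# Example data from the question (column types are an assumption).
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse non-missing values, or NA if none remain.
collapse_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
}

# Summarise only the columns whose names start with "Gene";
# the `group` column is dropped from the result.
res <- df %>%
  group_by(Sample) %>%
  summarise(across(starts_with("Gene"), collapse_or_na))

print(res)
```

With thousands of gene columns, a name-based selector such as starts_with avoids listing them individually. Whether it applies depends on how the real columns are named.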


czq61nw12#

With dplyr, you can try:

dft %>%
  group_by(Sample) %>%
  summarise_each(funs( toString(unique(.))))

which gives:

# A tibble: 3 × 5
  Sample   group   Gene1 Gene2    Gene3
   <chr>   <chr>   <chr> <chr>    <chr>
1      A    1, 2    a, b    NA       NA
2      B       1      NA     c       NA
3      C 1, 2, 3 a, b, c    NA d, e, NA

Edit: @Axeman had the right idea; using na.omit(.) gets rid of the NA values.
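
Combining the two ideas, a version that keeps unique values and drops NA might look like the sketch below. It uses across rather than summarise_each, on the assumption that a recent dplyr is installed, and it rebuilds the example data as `df` (with assumed column types).

```r
library(dplyr)

# Example data from the question (column types are an assumption).
df <- data.frame(
  Sample = c("A", "A", "B", "C", "C", "C"),
  group  = c(1, 2, 1, 1, 2, 3),
  Gene1  = c("a", "b", NA, "a", "b", "c"),
  Gene2  = c(NA, NA, "c", NA, NA, NA),
  Gene3  = c(NA, NA, NA, "d", "e", NA),
  stringsAsFactors = FALSE
)

# Drop NA first, then de-duplicate, then join with ", " via toString().
# Groups with no non-missing values produce an empty string.
res <- df %>%
  group_by(Sample) %>%
  summarise(across(everything(), function(x) toString(unique(na.omit(x)))))

print(res)
```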
