使用melt或pivot_longer对2行变量

cl25kdpy  于 2023-05-20  发布在  其他
关注(0)|答案(2)|浏览(150)

我有这个输出,其中包含一个基因家族中的值(例如family01,family21),但它也包含一些与这些站点相关联的元数据(环境类型)。我知道数据并不像我希望的那样整洁,所以我想知道是否有一种优雅的方法可以使用pivot_longer以长格式整理这些数据,以便稍后制作热图。
这里是数据集。

Sample  bel buc mal man pen poc tej
Environment C_3 C_1 C_3 C_3 C_2 C_1 C_3
family01    1.962   20.790  0.000   0.000   0.000   0.000   1.962
family03    0.000   3.150   0.000   0.000   152.614 0.089   0.000
family08    4.482   12.603  0.168   0.460   0.000   2.917   4.482
family13    0.000   1.697   0.000   0.000   169.841 0.000   0.000
family17    0.462   10.689  0.000   0.000   0.000   1.387   0.462
family21    0.000   0.410   0.000   0.000   122.959 0.000   0.000

我搜索了函数pivot_longer的帮助,有几个问题涉及多个 * 列 * 而不是 * 行 *,也不是关于“同时”将函数应用于两组变量。首先,我尝试了long <- pivot_longer(data = output.txt, cols = -c(Family), names_to = "site", values_to = "rpkm"),但我得到了这个结果

# A tibble: 49 x 3
   Sample      site  rpkm  
   <chr>       <chr> <chr> 
 1 Environment bel   C_3    
 2 Environment buc   C_1    
 3 Environment mal   C_3    
 4 Environment man   C_3    
 5 Environment pen   C_2    
 6 Environment poc   C_1    
 7 Environment tej   C_3    
 8 family1     bel   1.962 
 9 family1     buc   20.790
10 family1     mal   0.000

这是预期的格式

Family  site  Env  rpkm
  family1   bel  3   1.962      
  family1   buc  1   20.790 
  family1   mal  3   0.000

我想如果第一排不在的话

output.txt %>%
  pivot_longer(
    -Environment,
    names_to = c(".value", "Env"),
    names_sep = "_")

为了得到一个列'环境'与环境类型存储为一个变量,然后我需要'另一个'转换为长格式。
我想象两个连续的pivot_longer回合可以解决这个问题,或者使用其他策略与melt(例如现场和现场类型作为前2行)。这是一个很长的表,所以我想避免排序或手动编辑它。

dput的输出:

structure(list(Sample = c("Environment", "family01", "family03", 
"family08", "family13", "family17", "family21"), bel = c("C__3", 
"1.962", "0", "4.482", "0", "0.462", "0"), buc = c("C_1", "20.79", 
"3.15", "12.603", "1.697", "10.689", "0.41"), mal = c("C_3", 
"0", "0", "0.168", "0", "0", "0"), man = c("C_3", "0", "0", "0.46", 
"0", "0", "0"), pen = c("C_2", "0", "152.614", "0", "169.841", 
"0", "122.959"), poc = c("C_1", "0", "0.089", "2.917", "0", "1.387", 
"0"), tej = c("C_3", "1.962", "0", "4.482", "0", "0.462", "0"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))

addendum如果有人想使用这个解决方案,我不得不使用一个解决方案,保存(xlsx或csv),再次加载tibble,并指定值(rpkm)是数字。

lrpiutwd

lrpiutwd1#

这里有一个选项,我们可以通过paste ing更改除第一个以外的列名(str_c)删除C_后的第一行元素,然后删除第一行(slice(-1)),将列的type转换为type.convert,将“wide”整形为“long”(pivot_longer),separate通过在小写字母和数字之间的边界处拆分,将“name”列分为两个((?<=[a-z])(?=\\d)- regex lookaround)

library(dplyr)
library(tidyr)
library(stringr)
output.txt %>% 
    rename_at(-1, ~ str_c(., unlist(output.txt %>% 
                                       slice(1) %>% 
                                       select(-Sample) %>% 
                                       unlist %>% 
                                       str_remove('C_+')))) %>% 
    slice(-1) %>%
    type.convert(as.is = TRUE) %>%
    pivot_longer(cols = -Sample, values_to = 'rpkm') %>% 
    separate(name, into = c('site', 'Env'), sep='(?<=[a-z])(?=\\d)')
# A tibble: 42 x 4
#   Sample   site  Env    rpkm
#   <chr>    <chr> <chr> <dbl>
# 1 family01 bel   3      1.96
# 2 family01 buc   1     20.8 
# 3 family01 mal   3      0   
# 4 family01 man   3      0   
# 5 family01 pen   2      0   
# 6 family01 poc   1      0   
# 7 family01 tej   3      1.96
# 8 family03 bel   3      0   
# 9 family03 buc   1      3.15
#10 family03 mal   3      0   
# … with 32 more rows
wwtsj6pe

wwtsj6pe2#

这里有另一个解决方案类似于你的想法。基本上,我正在将第一行重塑为更长的格式,并对其余行执行相同的操作;然后我在site列上加入这两个。

library(dplyr)
library(tidyr)
library(stringr)

output.txt %>% 
  slice(1) %>% 
  pivot_longer(-Sample, names_to = "site", values_to = "Env") %>% 
  mutate(Env = str_remove(Env, 'C_+'))  %>% 
  select(-Sample) %>% 
 right_join(., 
             {output.txt %>% 
                  slice(-1) %>% 
                  pivot_longer(-Sample, names_to = "site", values_to = "rpkm")},
            by = "site") %>% 
 select(Family = Sample, site, Env, rpkm)

#> # A tibble: 42 x 4
#>    Family   site  Env   rpkm 
#>    <chr>    <chr> <chr> <chr>
#>  1 family01 bel   3     1.962
#>  2 family01 buc   1     20.79
#>  3 family01 mal   3     0    
#>  4 family01 man   3     0    
#>  5 family01 pen   2     0    
#>  6 family01 poc   1     0    
#>  7 family01 tej   3     1.962
#>  8 family03 bel   3     0    
#>  9 family03 buc   1     3.15 
#> 10 family03 mal   3     0    
#> # ... with 32 more rows

相关问题