如何pivot_wider/melt Excel工作表中具有多个标题行的值矩阵?

nfs0ujit  于 2023-06-19  发布在  其他
关注(0)|答案(3)|浏览(117)

我有很多Excel表格,每个表格都有大约120个实验室结果和150个病人的矩阵。然而,实验室结果存储为从第4行开始的行,并且患者与额外的标题列数据一起存储在第1至第3行中。Excel工作表看起来像这样:

我想读取它并将其转换为更长/规范化的 Dataframe ,基本上就像这个模型:

我认为它可以用pivot_longermelt处理,但左上角的空单元格会引起麻烦,即实验室结果的列标题从第4行开始,患者列从C列开始。
顺便说一句,患者列名总是从D列的位置开始,但显然,实际名称EXA_2665EXA_2707等事先并不知道,它们将是新发布的患者ID。
这是我目前得到的代码。

library(tidyverse)
library(readxl)
library(dplyr)
library(tidyr)

##xls_data <- read_excel(path = "lab2_exafinal.xlsx", sheet = "Labata")
## TEST hardcoded data for testing purposes
xls_data2 <- read.table(text = "
...1,...2,Patient-Id,EXA_2665,EXA_2707,EXA_2971
,,Tube,87019054,87065976,87093024
,,Tube-Pos,C12,D01,D02
LabId,LabDescr,WeightFact,,,
9,HbA1c,1.00,1.784,0.419,1.045
37,HDL-Chol,1.00,0.684,2.172,1.954
38,LDL-Chol,1.00,2.438,1.995,0.806
41,Glucose,1.00,1.366,1.47,0.978
", header = TRUE, sep = ",", na.strings = "")  

library(reshape)
df_melt <- melt(xls_data, id.vars=c("...1", "...2"))
# error: names do not match previous names

pivoted_data <- pivot_longer(
  xls_data,
  cols = names(xls_data[3,]),
  names_to = "patient",
  values_to = "value" 
  )
#no error but not quite the desired result

编辑:@Jon-Spring建议使用dput共享read_excel函数阅读Excel文件所产生的确切格式/数据结构(顺便说一下,它也将1.045读取为1.044999等,但现在这是一个单独的问题)顺便说一下,还注意到患者标签名称可能包含-字符。

#xls_data <- read_excel(path = "lab2_exafinal.xlsx", sheet = "Labata")
#dput(xls_data)
xls_data <- structure(
  list(
    ...1 = c(NA, NA, "LabId", "9", "37", "38", "41"),
    ...2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol", "LDL-Chol", "Glucose"),
    ...3 = c("Tube", "Tube-Pos", "WeightFact", "1,00", "1,00", "1,00", "1,00"),
    EXA_2665 = c("87019054", "C12", NA, "1.784", "0.68400000000000005", "2.4380000000000002", "1.3660000000000001"),
    EXA_2707 = c("87065976", "D01", NA, "0.41899999999999998", "2.1720000000000002", "1.9950000000000001", "1.47"),
    EXA_2971 = c("87093024", "D02", NA, "1.0449999999999999", "1.954", "0.80600000000000005", "0.97799999999999998")
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -7L)
  )

当我使用readxl::read_excel读取Excel工作表时,第一列将是...1...2。因此,当使用melt时,我 * 认为 * 这些应该是id.vars参数(?),但它给出了一个错误信息。pivot_longer的结果看起来更接近我想要做的,但问题是有额外的标题行和我不知道如何使用(或只是删除?))那些。
我可以想到其他的方法来做到这一点,但我不知道 * 如何 * 在Rstudio中做到这一点:

  • 尝试读取两个独立的 Dataframe ,患者(C1..F3)和实验室值(A4..F8),然后合并。
  • 或者首先将患者标题名称EXA_2665等复制到空行4,然后仅重新解释A4..F8区域

还有什么其他的办法来解决这个问题吗?

sxpgvts3

sxpgvts31#

unheadr包提供了一些工具,可以使用这些特性来重塑表,这些特性可能会使表对人类读者来说更加紧凑,但用代码解析起来更加棘手。
在这里,沿着你的第二个想法,我们可以合并四行标题(第一行作为标题,接下来的三行作为数据行)。我指定-作为分隔符,因为_已经在PatientId中作为非分隔符出现。
然后,我们可以将这些组合的标题pivot_longer到每个值旁边的多个names列中。

library(unheadr)
xls_data2 |>
  # take the first three rows and combine them with the header
  mash_colnames(n_name_rows = 3, sep = "-") |>
  # reshape everything besides the first three columns, separating the 
  # combined headers into new columns named
  # 1) the third original header
  # 2-4) the first three data rows in the third column
  pivot_longer(-c(1:3), names_sep = "-", 
               names_to = c(colnames(xls_data2)[3], xls_data2[1:3, 3])) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3) 
  # dplyr::rename to avoid conflict with `reshape::rename`

结果

# A tibble: 12 × 8
   LabId LabDescr WeightFact2 PatientId Tube     TubePos WeightFact value
   <chr> <chr>    <chr>       <chr>     <chr>    <chr>   <chr>      <chr>
 1 9     HbA1c    1.00        EXA_2665  87019054 C12     NA         1.784
 2 9     HbA1c    1.00        EXA_2707  87065976 D01     NA         0.419
 3 9     HbA1c    1.00        EXA_2971  87093024 D02     NA         1.045
 4 37    HDL-Chol 1.00        EXA_2665  87019054 C12     NA         0.684
 5 37    HDL-Chol 1.00        EXA_2707  87065976 D01     NA         2.172
 6 37    HDL-Chol 1.00        EXA_2971  87093024 D02     NA         1.954
 7 38    LDL-Chol 1.00        EXA_2665  87019054 C12     NA         2.438
 8 38    LDL-Chol 1.00        EXA_2707  87065976 D01     NA         1.995
 9 38    LDL-Chol 1.00        EXA_2971  87093024 D02     NA         0.806
10 41    Glucose  1.00        EXA_2665  87019054 C12     NA         1.366
11 41    Glucose  1.00        EXA_2707  87065976 D01     NA         1.47 
12 41    Glucose  1.00        EXA_2971  87093024 D02     NA         0.978

编辑dput(xls_data)示例:

xls_data |>
  mash_colnames(n_name_rows = 3, sep = "@") |>
  pivot_longer(-c(1:3), names_sep = "@", 
               names_to = c(colnames(xls_data2)[3], xls_data2[1:3, 3])) |>
  dplyr::rename(LabId = 1, LabDescr = 2, WeightFact2 = 3) |>
  dplyr::mutate(across(value, as.numeric))

结果

# A tibble: 12 × 8
   LabId LabDescr WeightFact2 PatientId Tube     TubePos WeightFact value
   <chr> <chr>    <chr>       <chr>     <chr>    <chr>   <chr>      <dbl>
 1 9     HbA1c    1,00        EXA_2665  87019054 C12     NA         1.78 
 2 9     HbA1c    1,00        EXA_2707  87065976 D01     NA         0.419
 3 9     HbA1c    1,00        EXA_2971  87093024 D02     NA         1.04 
 4 37    HDL-Chol 1,00        EXA_2665  87019054 C12     NA         0.684
 5 37    HDL-Chol 1,00        EXA_2707  87065976 D01     NA         2.17 
 6 37    HDL-Chol 1,00        EXA_2971  87093024 D02     NA         1.95 
 7 38    LDL-Chol 1,00        EXA_2665  87019054 C12     NA         2.44 
 8 38    LDL-Chol 1,00        EXA_2707  87065976 D01     NA         2.00 
 9 38    LDL-Chol 1,00        EXA_2971  87093024 D02     NA         0.806
10 41    Glucose  1,00        EXA_2665  87019054 C12     NA         1.37 
11 41    Glucose  1,00        EXA_2707  87065976 D01     NA         1.47 
12 41    Glucose  1,00        EXA_2971  87093024 D02     NA         0.978
icnyk63a

icnyk63a2#

如果你喜欢它,下面有一个powerquery方法。它是动态的,可以容纳任意数量的患者ID数据列
把数据带进来,不带任何标题

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Keep Rows" = Table.FirstN(Table.RemoveColumns(Source,{"Column1", "Column2"}),3),
#"Promoted Headers" = Table.PromoteHeaders(Table.Transpose(#"Keep Rows"), [PromoteAllScalars=true]),
#"Added Custom" = Table.AddColumn(#"Promoted Headers", "Custom", each {1 .. Table.RowCount(Source)-4}),
Part1 = Table.ExpandListColumn(#"Added Custom", "Custom"),
#"Promoted Headers2" = Table.PromoteHeaders(Table.Skip(Source,3), [PromoteAllScalars=true]),
leading=3, groupsof=1,
#"Added Custom1" = Table.AddColumn(#"Promoted Headers2", "Custom", each List.Split( List.RemoveFirstN(Record.ToList( _),leading), groupsof) ),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "Custom0", each Text.Combine(List.Transform(List.FirstN(Record.ToList(_),leading), each Text.From(_)),"|")),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom2",{"Custom0", "Custom"}),
#"Expanded Custom" = Table.ExpandListColumn( #"Removed Other Columns", "Custom"),
#"Extracted Values" = Table.TransformColumns(#"Expanded Custom", {"Custom", each Text.Combine(List.Transform(_, Text.From), "|"), type text}),
#"Merged Columns" = Table.CombineColumns(#"Extracted Values",{"Custom0", "Custom"},Combiner.CombineTextByDelimiter("|", QuoteStyle.None),"Custom"),
#"Split Column by Delimiter" = Table.SplitColumn(#"Merged Columns", "Custom", Splitter.SplitTextByDelimiter("|", QuoteStyle.Csv), List.FirstN(Table.ColumnNames(Source),leading+groupsof)),
Part2 = Table.RenameColumns(#"Split Column by Delimiter",{{"Column1", "LabID"}, {"Column2", "LabDescr"}, {"Column3", "WeightFact"}, {"Column4", "Value"}}),
Combine=Table.FromColumns(Table.ToColumns(Part2) & Table.ToColumns(Part1),Table.ColumnNames(Part2)&Table.ColumnNames(Part1)),
#"Removed Columns2" = Table.RemoveColumns(Combine,{"Custom"})
in #"Removed Columns2"

t40tm48m

t40tm48m3#

如果你想要一个“纯”tidyverse方法,你可以使用以下方法:

library(dplyr)
library(tidyr)

xls_data <- structure(
   list(
      ...1 = c(NA, NA, "LabId", "9", "37", "38", "41"),
      ...2 = c(NA, NA, "LabDescr", "HbA1c", "HDL-Chol", "LDL-Chol", "Glucose"),
      ...3 = c("Tube", "TubePos", "WeightFact", "1,00", "1,00", "1,00", "1,00"),
      EXA_2665 = c("87019054", "C12", NA, "1.784", "0.68400000000000005", "2.4380000000000002", "1.3660000000000001"),
      EXA_2707 = c("87065976", "D01", NA, "0.41899999999999998", "2.1720000000000002", "1.9950000000000001", "1.47"),
      EXA_2971 = c("87093024", "D02", NA, "1.0449999999999999", "1.954", "0.80600000000000005", "0.97799999999999998")
   ),
   class = c("tbl_df", "tbl", "data.frame"),
   row.names = c(NA, -7L)
)

# 1. Get start of the data
data_start <- xls_data %>%
   select(first_col = 1) %>%
   summarize(start = which(!is.na(first_col))[1L] + 1L) %>% 
   pull(start)

# 2. Extract the 2 data blocks

values <- xls_data %>% 
   slice(data_start:n()) %>% 
   set_names(
      c(
         xls_data %>% 
            slice(data_start - 1L) %>% 
            select(1:3) %>% 
            unlist(),
         names(.)[-(1:3)]
      )
   ) %>% 
   mutate(across(-(2:3), as.numeric)) 

meta <-  xls_data %>% 
   slice(1:(data_start - 2L)) %>% 
   select(-(1:2)) %>% 
   set_names(c("PatientId", names(.)[-1L])) %>% 
   bind_rows(names(.) %>% 
                as.list() %>% 
                set_names(.) %>% 
                as.data.frame(),
             .
   ) %>% 
   t() %>% 
   `colnames<-`(.[1L, ]) %>% 
   as_tibble() %>% 
   slice(-1L)   

# 3. Reshape and join
values %>% 
   pivot_longer(-(1:3)) %>% 
   inner_join(meta,
              c(name = "PatientId")) %>% 
   rename(PatientId = name) %>% 
   select(LabId:PatientId, Tube:TubePos, Value = value) %>% 
   arrange(PatientId, LabId)

诚然,比John的优秀提案长得多的代码,但一个很好的(和冗长的)练习来掌握tidyverse;)

相关问题