R - Data cleaning and mutate errors

wfveoks0  posted on 2023-03-05 in Other

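I have already reviewed "Error: Problem with mutate() column (...) must be size 15 or 1, not 17192", "How to drop columns with column names that contain specific string?", "Remove columns that contain a specific word", and related error diagnostics.
I have a large dataset of viral data for different species in different areas. Sample data are shown below: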

Country    ..2  Area    Site    ID      Species Sample    Original Sample/Specimen #
<chr>     <lgl> <chr>   <chr>   <chr>   <chr>   <chr>    <chr>
Tanzania    NA  UMNP    UMNPhq  AATPH   PG     Feces    AATPHF2 
Tanzania    NA  UMNP    UMNPhq  AATPI   PG     Feces    AATPIF2 
Tanzania    NA  UMNP    UMNPhq  AATPJ   PG     Feces    AATPJF2 
Tanzania    NA  UMNP    UMNPhq  ATTPK   PG     Feces    ATTPKF2 
Tanzania    NA  UMNP    UMNPhq  AATPL   PG     Feces    AATPLF2 

Filovirus (MOD) PCR  Date (Filo MOD)
<chr>                <date>
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Indeterminant        2015-03-16
Negative             2015-03-16

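I am trying to recode the virus status for each sample ID as positive or negative (only the filovirus column is shown here, but there are many such virus columns, so please help me code this more generally).
Code I have tried: first, I subset the data to include only one specific area.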

viral <- subset(data, Area %in% "UMNP")

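Here I drop the columns I do not need and am then able to get the infection status, but all the other information about each sample is converted to NA, which causes further errors when I try to keep those values.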

viralres <- viral %>%
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"))) %>%
  mutate_if(is.character, ~case_when(. == "Indeterminant" ~ "0",
                                     . == "Negative" ~ "0",
                                     . == "Positive" ~ "1"))

Data output (dput):

structure(list(Country = c("Tanzania", "Tanzania", "Tanzania", 
"Tanzania", "Tanzania"), ...2 = c(NA, NA, NA, NA, NA), Area = c("UMNP", 
"UMNP", "UMNP", "UMNP", "UMNP"), Site = c("UMNPhq", "UMNPhq", 
"UMNPhq", "UMNPhq", "UMNPhq"), `Animal ID` = c("AATPH", "AATPI", 
"AATPJ", "ATTPK", "AATPL"), Species = c("Procolobus gordonorum", 
"Procolobus gordonorum", "Procolobus gordonorum", "Procolobus gordonorum", 
"Procolobus gordonorum"), `Sample Type` = c("Feces", "Feces", 
"Feces", "Feces", "Feces"), `Original Sample/Specimen #` = c("AATPHF2", 
"AATPIF2", "AATPJF2", "ATTPKF2", "AATPLF2"), `Filovirus (MOD) PCR` = c("Indeterminant", 
"Indeterminant", "Indeterminant", "Indeterminant", "Negative"
), `Date (Filo MOD)` = structure(c(16510, 16510, 16510, 16510, 
16510), class = "Date")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))
piah890a  #1

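Using mutate_if(is.character, ...) changes every character column. Because case_when() returns NA for any value that matches none of its conditions, every other character column (Country, Area, Site and so on) ends up as NA. It looks like the only column you are trying to change is `Filovirus (MOD) PCR`, so you can change the command to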

viral %>%
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"))) %>%
  mutate(across(`Filovirus (MOD) PCR`, ~case_when(. == "Indeterminant" ~ "0",
                                                  . == "Negative" ~ "0",
                                                  . == "Positive" ~ "1")))

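This is the smallest possible change: only that one column is modified. Alternatively, you can use case_match to recode that single column more directly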

viral %>%
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"))) %>%
  mutate(`Filovirus (MOD) PCR` = case_match(`Filovirus (MOD) PCR`,
                                            "Indeterminant" ~ "0",
                                            "Negative" ~ "0",
                                            "Positive" ~ "1"))

Note that case_match was introduced in dplyr 1.1.0.
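Since the question asks for something more general than a single column, here is a minimal sketch of one way to extend this. It assumes that every result column has a name ending in "PCR"; that naming pattern is only a guess from the sample data, and `Coronavirus PCR` is a hypothetical second column added for illustration. It uses case_when rather than case_match so that it also runs on dplyr versions older than 1.1.0.

```r
library(dplyr)

# Toy data modelled on the question. `Coronavirus PCR` is a hypothetical
# second result column, included only to show several columns being recoded.
viral <- tibble(
  `Animal ID` = c("AATPH", "AATPI", "AATPL"),
  `Sample Type` = c("Feces", "Feces", "Feces"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Positive", "Negative"),
  `Coronavirus PCR` = c("Negative", "Negative", "Positive")
)

# Recode every column whose name ends in "PCR" (an assumed naming pattern).
# Columns that are not selected are left exactly as they were.
viralres <- viral %>%
  mutate(across(
    ends_with("PCR"),
    ~ case_when(
      .x %in% c("Indeterminant", "Negative") ~ "0",
      .x == "Positive" ~ "1"
    )
  ))

viralres
```

Any value other than the three listed still becomes NA in the recoded columns, so it may be worth checking the distinct values in each result column first.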

t1rydlwq  #2

Use mutate_at instead of mutate_if

viralres <- viral %>%
  dplyr::select(-matches(c('Performed by ()', 'performed by', 'Date of', '1Performed by', 'Performed by', "Date ()", "...2"))) %>%
  mutate_at(c("Filovirus (MOD) PCR"), ~case_when(. == "Indeterminant" ~ "0",
                                                 . == "Negative" ~ "0",
                                                 . == "Positive" ~ "1"))

In the first argument of mutate_at, list all the virus result columns to recode (Filovirus, etc.) in a single character vector.
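As a rough illustration of passing several columns at once, the sketch below recodes two columns with mutate_at and then does the same thing with across(all_of(...)), which supersedes the scoped verbs in dplyr 1.0.0 and later. The vector `virus_cols`, the helper `recode_status`, and the `Coronavirus PCR` column are hypothetical names introduced here, not part of the original data.

```r
library(dplyr)

# Toy data; `Coronavirus PCR` is a hypothetical extra result column.
viral <- tibble(
  Site = c("UMNPhq", "UMNPhq", "UMNPhq"),
  `Filovirus (MOD) PCR` = c("Indeterminant", "Negative", "Positive"),
  `Coronavirus PCR` = c("Positive", "Negative", "Negative")
)

# Hypothetical list of result columns; replace with the real column names.
virus_cols <- c("Filovirus (MOD) PCR", "Coronavirus PCR")

# Hypothetical helper: indeterminate and negative results become "0",
# positive results become "1", anything else becomes NA.
recode_status <- function(x) {
  case_when(
    x %in% c("Indeterminant", "Negative") ~ "0",
    x == "Positive" ~ "1"
  )
}

# Scoped verb, as suggested in this answer.
res_at <- viral %>% mutate_at(virus_cols, recode_status)

# Equivalent form with across(); all_of() errors if a listed column is missing.
res_across <- viral %>% mutate(across(all_of(virus_cols), recode_status))
```

Keeping the column names in one vector means a new virus column only has to be added in one place.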
