R: Identifying redundant and unique values across a set of columns

yhived7q  posted on 2022-12-30 in Other

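I want to determine when the values in a set of columns are redundant and record this in a new column Multi?, where 0 means only one distinct value was seen and 1 means more than one was seen. The value "Unspecified" is a special case: when it appears alongside other values, I want the code to ignore it and evaluate the redundancy of the remaining values; when "Unspecified" is the only value in the set of columns, I want Multi? to record "Unspecified".
Note that these four columns are only part of a larger database with many more columns.
To illustrate what I mean, here are example input and output: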

  headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1      Intracerebral      Intracerebral      Intracerebral               <NA>
2      Intracerebral       Subarachnoid               <NA>           Subdural
3        Unspecified      Intracerebral           Subdural      Intracerebral
4        Unspecified               <NA>               <NA>               <NA>
5               <NA>               <NA>               <NA>               <NA>

If a row's Multi? value is 1, I would also like to record the number of unique values in a new column Number.

  Multi?      Number
1 0           1
2 1           3
3 1           2
4 Unspecified 1
5 NA          NA

Answer 1 (5uzkadbs)

This is rather awkward to do; I would recommend not mixing numbers and character strings in a single column.
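To illustrate why mixing the two is awkward, here is a minimal base R sketch. The names `mixed` and `as_numbers` are made up for this illustration and do not come from the question.

```r
# A vector (and therefore a data frame column) holds a single type,
# so combining numbers with a string coerces everything to character.
mixed <- c(0, 1, "Unspecified")
class(mixed)            # "character"

# Numeric work on the flags now needs an explicit conversion, which
# yields NA (with a warning, suppressed here) for the non-numeric entry.
as_numbers <- suppressWarnings(as.numeric(mixed))
is.na(as_numbers)       # FALSE FALSE TRUE
```

This is why the column produced further down holds the strings "0" and "1" rather than the numbers 0 and 1.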

library(dplyr)

# Per row: count distinct non-NA values and flag rows that contain "Unspecified"
data %>% 
  rowwise() %>% 
  summarise(
    number = n_distinct(
      c_across(headbleed_type_dx1:headbleed_type_dx4), 
      na.rm = TRUE),
    unspec = coalesce(
      any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"), 
      FALSE)) %>% 
  mutate(
    # Do not count "Unspecified" when other values are present; 0 values becomes NA
    number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
    multi = case_when(number == 1 & unspec ~ "Unspecified",
                      number2 == 1 ~ "0",
                      is.na(number2) ~ NA_character_,
                      TRUE ~ "1"),
    .keep = "none") %>% 
  select(number = number2, multi)

This returns

# A tibble: 6 × 2
  number multi      
   <int> <chr>      
1      1 0          
2      3 1          
3      2 1          
4      1 Unspecified
5     NA NA         
6      1 0
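To avoid mixing numbers and strings in one column, an alternative is to keep the count, the "multiple values" flag, and the "only Unspecified" case in three separately typed columns. The sketch below uses base R only. `dx_cols`, `summarise_dx`, `result`, and the output column names are hypothetical names chosen for this illustration, and the data frame simply re-types the example values from this answer without the readr attributes.

```r
# Example values copied from the answer's data, as a plain data frame.
data <- data.frame(
  headbleed_type_dx1 = c("Intracerebral", "Intracerebral", "Unspecified",
                         "Unspecified", NA, "Intracerebral"),
  headbleed_type_dx2 = c("Intracerebral", "Subarachnoid", "Intracerebral",
                         NA, NA, "Unspecified"),
  headbleed_type_dx3 = c("Intracerebral", NA, "Subdural", NA, NA,
                         "Intracerebral"),
  headbleed_type_dx4 = c(NA, "Subdural", "Intracerebral", NA, NA, NA),
  stringsAsFactors = FALSE
)

dx_cols <- paste0("headbleed_type_dx", 1:4)

# Hypothetical helper: summarise the diagnosis values of one row.
summarise_dx <- function(x) {
  seen <- unique(x[!is.na(x)])               # distinct non-missing values
  specific <- setdiff(seen, "Unspecified")   # ignore the placeholder value
  only_unspecified <- length(seen) > 0 && length(specific) == 0
  number <- if (length(seen) == 0) {
    NA_integer_                              # nothing recorded in this row
  } else if (only_unspecified) {
    1L                                       # "Unspecified" is the only value
  } else {
    length(specific)
  }
  data.frame(number = number,
             multi = number > 1L,            # logical; NA when number is NA
             only_unspecified = only_unspecified)
}

rows <- lapply(seq_len(nrow(data)),
               function(i) summarise_dx(unlist(data[i, dx_cols])))
result <- do.call(rbind, rows)
result
```

With this layout, `number` stays an integer and `multi` stays a logical, so both can be filtered or summed directly. Rows that contain only "Unspecified" are identified by `only_unspecified` instead of by a string stored in `multi`.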

Data

data <- structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral", 
"Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral", 
"Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral", 
NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA, 
"Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
    row = 1:4, col = c(NA_character_, NA_character_, NA_character_, 
    NA_character_), expected = c("4 columns", "4 columns", "4 columns", 
    "4 columns"), actual = c("5 columns", "5 columns", "5 columns", 
    "5 columns"), file = c("literal data", "literal data", "literal data", 
    "literal data")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))
