R: How to split multiple answers held in one column into additional columns, across several columns

bmp9r5qi  posted on 2023-03-27 in Other

1. Data

I have survey data:

dat <- structure(list(ID = c(4, 5), Start_time = structure(c(1676454186, 
1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    End_time = structure(c(1676454352, 1676454642), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC"), `want_to_change Mult answ` = c("Yes (for the environment), because it provided a starting point to collectively do something about energy consumption.;", 
    "Yes (because of the gas crisis), because it provided a starting point to collectively do something. ;"
    ), actually_changed = c("Yes, I tried to use less energy in the office.", 
    "No, not at all."), `control Mult answ` = c("We / I can control the lights.;Closing/opening doors and windows.;", 
    "We / I can control the lights.;Closing/opening doors and windows.;"), `measures_taken Mult answ` = c("Yes, I checked for lights that were not turned off.; Yes, went home early", 
    "Yes, I checked for lights that were not turned off.;")), row.names = c(NA, 
-2L), class = c("data.table", 
"data.frame"))


2. Data structure

Some columns can hold multiple answers. These columns have "Mult answ" in their names. For example, see row 1, column 6 (dat[1,6]).

> dat[1,6]
                                                    control Mult answ
1: We / I can control the lights.;Closing/opening doors and windows.;
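
As a minimal base R sketch of the structure described above: the multi-answer columns can be found by name, and a single cell can be split on the semicolon. The `toy` data frame and the helper `split_answers` are hypothetical names used only for illustration, and the answer texts are shortened.

```r
# Toy data shaped like the survey columns (values are illustrative only)
toy <- data.frame(
  ID = c(4, 5),
  `control Mult answ` = c("Lights.;Doors and windows.;", "Lights.;"),
  actually_changed = c("Yes", "No"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns whose names contain the literal text "Mult answ"
multi_cols <- grep("Mult answ", names(toy), fixed = TRUE, value = TRUE)

# Hypothetical helper: split one cell on ";", trim surrounding
# whitespace, and drop any empty pieces
split_answers <- function(cell) {
  parts <- trimws(strsplit(cell, ";", fixed = TRUE)[[1]])
  parts[nzchar(parts)]
}

answers_row1 <- split_answers(toy[["control Mult answ"]][1])
```

Trimming matters for the real data, because an entry such as "; Yes, went home early" would otherwise keep its leading space and be counted as a different answer.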

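3. Question

I want to write code that:
1. Changes every answer that occurs only once to Other (because there are many custom answers).
2. Creates a separate column for each answer option, with a generic suffix.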
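
The two requirements can be sketched in base R on a small vector. This is only an illustration under assumed inputs: the `answers` vector, the threshold `min_count`, and the example column name are not taken from the real data.

```r
# Illustrative answers, already split into one value per element
answers <- c("Lights.", "Doors.", "Lights.", "My own custom answer", "Doors.")

# Count how often each distinct answer occurs
counts <- table(answers)

# Answers seen fewer than `min_count` times are treated as custom answers
min_count <- 2
rare <- names(counts)[counts < min_count]

# Requirement 1: replace the rare answers with "Other"
lumped <- ifelse(answers %in% rare, "Other", answers)

# Requirement 2: one column name per remaining answer option,
# built from the question name plus a generic letter suffix
options_left <- sort(unique(lumped[lumped != "Other"]))
new_names <- paste("control Mult answ", LETTERS[seq_along(options_left)])
```

Here `lumped` keeps "Lights." and "Doors." and turns the single custom answer into "Other", while `new_names` holds "control Mult answ A" and "control Mult answ B".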

4. What I have tried

I thought I would first select the columns that have multiple answers:

# Get columns with more than one answer (requires the dplyr package)
temp <- dplyr::select(dat, dplyr::contains("Mult answ"))
cols_with_more_answers <- names(temp)

Then I wanted to split the columns on the semicolon (before counting the answers and changing the ones that occur only once to Other).

# Separate columns 
tidyr::separate(data.frame(text = dat), text, into = c("A", "B", "C"), sep = ";", fill = "right", extra = "drop")

How should I continue?

5. Desired output

dat <- structure(list(ID = c(4, 5), 
                       Start_time = structure(c(1676454186, 1676454173), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                       End_time = structure(c(1676454352, 1676454642), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                       `want_to_change Mult answ` = c("Other", "Other"), 
                       actually_changed = c("No, not at all.", "Yes, I tried to use less energy in the office."), 
                       `control Mult answ A` = c("We / I can control the lights.", "We / I can control the lights."), 
                       `control Mult answ B` = c("Closing/opening doors and windows", "Closing/opening doors and windows"), 
                       `measures_taken Mult answ A` = c("Yes, I checked for lights that were not turned off.", "Yes, I checked for lights that were not turned off."), 
                       `measures_taken Mult answ B` = c(NA, "Yes, went home early")), 
                  row.names = c(NA, -2L), 
                  class = c("data.table", "data.frame"))

83qze16e


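You can do something like this. (Converting the answers to letters, and keeping that stable in case you have more than 26 answers, is a bit tricky, but I found a way around it.)
I left some comments in the code. In short:

  • Pivot the multi-answer questions into rows and split the answers with separate_rows.
  • At that point you can replace the answers that occur only once with "Other" using forcats::fct_lump_min.
  • Then you can create a new column that converts the answers to letters. For this I had to write a function, values2letters, which calls expand_letters. The first function simply recodes the answers to letters. The second one creates the letters. If you have more than 26 answers, single letters are not enough, so this function combines letters.
  • Finally, you spread the answers over the combination of their own question and the corresponding letter to get the expected result.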
library(dplyr)
library(tidyr)

expand_letters <- function(l){
  
  # how many times letters must repeat?
  x <- ceiling(log(l, 26))
  
  # correct in case of zero
  x <- max(x,1)

  # repeat the letters
  x <- rep(list(LETTERS), x)
  
  # get combinations
  x <- expand.grid(x)
  
  # collapse letters
  x <- do.call(paste0, rev(x))
  
  # return only the needed ones
  x[seq_len(l)]
  
}

values2letters <- function(x){
  
  x <- factor(x)
  levels <- levels(x)
  l <- length(levels)
  new_levels <- expand_letters(l)
  recode <- setNames(levels, new_levels)
  as.character(forcats::fct_recode(x, !!!recode))
  
}

dat %>%
  
  # pivot only multi answers
  pivot_longer(ends_with("Mult answ")) %>% 
  
  # separate by ; in multiple lines
  separate_rows(value, sep = ";") %>% 
  
  # remove empty rows (created automatically because each line ends with ;)
  filter(value != "") %>% 

  # change to Other if the answer appears fewer than 2 times (counted across all questions)
  mutate(value = as.character(forcats::fct_lump_min(value, 2))) %>%
  
  # recode to letters by question
  group_by(name) %>% 
  mutate(valueLetters = values2letters(value)) %>% 
  ungroup() %>% 
  
  # distinct in case you have multiple "Other"
  distinct() %>%

  # spread values
  pivot_wider(names_from = c(name, valueLetters), values_from = value, names_sep = " ")
#> # A tibble: 2 x 9
#>      ID Start_time          End_time            actual~1 want_~2 contr~3 contr~4
#>   <dbl> <dttm>              <dttm>              <chr>    <chr>   <chr>   <chr>  
#> 1     4 2023-02-15 09:43:06 2023-02-15 09:45:52 Yes, I ~ Other   We / I~ Closin~
#> 2     5 2023-02-15 09:42:53 2023-02-15 09:50:42 No, not~ Other   We / I~ Closin~
#> # ... with 2 more variables: `measures_taken Mult answ B` <chr>,
#> #   `measures_taken Mult answ A` <chr>, and abbreviated variable names
#> #   1: actually_changed, 2: `want_to_change Mult answ A`,
#> #   3: `control Mult answ B`, 4: `control Mult answ A`
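
To see what suffixes the answer's helper produces, here is a small check. The function body is the answer's `expand_letters`, condensed so that the snippet runs on its own; the variable names `few` and `many` are only for illustration.

```r
# Same logic as expand_letters in the answer, condensed
expand_letters <- function(l) {
  # number of letter positions needed, at least one
  x <- max(ceiling(log(l, 26)), 1)
  # all combinations of that many letters; the first column varies fastest
  x <- expand.grid(rep(list(LETTERS), x))
  # paste with the slowest-varying column first, so the order is AA, AB, ...
  x <- do.call(paste0, rev(x))
  x[seq_len(l)]
}

few  <- expand_letters(3)    # "A" "B" "C"
many <- expand_letters(28)   # starts "AA" "AB", ends "BA" "BB"
```

Note that once a question has more than 26 distinct answers, every suffix becomes multi-letter (AA, AB, ...) instead of continuing from Z to AA, so the suffix assigned to a given answer can change when the number of distinct answers crosses that boundary.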

Created on 2023-03-20 with reprex v2.0.2
