将同一数据集多个版本的相似变量编译成R中的一个数据集

yqhsw0fo  于 2022-12-30  发布在  其他
关注(0)|答案(2)|浏览(92)

我正在使用同一调查数据集的多个版本。这是一个单独级别的调查,要求(几乎)每年一次,但问题/覆盖范围可能会发生变化。此外,虽然大多数wave中会提出类似的问题,但问题ID已发生变化。我需要将数据集编译为一个累积数据集。为此,我希望仅提取感兴趣的问题,并重新编码,使所有变量的结果相同。
我为自己编制了一个索引,列出每年每项问题的涵盖范围,现将其中一部分内容载列如下:

Edition P           S        A       E     F
1995    p33         s1       s2     s17   s20
1996    p40         s1       s2     s14   s16a
1997    sp58        s1       s2     s10   s12a
1998    sp53        s1       s2     s11   s14a
2000    P54ST       S1       S2      S6  reeduc1
2001    p55st       s1       s2      s6  reeduc1
2002    p45st       s1       s2      s6  reeduc1
2003    p54st       s1       s2      s6    s18
2004    p30st       s1       s2      s6   reeduc1
2005    p48st       s6       s7     s11   reeduc1
2006    p38st       s6       s7     s11   reeduc1
2007    p64st       s10      s11    s15   reeduc1
2008    p61st       s8       s9     s15   reeduc1
2009    P35ST       s5       s6     s12   reeduc1
2010    P29ST       S7       S8     S14   REEDUC1
2011    P38ST       S16      S17    S21   REEDUC1
2013    P22TGBSM    S10      S11    S17   REEDUC_1
2015    P23TGBSM    S12      S13    S19   REEDUC_1
2016    P15STGBS    SEXO    EDAD    S13   REEDUC_1
2017    P16STGBS    SEXO    EDAD    S14   REEDUC.1
2018    P21STGBS.A  SEXO    EDAD    S10 
2020    P50STGBS.A  SEXO    EDAD    S16   REEDUC.1

“版本”列是版本(波)。其他列是在整个数据集中询问的所有问题,我已经为这些问题的名称分配了字母,这些字母可以成为每个问题的新的标准化列名称。在这些列名称下面是每个相应的调查波内的所有单个变量名称。注意,在特定的调查波中没有询问一些问题,因此它们在关键词中留空。例如,2018年变量F的上述关键词中差距是一个缺失的问题。我仍然需要将该数据集包含在累积数据集中,因此我需要输入该问题和调查波(以及特定波中的任何其他缺失问题)的NA值。
假设每个文件都是一个名为“surveyXXXX”的.csv文件(带有版本年份)。将每个调查波中的所需问题合并到一个累积数据集中,并将所有回答保留为单个回答的最简单、最干净的方法是什么?是否有一种方法可以简化该过程,以便可以对我的密钥中包含的所有所需问题进行无缝编码?
来自调查2020.csv数据部分的示例数据

structure(list(numinves = c(2020L, 2020L, 2020L, 2020L, 2020L
), idenpa = c(32L, 32L, 32L, 32L, 32L), numentre = 3200001:3200005, 
    reg = c(32002L, 32001L, 32002L, 32002L, 32211L), ciudad = c(32301917L, 
    32001001L, 32301955L, 32301932L, 32211004L), tamciud = c(-4L, 
    -4L, -4L, -4L, -4L), comdist = c(2L, 1L, 280L, 230L, 926L
    ), edad = c(63L, 24L, 20L, 54L, 38L), sexo = c(2L, 1L, 1L, 
    2L, 1L), codigo = c(-4L, -4L, -4L, -4L, -4L)), row.names = c("1", 
"2", "3", "4", "5"), class = "data.frame")

如您所见,变量值均为数值。所有勘测波均如此,因为数值对应于外部码本。显然,并非所有所需变量均位于 Dataframe 的开头,也不是以每个波的相同顺序出现。

ecfdbz9o

ecfdbz9o1#

我刚刚看到了zephryl的答案,但我已经做过了,所以这里有一个不基于tidyverse的解决方案:

data_stack <- data.frame(
      Edition = c(1995, 1996, 1997, 1998, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2013, 2015, 2016, 2017, 2018, 2020),
      P = c("p33", "p40", "sp58", "sp53", "P54ST", "p55st", "p45st", "p54st", "p30st", "p48st", "p38st", "p64st", "p61st", "P35ST", "P29ST", "P38ST", "P22TGBSM", "P23TGBSM", "P15STGBS", "P16STGBS", "P21STGBS.A", "P50STGBS.A"),
      S = c("s1", "s1", "s1", "s1", "S1", "s1", "s1", "s1", "s1", "s6", "s6", "s10", "s8", "s5", "S7", "S16", "S10", "S12", "SEXO", "SEXO", "SEXO", "SEXO"),
      A = c("s2", "s2", "s2", "s2", "S2", "s2", "s2", "s2", "s2", "s7", "s7", "s11", "s9", "s6", "S8", "S17", "S11", "S13", "EDAD", "EDAD", "EDAD", "EDAD"),
      E = c("s17", "s14", "s10", "s11", "S6", "s6", "s6", "s6", "s6", "s11", "s11", "s15", "s15", "s12", "S14", "S21", "S17", "S19", "S13", "S14", "S10", "S16"),
      `F` = c("s20", "s16a", "s12a", "s14a", "reeduc1", "reeduc1", "reeduc1", "s18", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "reeduc1", "REEDUC1", "REEDUC1", "REEDUC_1", "REEDUC_1", "REEDUC_1", "REEDUC.1", NA_character_, "REEDUC.")
      )
    
    edition_2015 <- data.frame(P23TGBSM = c(3, 4, 5),
                           S12 = c(6, 7, 8))
    
    edition_2016 <- data.frame(P15STGBS = c(10, 11, 12),
                           SEXO = c(13, 14, 15))
     
    edition_2015
    edition_2016
    
    for (year in 2015:2016) {
      
      edition <- get(paste0("edition_", year))
      
      for (i in 1:ncol(edition)) {
        colnames(edition)[i] <- colnames(data_stack[data_stack$Edition == year,])[(which(as.matrix(data_stack[data_stack$Edition == year,]) == colnames(edition)[i], arr.ind = TRUE))[2]]
      }
      assign(paste0("edition_", year), edition)
    }
    
    edition_2015
    edition_2016
eqqqjvef

eqqqjvef2#

下面是一个基于tidyverse的解决方案:

library(tidyr)
library(purrr)
library(readr)

# Convert key into a list of named character vectors; one per year, with
# original names as values and new names as names
keys_by_year <- key %>% 
  pivot_longer(!Edition) %>% 
  drop_na(value) %>% 
  split(.$Edition) %>% 
  map(\(df) purrr::set_names(df$value, df$name))

# import surveys, naming each with its year
surveys_raw <- map(
  purrr::set_names(c("1995", "2018")),
  \(year) read_csv(paste0("survey", year, ".csv"))
)

# for each survey, select variables wanted by indexing with the key for that
# year, then rename using names from key
surveys <- imap_dfr(
  surveys_raw, 
  \(survey, year) {
    year_key <- keys_by_year[[year]]
    survey <- survey[year_key]
    colnames(survey) <- names(year_key)
    survey
  },
  .id = "Edition"
  )

# result
surveys
# A tibble: 6 × 6
  Edition     P     S     A     E     F
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995        3     4     2     4     4
2 1995        5     3     5     1     5
3 1995        2     1     4     3     5
4 2018        2     1     1     1    NA
5 2018        4     1     4     1    NA
6 2018        5     1     3     3    NA
  • 示例数据:*
set.seed(13)

# including a few "unwanted" columns in each
survey1995 <- data.frame(
  p33 = sample(1:5, 3, replace = TRUE),
  p35 = sample(1:5, 3, replace = TRUE),
  s1 = sample(1:5, 3, replace = TRUE),
  s2 = sample(1:5, 3, replace = TRUE),
  s3 = sample(1:5, 3, replace = TRUE),
  s17 = sample(1:5, 3, replace = TRUE),
  s20 = sample(1:5, 3, replace = TRUE)
)

survey2018 <- data.frame(
  P21STGBS.A = sample(1:5, 3, replace = TRUE),
  SEXO = sample(1:5, 3, replace = TRUE),
  EDAD = sample(1:5, 3, replace = TRUE),
  IDIOMA = sample(1:5, 3, replace = TRUE),
  S10 = sample(1:5, 3, replace = TRUE)
)

write_csv(survey1995, "survey1995.csv")
write_csv(survey2018, "survey2018.csv")

相关问题