R语言 使用一个数据文件中的值选择另一个数据文件中的行

n6lpvg4x  于 2023-01-18  发布在  其他
关注(0)|答案(3)|浏览(323)
    • bounty将在5天后过期**。回答此问题可获得+100声望奖励。socialresearcher希望引起更多人关注此问题。

我有来自多个参与者的数据,我正尝试使用一组文件中的信息来聚合另一组文件中的数据。
我有一个包含四列的数据文件(videodata.csv):Event.Type(视频名称)、Name(开始或结束)、Row(行号)和Source(参与者ID代码)。设置如下:

videodata <- data.frame(
  Event.Type = c("Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3"),
  Name = c("start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end"),
  Row = c(1, 10, 11, 20, 21, 30,
           8, 16, 19, 28, 1, 7,
           2, 8, 21, 26, 9, 20),
  Source = c("012", "012", "012", "012", "012", "012",
             "013", "013", "013", "013", "013", "013",
             "014", "014", "014", "014", "014", "014")
)

我有像80个参与者,所以他们每个人都有另一个csv格式的数据文件(id_data.csv),有两列:timescore。每行对应一秒,因此time列简单地从1编号到大约1,800或其他数字。score01的二进制代码。对于参与者012、013和014,数据如下所示:

`012_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

`013_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

`014_data` <- data.frame(
  time = c(1:30),
  score = c(0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
)

我想使用来自videodata.csv的信息来分割id_data.csv,并将它们保存到与Event.Type对应的单独文件中。
具体来说,我将在videodata.csv文件中查找id号。对于给定的id号,我将从第一个Event.Type开始(例如电影1)并找到开始row和结束row。我将使用此信息转到该ID号的数据文件,选择score的那些行,并将它们保存到名为movie1.csv的文件中。我将对每个参与者重复,直到movie1.csv包含所有参与者的数据。
例如,对于参与者012,我希望将id_data.csv文件分成三个块,分别对应于Movie1、Movie2和Movie3。为此,我将:
1.我将选择行1 - 10(基于RowRow表示Start和End),方法是向上舍入到最接近的行号
1.将score列保存到名为Movie1的文件中,列标题012来自Source
1.对参与者013重复上述步骤,并将其score列追加到上一行旁边。
1.然后对参与者014重复,依此类推,直到80。
Movie1的结果文件可能如下所示:

它包括行1 - 10(对于012)、行8 - 16(对于013)和行2 - 8(对于014)中的score值。
然后,我将拥有另一个对应于Movie2的文件(通过选择第11 - 20、19 - 28和21 - 26行),依此类推,每个参与者的ID可能有稍微不同的行数。
这超出了我的技能,我将感谢任何帮助,我如何才能完成这一点。谢谢!

wnvonmuf

wnvonmuf1#

由于我假设每个_data.csv文件中的行数可能不同,因此我假设按以下方式合并数据:

data_list <- list()
data_list[["012_data"]] <- `012_data`
data_list[["013_data"]] <- `013_data`
data_list[["014_data"]] <- `014_data`

请注意,这里是一个列表而不是 Dataframe ,因为数据结构和行计数可能会有变化。在应用于list.fileslapply函数中阅读文件时可以达到同样的效果。
然后我们用这些文件中的数据创建“长”数据.frame。

rm(dl)

library(dplyr)
library(tidyr)

for (filename in names(data_list)) {
  if (exists("dl")) {
    dl <- dl %>% 
      union_all(
        data_list[[filename]] %>% 
          mutate(fileName = strsplit(filename, "_")[[1]][1]) # Here we use base R function
      )
  } else {
    dl <- data_list[[filename]] %>% 
      mutate(fileName = strsplit(filename, "_")[[1]][1])
  }
}

所有这些步骤都可以在文件上传时完成。
dl较长,因此便于进一步操作:

> dl
   time score fileName
1     1     0      012
2     2     0      012
3     3     0      012
4     4     0      012
5     5     1      012
...

其余所有工作都在一根管道中完成:

videodata %>% 
  pivot_wider(names_from = Name, values_from = Row) %>% 
  left_join(dl, by = c("Source" = "fileName")) %>% 
  filter(time >= start& time <= end) %>% 
  select(-c(start, end, time)) %>% 
  group_by(Event.Type, Source) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = Source, values_from = score)

您可以看到逐步执行管道的中间结果。管道返回以下输出:

# A tibble: 32 × 5
# Groups:   Event.Type [3]
   Event.Type    id `012` `013` `014`
   <chr>      <int> <dbl> <dbl> <dbl>
 1 Movie1         1     0     0     1
 2 Movie1         2     0     1     0
 3 Movie1         3     0     0     0
 4 Movie1         4     0     0     0
 5 Movie1         5     1     1     1
 6 Movie1         6     0     0     0
 7 Movie1         7     0     0     1
 8 Movie1         8     0     0    NA
 9 Movie1         9     1     1    NA
10 Movie1        10     0    NA    NA
11 Movie2         1     0     0     0
12 Movie2         2     0     0     0
13 Movie2         3     0     0     0
14 Movie2         4     0     0     0
15 Movie2         5     0     0     0
16 Movie2         6     0     0     0
17 Movie2         7     0     0    NA
18 Movie2         8     0     0    NA
19 Movie2         9     0     0    NA
20 Movie2        10     0     0    NA
21 Movie3         1     0     0     0
22 Movie3         2     0     0     0
23 Movie3         3     0     0     0
24 Movie3         4     0     0     0
25 Movie3         5     0     0     0
26 Movie3         6     0     0     0
27 Movie3         7     0     0     0
28 Movie3         8     0    NA     0
29 Movie3         9     0    NA     0
30 Movie3        10     0    NA     0
31 Movie3        11    NA    NA     0
32 Movie3        12    NA    NA     0

此代码的另一个优点是,它不包含对电影名称的任何引用,因此可以像其他答案中那样无需修改地进行缩放。它对所选项目数量的差异具有鲁棒性(参见电影1 == 10和电影3 == 12中的元素数量)。如果012的元素数量小于右列013、014等的元素数量,它也将正常工作

disbfnqx

disbfnqx2#

假设您的id数据位于目录data\id中,并且它们都具有相同的2列timescore,我们可以使用readr::read_csv()一次读取所有这些文件,同时指定id参数,该参数将文件路径保存为一列。然后,我们可以拆分该文件路径,只保留我们想要的部分,以获得与视频数据匹配的Source列。
然后,我们可以连接Source列上的数据,但必须在对两列进行宽透视之后。连接好数据后,我们可以对它们进行长透视,并通过使用start和end参数查找时间来定义一个新的Event.Type列。最后,我们只选择所需的列,并再次对它们进行宽透视,以获得所需的输出。
这将仍然包含所有的电影,但您可以轻松地添加一个dplyr::filter(Event.Type == "Movie1")来获得唯一的文件。

library(tidyverse)
library(here)

# Here I take your dummy data and write them as csv's

`012_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
) %>% write_csv(here::here('data', 'id', '012_data.csv'))

`013_data` <- data.frame(
  time = c(1:30),
  score = c(0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
) %>% write_csv(here::here('data', 'id', '013_data.csv'))

`014_data` <- data.frame(
  time = c(1:30),
  score = c(0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
) %>% write_csv(here::here('data', 'id', '014_data.csv'))

# Read them back in all at once into a single tibble

id_data <- here::here('data', 'id') %>% 
  list.files(full.names = T) %>% 
  readr::read_csv(id = 'Source') %>% 
  dplyr::mutate(Source = stringr::str_split_i(Source, pattern = "\\/", i = -1),
                Source = stringr::str_remove(Source, pattern = "_data.csv")) %>% 
  tidyr::pivot_wider(names_from = time, values_from = score)

videodata <- data.frame(
  Event.Type = c("Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3",
                 "Movie1", "Movie1", "Movie2", "Movie2", "Movie3", "Movie3"),
  Name = c("start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end",
           "start", "end", "start", "end","start", "end"),
  Row = c(1, 10, 11, 20, 21, 30,
          8, 16, 19, 28, 1, 7,
          2, 8, 21, 26, 9, 20),
  Source = c("012", "012", "012", "012", "012", "012",
             "013", "013", "013", "013", "013", "013",
             "014", "014", "014", "014", "014", "014")
)  %>% 
  tidyr::pivot_wider(names_from = c(Event.Type,Name), values_from = Row)

df <- dplyr::left_join(videodata, id_data) %>% 
  tidyr::pivot_longer(cols = 8:37, names_to = "time", values_to = "score") %>%
  dplyr::mutate(
    time = as.numeric(time),
    Event.Type = dplyr::case_when(
      time >= Movie1_start & time <= Movie1_end ~ 'Movie1',
      time >= Movie2_start & time <= Movie2_end ~ 'Movie2',
      time >= Movie3_start & time <= Movie3_end ~ 'Movie3',
      T ~ as.character(NA)),
    score = ifelse(is.na(Event.Type), NA, score)) %>%
  dplyr::select(Event.Type, Source, score) %>% 
  tidyr::drop_na() %>%
  dplyr::group_by(Event.Type, Source) %>%
  dplyr::mutate(id = row_number()) %>% 
  tidyr::pivot_wider(names_from = Source, values_from = score)

df
# A tibble: 32 × 5
# Groups:   Event.Type [3]
   Event.Type    id `012` `013` `014`
   <chr>      <int> <dbl> <dbl> <dbl>
 1 Movie1         1     0     0     1
 2 Movie1         2     0     1     0
 3 Movie1         3     0     0     0
 4 Movie1         4     0     0     0
 5 Movie1         5     1     1     1
 6 Movie1         6     0     0     0
 7 Movie1         7     0     0     1
 8 Movie1         8     0     0    NA
 9 Movie1         9     1     1    NA
10 Movie1        10     0    NA    NA
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

更新:我的更新答案从@asd-tm得到了启发。他们的答案总体上更好,但也许你可以使用两者的一些信息,比如从我这里阅读文件,从他们那里清理数据。

5vf7fwbs

5vf7fwbs3#

我设法生成了你想要的文件。请检查代码中的注解以获得更深入的解释。
我认为最后一个代码块(双循环)在时间上可能不是最有效的解决方案,但它是我能想到的“最干净”的解决方案。有时候为了易于理解而牺牲一些性能会更好。

### load necessary packages
library(data.table)

### read file from disk (I used your data)
videodata = fread("videodata.csv")
    
### read individual files from disk. If not exactly 80, adjust
### the number. If the filename pattern doesn't match exactly,
### adjust accordingly. (I used your data)

individualFileNames = paste(sprintf("%03d", 1:80), "data.csv", sep = "_")

### files are read into a **list**
individualData = lapply(individualFileNames, fread)

### we need this because of your naming convention :'(
### In general, it's bad idea to name files starting with a number
names(individualData) = paste(sprintf("%03d", 1:80), "_data", sep = "")
    
### a function to pad with NAs the vectors of different lengts
padNAs = function(x){
  n = max(sapply(x, length))
  for (i in 1 : length(x)){
    length(x[[i]])= n
  }
  as.data.table(x)
}

### This double loop will go over each event type (movie) and
### each source in it, and create a table of the scores (which
### are obtained from each individual data table). The table
### is then exported as a csv file with the name of the movie.

for (i in videodata[, unique(Event.Type)]){
  j_score = list()
  counter = 1
  for (j in videodata[, unique(Source)]){
    filename = paste0(i, ".csv")
    startRow = videodata[Event.Type == i & Source == j & Name == "start", Row]
    endRow = videodata[Event.Type == i & Source == j & Name == "end", Row]
    j_score[[counter]] = individualData[paste0(j, "_data")][[1]][startRow : endRow, score]
    names(j_score)[counter] = j
    counter = counter + 1
  }
  fwrite(padNAs(j_score), file = filename)
}

相关问题