基于拆分大 Dataframe 的列值创建新 Dataframe

zdwk9cvp  于 2022-12-25  发布在  其他
关注(0)|答案(3)|浏览(134)

我尝试将一个非常大的 Dataframe 拆分为多个 Dataframe ,其中列值以“-----. I.”开头,直到到达“-----. I I”。
我有一个包含超过190万行的大 Dataframe (通过阅读文本文件创建,然后准备成df),

a = c("pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0","----------------------- NET XI50|XI1|XI15|net311","X","garbage","pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0","----------------------- NET XI50|XI1|XI15|net311","Y","garbage","pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0")
b = c("r321_1096","r321_1098","r321_1097","r321_1095","r321_1093","-------------------------------","Z","garbage","r321_1096","r321_1098","r321_1097","r321_1095","r321_1093","-------------------------------","P","garbage","r321_1096","r321_1098","r321_1097","r321_1095","r321_1093")
c = c("0.04","0.04","-1","0.04","-1","","Q","garbage","0.04","0.04","-1","0.04","-1","","R","garbage","0.04","0.04","-1","0.04","-1")
d = c("0.32","0.32","0","0.32","0","","S","garbage","0.32","0.32","0","0.32","0","","T","garbage","0.32","0.32","0","0.32","0")
df = data.frame(a, b, c, d)

我希望将df拆分为较小的 Dataframe ,其中列“a”以“-------------------- NET XI 50”开头|十一1|十一十五|net 311”,当遇到“---------------------- NET XI 50”时, Dataframe 结束|十一1|十一十五|net 311”进行第二次访问。
我想保留所有新 Dataframe 的列名a、B和c,并删除较小 Dataframe 中包含“垃圾”值的行。稍后,我计划在这些 Dataframe 上进行一些常见计算。但我想不出一种方法来将大 Dataframe 拆分为可用的 Dataframe 。有什么想法吗?

wwtsj6pe

wwtsj6pe1#

你可以使用split函数来完成这个任务,它依赖于为每一组创建一个对应的因子,你可以使用cumsum来计算A列中与----匹配的行数。结果是一个 Dataframe 列表(我还过滤掉了“垃圾”行):

split(df[df$a!="garbage",] , cumsum(grepl("----",df[df$a!="garbage","a"])))

$`0`
           a         b    c    d
1 pass-100.0 r321_1096 0.04 0.32
2 pass-100.0 r321_1098 0.04 0.32
3 pass-100.0 r321_1097   -1    0
4 pass-100.0 r321_1095 0.04 0.32
5 pass-100.0 r321_1093   -1    0

$`1`
                                                  a                               b    c    d
6  ----------------------- NET XI50|XI1|XI15|net311 -------------------------------          
7                                                 X                               Z    Q    S
9                                        pass-100.0                       r321_1096 0.04 0.32
10                                       pass-100.0                       r321_1098 0.04 0.32
11                                       pass-100.0                       r321_1097   -1    0
12                                       pass-100.0                       r321_1095 0.04 0.32
13                                       pass-100.0                       r321_1093   -1    0

$`2`
                                                  a                               b    c    d
14 ----------------------- NET XI50|XI1|XI15|net311 -------------------------------          
15                                                Y                               P    R    T
17                                       pass-100.0                       r321_1096 0.04 0.32
18                                       pass-100.0                       r321_1098 0.04 0.32
19                                       pass-100.0                       r321_1097   -1    0
20                                       pass-100.0                       r321_1095 0.04 0.32
21                                       pass-100.0                       r321_1093   -1    0
lnxxn5zx

lnxxn5zx2#

我建议你这样做:

df |> 
  mutate(group = cumsum(a |> str_starts("---"))) |> 
  nest(data = -group) |> 
  rowwise(group) |> 
  summarise(
    data = list(
      if (group == 0) data else slice_tail(data, n = -3)
    )
  )

#> # A tibble: 3 × 2
#>   group data            
#>   <int> <list>          
#> 1     0 <tibble [5 × 4]>
#> 2     1 <tibble [5 × 4]>
#> 3     2 <tibble [5 × 4]>

**说明:**首先,我们创建一个列,用于指示每个新区块的开始时间

df |> 
  mutate(
    group = cumsum(a |> str_starts("----"))
  ) 
#> # A tibble: 21 × 5
#>    a                                                b          c     d     group
#>    <chr>                                            <chr>      <chr> <chr> <int>
#>  1 pass-100.0                                       r321_1096  "0.0… "0.3…     0
#>  2 pass-100.0                                       r321_1098  "0.0… "0.3…     0
#>  3 pass-100.0                                       r321_1097  "-1"  "0"       0
#>  4 pass-100.0                                       r321_1095  "0.0… "0.3…     0
#>  5 pass-100.0                                       r321_1093  "-1"  "0"       0
#>  6 ----------------------- NET XI50|XI1|XI15|net311 ---------… ""    ""        1
#>  7 X                                                Z          "Q"   "S"       1
#>  8 garbage                                          garbage    "gar… "gar…     1
#>  9 pass-100.0                                       r321_1096  "0.0… "0.3…     1
#> 10 pass-100.0                                       r321_1098  "0.0… "0.3…     1
#> # … with 11 more rows

现在你可以将这些组嵌套到单独的 Dataframe 中

df |> 
  mutate(
    group = cumsum(a |> str_starts("----"))
  ) |> 
  nest(data = -group)
#> # A tibble: 3 × 2
#>   group data            
#>   <int> <list>          
#> 1     0 <tibble [5 × 4]>
#> 2     1 <tibble [8 × 4]>
#> 3     2 <tibble [8 × 4]>

然后可以使用行方式和汇总方式的组合来处理

df |> 
  mutate(
    group = cumsum(a |> str_starts("----"))
  ) |> 
  nest(data = -group) |> 
  rowwise(group) |> 
  summarise(
    data = list(
      if (group == 0) data else slice_tail(data, n = -3)
    )
  )
#> # A tibble: 3 × 2
#>   group data            
#>   <int> <list>          
#> 1     0 <tibble [5 × 4]>
#> 2     1 <tibble [5 × 4]>
#> 3     2 <tibble [5 × 4]>
lp0sw83n

lp0sw83n3#

或者使用group_split(实验性的,但很方便)的dplyr等价物。
但是,我建议您在创建数据框之前在txt文件中执行以下操作:对于190万行来说,它(可能)更容易、更安全、更快。

library(dplyr)
library(stringr)

df |>
  group_by(group = cumsum(str_detect(a,
                                     fixed("----------------------- NET XI50|XI1|XI15|net311")))) |> # Fixed is faster but approximate
  filter(between(row_number(), 4, n()) & group > 0 | group == 0) |>
  group_split(.keep = FALSE)

输出:

[[1]]
# A tibble: 5 × 4
  a          b         c     d    
  <chr>      <chr>     <chr> <chr>
1 pass-100.0 r321_1096 0.04  0.32 
2 pass-100.0 r321_1098 0.04  0.32 
3 pass-100.0 r321_1097 -1    0    
4 pass-100.0 r321_1095 0.04  0.32 
5 pass-100.0 r321_1093 -1    0    

[[2]]
# A tibble: 5 × 4
  a          b         c     d    
  <chr>      <chr>     <chr> <chr>
1 pass-100.0 r321_1096 0.04  0.32 
2 pass-100.0 r321_1098 0.04  0.32 
3 pass-100.0 r321_1097 -1    0    
4 pass-100.0 r321_1095 0.04  0.32 
5 pass-100.0 r321_1093 -1    0    

[[3]]
# A tibble: 5 × 4
  a          b         c     d    
  <chr>      <chr>     <chr> <chr>
1 pass-100.0 r321_1096 0.04  0.32 
2 pass-100.0 r321_1098 0.04  0.32 
3 pass-100.0 r321_1097 -1    0    
4 pass-100.0 r321_1095 0.04  0.32 
5 pass-100.0 r321_1093 -1    0

相关问题