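I'm trying to split a very large data frame into multiple data frames, each beginning at a row where a column value starts with a dashed separator ("-----...") and running until the next such row.
I have a large data frame with more than 1.9 million rows (created by reading a text file and then shaping it into a df). A small example: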
a = c("pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0","----------------------- NET XI50|XI1|XI15|net311","X","garbage","pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0","----------------------- NET XI50|XI1|XI15|net311","Y","garbage","pass-100.0","pass-100.0","pass-100.0","pass-100.0","pass-100.0")
b = c("r321_1096","r321_1098","r321_1097","r321_1095","r321_1093","-------------------------------","Z","garbage","r321_1096","r321_1098","r321_1097","r321_1095","r321_1093","-------------------------------","P","garbage","r321_1096","r321_1098","r321_1097","r321_1095","r321_1093")
c = c("0.04","0.04","-1","0.04","-1","","Q","garbage","0.04","0.04","-1","0.04","-1","","R","garbage","0.04","0.04","-1","0.04","-1")
d = c("0.32","0.32","0","0.32","0","","S","garbage","0.32","0.32","0","0.32","0","","T","garbage","0.32","0.32","0","0.32","0")
df = data.frame(a, b, c, d)
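I want to split df into smaller data frames, each starting where column "a" begins with "----------------------- NET XI50|XI1|XI15|net311" and ending when "----------------------- NET XI50|XI1|XI15|net311" is encountered a second time.
I want to keep the column names a, b, and c in all the new data frames, and drop the rows that contain the value "garbage" from the smaller data frames. Later I plan to run some common calculations on these data frames, but I can't think of a way to split the large data frame into usable pieces. Any ideas?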
3 Answers
Answer 1
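You can use the `split` function for this. It needs a grouping factor with one level per group, which you can build with `cumsum` by counting the rows in column a that match `----`. The result is a list of data frames (I also filtered out the "garbage" rows).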
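The answer's original code is not shown here, so the following is only a minimal sketch of the `split` plus `cumsum` idea in base R. It uses a shortened, two-column version of the question's data; the names `grp`, `keep`, and `pieces` are made up for illustration.

```r
# Shortened stand-in for the question's df (only columns a and b, fewer rows)
df <- data.frame(
  a = c("pass-100.0", "pass-100.0",
        "----------------------- NET XI50|XI1|XI15|net311", "X", "garbage",
        "pass-100.0",
        "----------------------- NET XI50|XI1|XI15|net311", "Y", "garbage",
        "pass-100.0"),
  b = c("r321_1096", "r321_1098",
        "-------------------------------", "Z", "garbage",
        "r321_1096",
        "-------------------------------", "P", "garbage",
        "r321_1096"),
  stringsAsFactors = FALSE
)

# Running count of separator rows: 0 before the first separator, then 1, 2, ...
grp <- cumsum(startsWith(df$a, "----"))

# Drop the "garbage" rows, then split the remaining rows by group id
keep <- df$a != "garbage"
pieces <- split(df[keep, ], grp[keep])

names(pieces)  # one list element per block, named by its group id
```

Every element of `pieces` keeps the original column names. Rows before the first separator end up in group "0", which can be dropped with `pieces[["0"]] <- NULL` if it is not wanted.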
Answer 2
I'd suggest doing it this way:
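**Explanation:** First, we create a column that marks where each new block starts.
Now you can nest these groups into separate data frames.
Then you can process them with a combination of `rowwise` and `summarise`.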
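The three steps above might look roughly like this with dplyr and tidyr. This is a sketch under assumptions, not the answerer's code: the column name `block`, the shortened data, and the per-block summaries (`n_rows`, `n_pass`) are all illustrative.

```r
library(dplyr)
library(tidyr)

# Shortened stand-in for the question's df (only columns a and b)
df <- data.frame(
  a = c("pass-100.0", "pass-100.0",
        "----------------------- NET XI50|XI1|XI15|net311", "X", "garbage",
        "pass-100.0",
        "----------------------- NET XI50|XI1|XI15|net311", "Y", "garbage",
        "pass-100.0"),
  b = c("r321_1096", "r321_1098",
        "-------------------------------", "Z", "garbage",
        "r321_1096",
        "-------------------------------", "P", "garbage",
        "r321_1096"),
  stringsAsFactors = FALSE
)

nested <- df %>%
  # Step 1: block id that increases by one at every separator row
  mutate(block = cumsum(grepl("^-{4,}", a))) %>%
  filter(a != "garbage") %>%
  # Step 2: one row per block, with that block's rows in the list-column `data`
  nest(data = -block)

# Step 3: row-wise processing of each nested data frame
result <- nested %>%
  rowwise() %>%
  summarise(
    block  = block,
    n_rows = nrow(data),
    n_pass = sum(data$a == "pass-100.0")
  )
```

Inside `rowwise()`, `data` refers to the single nested data frame of the current row, so any per-block calculation can be written against it directly.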
Answer 3
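Or use `group_split` (experimental, but handy), the dplyr equivalent. However, I'd recommend doing the following on the txt file before creating the data frame: for 1.9 million rows it is (probably) easier, safer, and faster.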
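The layout of the original text file is not given, so the sketch below rests on loud assumptions: a whitespace-delimited file with four fields per data line, separator lines starting with `----`, and "garbage" lines containing that literal word. The temporary file and the names `path`, `chunks`, and `dfs` are hypothetical. Unlike the question's wording, it also drops the separator lines so that the numeric columns parse cleanly.

```r
# Hypothetical input: a temp file standing in for the real text file
path <- tempfile(fileext = ".txt")
writeLines(c(
  "pass-100.0 r321_1096 0.04 0.32",
  "pass-100.0 r321_1098 -1 0",
  "----------------------- NET XI50|XI1|XI15|net311",
  "garbage",
  "pass-100.0 r321_1096 0.04 0.32",
  "----------------------- NET XI50|XI1|XI15|net311",
  "garbage",
  "pass-100.0 r321_1097 -1 0",
  "pass-100.0 r321_1095 0.04 0.32"
), path)

lines  <- readLines(path)
is_sep <- startsWith(lines, "----")

# Block id per line: increases by one at every separator line
block <- cumsum(is_sep)

# Keep only data lines (assumption: separators and garbage are not needed)
keep <- !is_sep & !grepl("garbage", lines, fixed = TRUE)

# Split the raw lines per block, then parse each chunk into its own data frame
chunks <- split(lines[keep], block[keep])
dfs <- lapply(chunks, function(x) {
  read.table(text = x, col.names = c("a", "b", "c", "d"),
             stringsAsFactors = FALSE)
})
```

Working on the raw lines avoids building one 1.9-million-row data frame full of mixed string values first; each chunk is parsed on its own, so columns c and d come out numeric.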
Output: