Sum by distinct column value in R

Posted by brccelvz on 2023-02-10 in Other

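I have a very large data frame in R and would like to sum two columns for every distinct value in another column. For example, say we have a data frame of transactions in various shops over one day, as follows: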

shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3), 
  'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'), 
  'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'), 
  'sale' = c(12, 5, 9, 15, 10, 18), 
  'profit' = c(3, 1, 3, 6, 5, 9))

That is:

shop_id  shop_name    city      sale profit
   1     Shop A       London    12   3
   1     Shop A       London    5    1
   1     Shop A       London    9    3
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    10   5
   3     Shop C       Dublin    18   9

I want to sum the sale and profit for each shop, to give:

shop_id  shop_name    city      sale profit
   1     Shop A       London    26   7
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    28   14

I am currently using the following code to do this:

library(plyr)  # provides ddply()
shop_day <- ddply(shop, "shop_id", transform, sale = sum(sale), profit = sum(profit))
shop_day <- subset(shop_day, !duplicated(shop_id))

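This works absolutely fine, but as I said, my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows that I want to sum), and my code takes a very long time to run and then eventually says it has run out of memory.
Does anyone know the most efficient way to do this?
Thanks in advance!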

Source: https://stackoverflow.com/questions/11782030/sum-by-distinct-column-value-in-r

5 Answers

thigvfpy1#

The obligatory data.table answer:

> library(data.table)
data.table 1.8.0  For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
     shop_id sale profit
[1,]       1   26      7
[2,]       2   15      6
[3,]       3   28     14
>

That all sounds fine until things get larger...

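library(plyr)  # provides ddply() and .() for the timing below
# 10 million rows spread over 10 shop ids (letters[1:10] is recycled)
shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)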

> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
   user  system elapsed 
  4.156   1.324   5.514 
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.728   0.108   0.840 
>

You get a further speed-up if you create the data.table with a key:

shop.dt <- data.table(shop, key='shop_id')

> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.252   0.084   0.336 
>
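
The calls above group by shop_id only, so shop_name and city drop out of the result. Below is a minimal sketch of one way to keep them, assuming each shop_id maps to exactly one shop_name and city (true for the sample data, an assumption for the real data): group by all three columns and sum the columns listed in .SDcols. The name value_cols is made up for this illustration.

```r
library(data.table)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)
shop.dt <- as.data.table(shop)

# Hypothetical name: the columns to total
value_cols <- c("sale", "profit")

# One row per (shop_id, shop_name, city); sum() is applied to each column in .SDcols
shop_day <- shop.dt[, lapply(.SD, sum),
                    by = .(shop_id, shop_name, city),
                    .SDcols = value_cols]
print(shop_day)
```

With many columns to total, value_cols could be built from names(shop.dt), for example with setdiff(), rather than typed out.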

h9a6wy2h2#

I think the neatest way to do this is with dplyr:

library(dplyr)
shop %>% 
  group_by(shop_id, shop_name, city) %>% 
  summarise_all(sum)
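
Current dplyr releases mark summarise_all() as superseded in favour of across(). Here is a sketch of the same grouping written that way, assuming dplyr 1.0.0 or later; the column names come from the question's sample data.

```r
library(dplyr)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

# Sum the two named columns within each shop and return an ungrouped result
shop_day <- shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise(across(c(sale, profit), sum), .groups = "drop")
print(shop_day)
```

Naming the columns explicitly avoids accidentally summing a column that was never meant to be totalled.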

omtl5h9j3#

Here is how to use base R to speed this up:

idx <- split(1:nrow(shop), shop$shop_id)
a2 <- data.frame(shop_id=sapply(idx, function(i) shop$shop_id[i[1]]),
                 sale=sapply(idx, function(i) sum(shop$sale[i])), 
                 profit=sapply(idx, function(i) sum(shop$profit[i])) )

On my system this cuts the time to 0.75 sec, versus 5.70 sec for the ddply summarise version.
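
Base R also ships rowsum(), which sums the rows of a matrix or data frame by a grouping vector. The following is a sketch of that alternative, not the approach timed above. It assumes each shop_id has a single shop_name and city, and the names sums and labels are made up for illustration.

```r
# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

# Column sums per shop_id; the group values come back as row names
sums <- rowsum(shop[, c("sale", "profit")], group = shop$shop_id)
sums$shop_id <- as.numeric(rownames(sums))

# One row of descriptive columns per shop
labels <- unique(shop[, c("shop_id", "shop_name", "city")])

# Join the labels back onto the totals
shop_day <- merge(labels, sums, by = "shop_id")
print(shop_day)
```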

lvmkulzt4#

Just in case the list of columns is long, use summarise_if().

This summarises every column whose data type is int:

library(dplyr)
shop %>% 
  group_by(shop_id, shop_name, city) %>% 
  summarise_if(is.integer, sum)
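
One caveat with this predicate: the sample data in the question is built with c(12, 5, 9, ...), which R stores as double, so is.integer() is FALSE for sale and profit and they would not be summed. Here is a sketch using is.numeric() instead, which is TRUE for both integer and double columns. It assumes every numeric column outside the grouping columns should be totalled.

```r
library(dplyr)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)

is.integer(shop$sale)  # FALSE: the column is double, not integer

# Sum every numeric non-grouping column within each shop
shop_day <- shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_if(is.numeric, sum) %>%
  ungroup()
print(shop_day)
```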

hivapdat5#

Sorry, my English is not very good.
I have data like this:

Group  X
A      2
A      1
C      1
B      5
A      2
C      1
C      2
B      5
B      5

I want a table that gives me the sum of the unique values by group, like this:

Group  X
A      3
B      5
C      3
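
A minimal base R sketch for this follow-up, assuming the data sits in a data frame with columns Group and X (the name df is made up here): drop repeated values within each group with unique(), then sum what remains.

```r
# Data from the post, entered by hand
df <- data.frame(
  Group = c("A", "A", "C", "B", "A", "C", "C", "B", "B"),
  X     = c(2, 1, 1, 5, 2, 1, 2, 5, 5)
)

# For each group, count each distinct value of X once, then add them up
result <- tapply(df$X, df$Group, function(v) sum(unique(v)))
print(result)  # A: 3, B: 5, C: 3
```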
