I have a large data frame that looks like this:
df <- data.frame(dive = factor(sample(c("dive1","dive2"), 10, replace=TRUE)),
speed = runif(10)
)
> df
dive speed
1 dive1 0.80668490
2 dive1 0.53349584
3 dive2 0.07571784
4 dive2 0.39518628
5 dive1 0.84557955
6 dive1 0.69121443
7 dive1 0.38124950
8 dive2 0.22536126
9 dive1 0.04704750
10 dive2 0.93561651
My goal is to average the values of one column over the rows where another column equals a given value, and to repeat this for every such value. In the example above, I would like the mean of the speed column for each unique value of dive: the mean of speed where dive == dive1, and likewise for every other value of dive.
9 Answers

Answer 1
There are many ways to do this in R. Specifically: `by`, `aggregate`, `split`, and `plyr`, `cast`, `tapply`, `data.table`, `dplyr`, and so forth.

Broadly speaking, these problems are of the form split-apply-combine. Hadley Wickham has written a beautiful article that will give you deeper insight into the whole category of problems, and it is well worth reading. His `plyr` package implements the strategy for general data structures, and `dplyr` is a newer implementation tuned for performance on data frames. They allow for solving problems of the same form but of even greater complexity than this one. They are well worth learning as a general tool for solving data-manipulation problems.

Performance is an issue on very large datasets, and for that it is hard to beat solutions based on `data.table`. If you only deal with medium-sized datasets or smaller, however, taking the time to learn `data.table` is likely not worth the effort. `dplyr` can also be fast, so it is a good choice if you want to speed things up but don't quite need the scalability of `data.table`.

Many of the other solutions below do not require any additional packages. Some of them are even fairly fast on medium-large datasets. Their primary disadvantage is either one of metaphor or of flexibility. By metaphor I mean that it is a tool designed for something else being coerced to solve this particular type of problem in a 'clever' way. By flexibility I mean they lack the ability to solve as wide a range of similar problems or to easily produce tidy output.
Examples

`base` functions

`tapply`:
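The original code block here was stripped in extraction; a minimal sketch of the `tapply` approach, re-creating the question's `df` with a seed so the snippet is self-contained:

```r
# Re-create the question's data (seed added for reproducibility)
set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# Split df$speed by df$dive and average each piece;
# the result is a named numeric vector, one entry per group
tapply(df$speed, df$dive, mean)
```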
`aggregate`:
`aggregate` takes in data.frames, outputs data.frames, and uses a formula interface.
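The stripped example presumably used the formula method; a sketch:

```r
set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# "speed, grouped by dive": data.frame in, data.frame out
aggregate(speed ~ dive, data = df, FUN = mean)
```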
`by`:
In its most user-friendly form, it takes in vectors and applies a function to them. However, its output is not in a very manipulable form. To get around this, for simple uses of `by`, the `as.data.frame` method in the `taRifx` library works.
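A sketch of the basic `by` call (the `taRifx` workaround is left out, since I cannot verify that package's current API):

```r
set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# One mean per level of dive; the result is a "by" object,
# which prints nicely but is awkward to manipulate further
by(df$speed, df$dive, mean)
```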
`split`:
As the name suggests, it performs only the "split" part of the split-apply-combine strategy. To make the rest work, I'll write a small function that uses `sapply` for apply-combine.
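A sketch of such a helper, assuming the question's `df`:

```r
set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# split() performs the "split" step; sapply() does apply + combine,
# collapsing the result to a named numeric vector
splitmean <- function(df) {
  s <- split(df, df$dive)
  sapply(s, function(x) mean(x$speed))
}
splitmean(df)
```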
`sapply` automatically simplifies the result as much as possible. In our case, that means a vector rather than a data.frame, since we have only one dimension of results.

External packages
`data.table`:
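A sketch (the `.()` alias and `by` grouping are standard data.table syntax; `mean_speed` is just an illustrative column name):

```r
library(data.table)

set.seed(42)
dt <- data.table(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# The j expression is evaluated within each by group; returns a data.table
dt[, .(mean_speed = mean(speed)), by = dive]
```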
`dplyr`:
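A sketch of the dplyr idiom (again, the answer's original code did not survive scraping):

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# group_by() sets the grouping; summarise() collapses each group to one row
df %>%
  group_by(dive) %>%
  summarise(mean_speed = mean(speed))
```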
`plyr` (the precursor of `dplyr`):

Here's what the official page has to say about `plyr`: it's already possible to do this with `base` R functions (like `split` and the `apply` family of functions), but `plyr` makes it all a bit easier, among other things through convenient parallelisation with the `foreach` package.
packageIn other words, if you learn one tool for split-apply-combine manipulation it should be
plyr
.reshape2:
The `reshape2` library is not designed with split-apply-combine as its primary focus. Instead, it uses a two-part melt/cast strategy to perform a wide variety of data reshaping tasks. However, since it allows an aggregation function it can be used for this problem. It would not be my first choice for split-apply-combine operations, but its reshaping capabilities are powerful and thus you should learn this package as well.

Benchmarks
10 rows, 2 groups
As usual, `data.table` has a little more overhead so comes in about average for small datasets. These are microseconds, though, so the differences are trivial. Any of the approaches works fine here, and you should choose based on what matters to you:

- `plyr` is always worth learning for its flexibility;
- `data.table` is worth learning if you plan to analyze huge datasets;
- `by`, `aggregate`, and `split` are all base R functions and thus universally available.

10 million rows, 10 groups
But what if we have a big dataset? Let's try 10^7 rows split over ten groups.
Then `data.table`, or `dplyr` operating on `data.table`s, is clearly the way to go. Certain approaches (`aggregate` and `dcast`) are beginning to look very slow.

10 million rows, 1,000 groups
If you have more groups, the difference becomes more pronounced. With 1,000 groups and the same 10^7 rows:
So `data.table` continues to scale well, and `dplyr` operating on a `data.table` also works well, while the `split`/`sapply` strategy appears to scale poorly in the number of groups (which suggests that `split()` is likely slow while `sapply` is fast). `by` remains relatively efficient: a 5-second run is certainly noticeable to the user, but not unreasonable for a dataset this large. Still, at this scale `data.table` is clearly the best choice: 100% `data.table`, or `dplyr` operating on a `data.table`, as a viable alternative.

Answer 2
2015 update with dplyr:
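The update's code was stripped; a typical dplyr pipeline for this task (a sketch, not the answer's exact code; the `n()` count is a common companion to the group mean):

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

df %>%
  group_by(dive) %>%
  summarise(avg_speed = mean(speed),  # group mean
            n         = n())          # group size
```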
Answer 3
Answer 4
Adding an alternative base R approach that remains fast in a variety of cases.
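The answer's code did not survive extraction; one base R approach matching this description (an assumption on my part) combines `rowsum()` group sums with `tabulate()` group counts:

```r
set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# rowsum() gives fast per-group sums (a one-column matrix, one row per level);
# tabulate() counts occurrences of each factor level, in the same level order
rowsum(df$speed, df$dive) / tabulate(df$dive)
```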
Borrowing @Ari's benchmarks:
10 rows, 2 groups
10 million rows, 10 groups
10 million rows, 1,000 groups
Answer 5
Using the new function `across`:
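A sketch with `across` (available from dplyr 1.0.0 onward):

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# across() applies a function to the selected columns within each group
df %>%
  group_by(dive) %>%
  summarise(across(speed, mean))
```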
Answer 6
We already have plenty of options for getting means by group; here is one more, from the `mosaic` package. This returns a named numeric vector; if needed, we can wrap it in `stack` to get a data.frame.
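A sketch, assuming mosaic's formula method for `mean` (the package adds formula interfaces to common summary functions):

```r
library(mosaic)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# "speed broken down by dive": returns a named numeric vector,
# one mean per group; wrap it per the prose above if a data.frame is needed
mean(speed ~ dive, data = df)
```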
Answer 7
Using the `collapse` package:
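A sketch using collapse's fast grouped mean, `fmean()` (the grouping vector is passed as `g`):

```r
library(collapse)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# Fast grouped mean: returns a named numeric vector, one entry per group
fmean(df$speed, g = df$dive)
```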
Answer 8
Expanding on the answer provided by RCchelsie: if someone wants the group-wise mean of all columns in a data frame:
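A sketch with `across(everything(), mean)`; here `speed` is the only non-grouping column, but the same call scales to any number of numeric columns:

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# everything() selects all non-grouping columns; mean is applied to each
df %>%
  group_by(dive) %>%
  summarise(across(everything(), mean))
```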
Answer 9
With `dplyr` 1.1.0 (and above), we can use the `.by` argument for temporary grouping. This makes the code shorter (since we avoid the `group_by` and `ungroup` statements), and `.by` always returns an ungrouped data frame.
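A sketch of the `.by` form:

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# .by groups only for this one summarise() call;
# the result comes back ungrouped, with no ungroup() needed
summarise(df, speed = mean(speed), .by = dive)
```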