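I'm working in R with a very large data frame, df (more than 5.7 million rows), with the columns listed below (from colnames(df)).
The day_of_week column holds values such as "Fri", "Wed", and so on... you know, the days of the week.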
[1] "ride_id" "rideable_type" "started_at" "ended_at" "member_casual" "date"
[7] "month" "day" "year" "day_of_week" "ride_length"
At one point I needed some aggregate statistics, which I computed with the following code:
aggregate(ride_length ~ member_casual + day_of_week, data = df, FUN = sum)
Result_1:
member_casual day_of_week ride_length
1 casual Fri 9461812 mins
2 member Fri 5962077 mins
3 casual Mon 8198659 mins
4 member Mon 5853290 mins
5 casual Sat 15505160 mins
6 member Sat 6303482 mins
7 casual Sun 13364709 mins
8 member Sun 5485445 mins
9 casual Thu 8023285 mins
10 member Thu 6646226 mins
11 casual Tue 6859473 mins
12 member Tue 6300679 mins
13 casual Wed 6901511 mins
14 member Wed 6488745 mins
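I want to reorder the days of the week from this alphabetical order into the usual order (Sunday, Monday, ..., Saturday), so that the output looks like this:
Result_2: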
member_casual day_of_week ride_length
1 casual Sun 13364709 mins
2 member Sun 5485445 mins
3 casual Mon 8198659 mins
4 member Mon 5853290 mins
5 casual Tue 6859473 mins
6 member Tue 6300679 mins
7 casual Wed 6901511 mins
8 member Wed 6488745 mins
9 casual Thu 8023285 mins
10 member Thu 6646226 mins
11 casual Fri 9461812 mins
12 member Fri 5962077 mins
13 casual Sat 15505160 mins
14 member Sat 6303482 mins
To get Result_2, I tried the following two approaches:
1. With the pipe (%>%):
df$day_of_week <- df %>%
ordered(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
2. Without the pipe:
df$day_of_week <- ordered(df$day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
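The first snippet ran for more than 1.5 hours without finishing (I had to click STOP to interrupt it, so all that waiting was wasted), while the second took less than 1 second. To be clear, I'm using RStudio Desktop and all of my data is stored locally on my PC, so I assume this is not a network issue.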
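For reference, here is a minimal sketch of why the two snippets are not equivalent. It uses a tiny made-up data frame in place of the real one and assumes the magrittr package (which supplies %>%) is installed. The pipe hands its entire left-hand side to the function as the first argument, so the piped attempt calls ordered() on the whole data frame rather than on a single column. Piping the column itself avoids that.

```r
library(magrittr)  # supplies %>%; assumed to be installed

# Tiny stand-in for the real 5.7-million-row data frame (values are made up)
df <- data.frame(
  member_casual = c("casual", "member", "casual", "member"),
  day_of_week   = c("Fri", "Sun", "Wed", "Mon"),
  stringsAsFactors = FALSE
)
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# `df %>% ordered(day_of_week, levels = days)` is rewritten by the pipe as
# `ordered(df, day_of_week, levels = days)`: the entire data frame becomes
# the first argument, not the day_of_week column.

# Pipe-friendly version: pipe the column, so the column is the first argument
df$day_of_week <- df$day_of_week %>% ordered(levels = days)

levels(df$day_of_week)      # the seven day names in calendar order
is.ordered(df$day_of_week)  # TRUE
```

On a data frame this small the difference is invisible; the point is only which object ends up as the first argument of ordered().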
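So I have four questions:
1. Has anyone run into a similar problem? Why is there such a difference?
2. If the problem is caused by using the pipe, does that mean the pipe is preferable in some situations and should be avoided in others?
3. In general, what is $ for, and when should it be used?
4. After my aggregate() output came out as Result_2, I happened to notice that the rows of the original df do not start with Sun, followed by Mon, and so on. In fact, the row order of df had not changed at all! Why is that? If I want df itself arranged like the aggregate() output in Result_2, what should I do?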
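As a rough illustration of the last two questions, the sketch below uses a small made-up data frame with ride_length as plain numbers. It assumes nothing beyond base R. The names totals and df_sorted are hypothetical. Converting a column to an ordered factor with $ changes only how its values are labelled and compared, so aggregate() lists the groups in level order, while the rows of df stay where they were until they are sorted explicitly, for example with order().

```r
# Made-up miniature data; ride_length is plain numeric minutes here
df <- data.frame(
  member_casual = c("casual", "member", "casual", "member", "casual"),
  day_of_week   = c("Fri", "Sun", "Wed", "Mon", "Sun"),
  ride_length   = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# `$` extracts or replaces one column; this changes the column's type only,
# the rows of df are not moved
df$day_of_week <- ordered(df$day_of_week, levels = days)

# aggregate() reports the groups in factor-level order (Sun, Mon, ...)
totals <- aggregate(ride_length ~ member_casual + day_of_week,
                    data = df, FUN = sum)

# To reorder df itself, sort its rows explicitly by day, then by rider type
df_sorted <- df[order(df$day_of_week, df$member_casual), ]
```

With dplyr loaded, arrange(df, day_of_week, member_casual) would be the piped equivalent of the last line.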
Thanks in advance! Any suggestions would be greatly appreciated!
1 Answer
Could you try this?