R: Huge difference in processing speed between two apparently similar pieces of code. Is it caused by using $ instead of the pipe?

pftdvrlh, posted on 2022-12-20 in Other

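I'm working in R with a large df (more than 5.7 million rows) that has the column names shown below. The day_of_week column holds values such as "Fri", "Wed", and so on: the days of the week.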

[1] "ride_id"       "rideable_type" "started_at"    "ended_at"      "member_casual" "date"         
 [7] "month"         "day"           "year"          "day_of_week"   "ride_length"

At some point I need to look at some aggregate statistics, using the following code:

aggregate(ride_length ~ member_casual + day_of_week, data = df, FUN = sum)

result_1:

member_casual day_of_week   ride_length
1         casual         Fri  9461812 mins
2         member         Fri  5962077 mins
3         casual         Mon  8198659 mins
4         member         Mon  5853290 mins
5         casual         Sat 15505160 mins
6         member         Sat  6303482 mins
7         casual         Sun 13364709 mins
8         member         Sun  5485445 mins
9         casual         Thu  8023285 mins
10        member         Thu  6646226 mins
11        casual         Tue  6859473 mins
12        member         Tue  6300679 mins
13        casual         Wed  6901511 mins
14        member         Wed  6488745 mins

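I want to rearrange the days of the week from this unordered state into the usual order (Sunday, Monday, ... Saturday), so that the output looks like this: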

result_2:

member_casual day_of_week   ride_length
1         casual         Sun 13364709 mins
2         member         Sun  5485445 mins
3         casual         Mon  8198659 mins
4         member         Mon  5853290 mins
5         casual         Tue  6859473 mins
6         member         Tue  6300679 mins
7         casual         Wed  6901511 mins
8         member         Wed  6488745 mins
9         casual         Thu  8023285 mins
10        member         Thu  6646226 mins
11        casual         Fri  9461812 mins
12        member         Fri  5962077 mins
13        casual         Sat 15505160 mins
14        member         Sat  6303482 mins

To get result_2, I tried the following two approaches:

1. With the pipe (%>%):

df$day_of_week <- df %>%
     ordered(day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

2. Without the pipe:

df$day_of_week <- ordered(df$day_of_week, levels=c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))

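The first snippet ran for more than 1.5 hours (I had to click STOP to interrupt it, so all that waiting was wasted), while the second took less than 1 second. To be clear, I'm using RStudio Desktop and all my data is stored on my PC, so I assume this isn't a network issue.
So I have four questions:
1. Has anyone run into a similar problem? Why is there such a difference?
2. If the problem is caused by using the pipe, does that mean the pipe is preferable in some situations and should be avoided in others?
3. In general, what is $ for, and when should it be used?
4. After my aggregate() output came out as result_2, I noticed that the original df does not start with the Sun rows, followed by Mon, and so on. In fact, the row order of df did not change at all! Why? If I want df itself arranged like the aggregate() result_2, what should I do?
Thanks in advance! Any advice would be greatly appreciated!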

3htmauhk1#

Could you try this?

library(tidyverse)

df %>% 
  arrange(factor(
    day_of_week,
    ordered = TRUE,
    levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  ))

# A tibble: 14 × 3
   member_casual day_of_week ride_length
   <chr>         <chr>             <dbl>
 1 casual        Sun            13364709
 2 member        Sun             5485445
 3 casual        Mon             8198659
 4 member        Mon             5853290
 5 casual        Tue             6859473
 6 member        Tue             6300679
 7 casual        Wed             6901511
 8 member        Wed             6488745
 9 casual        Thu             8023285
10 member        Thu             6646226
11 casual        Fri             9461812
12 member        Fri             5962077
13 casual        Sat            15505160
14 member        Sat             6303482
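
As a follow-up sketch on the pipe-versus-$ part of the question: `x %>% f(y)` is evaluated as `f(x, y)`, so the piped call in the question hands the entire data frame to `ordered()` as its first argument instead of the single column. That is most likely why it behaves so differently from the `$` version, although the exact cause of the long run time has not been verified here. The example below uses a small made-up data frame (`toy`, `day_levels`, `totals` and `sorted` are illustrative names, not from the question) to show one way to do the conversion inside a pipe, and why changing factor levels does not move any rows.

```r
library(dplyr)

# Toy data standing in for the 5.7-million-row df (values are made up).
toy <- data.frame(
  member_casual = c("casual", "member", "casual", "member", "casual", "member"),
  day_of_week   = c("Fri", "Fri", "Sun", "Sun", "Mon", "Mon"),
  ride_length   = c(10, 20, 30, 40, 50, 60)
)
day_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# x %>% f(y) is f(x, y): the pipe supplies the whole data frame as the
# first argument, so the column has to be converted inside mutate().
toy <- toy %>%
  mutate(day_of_week = factor(day_of_week, levels = day_levels, ordered = TRUE))

# Setting levels changes how the column sorts, not where the rows sit:
# the rows are still in the order Fri, Fri, Sun, Sun, Mon, Mon.
as.character(toy$day_of_week)

# aggregate() groups by the factor, so its output follows the level order.
totals <- aggregate(ride_length ~ member_casual + day_of_week,
                    data = toy, FUN = sum)

# To reorder the rows of the data frame itself, sort explicitly.
sorted <- toy %>% arrange(day_of_week, member_casual)
```

The base R line from the question, `df$day_of_week <- ordered(df$day_of_week, levels = ...)`, does the same conversion as the `mutate()` call: `$` extracts (or, on the left of `<-`, replaces) one column by name. Neither style is inherently faster; what matters is that the function receives the column rather than the whole data frame.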
