Here is my transaction data. It records transfers from the account in the from column to the account in the to column, along with the date and amount of each transaction:
data

         id   from     to       date  amount
      <int> <fctr> <fctr>     <date>   <dbl>
      19521   6644   6934 2005-01-01   700.0
      19524   6753   8456 2005-01-01   600.0
      19523   9242   9333 2005-01-01  1000.0
        ...    ...    ...        ...     ...
    1056317   7819   7454 2010-12-31    60.2
    1056318   6164   7497 2010-12-31   107.5
    1056319   7533   7492 2010-12-31   164.1
For each transaction, I want to compute the total amount that the account in the from column received during the 6 months before that transaction's date, and save this as a new column. The following code does the job fine on a small dataset of, say, 1,000 rows:
library(dplyr)
library(purrr)

# For every row: sum all amounts received by this `from` account
# (i.e. rows where it appears in `to`) within the 180 days up to
# and including the transaction date.
data %>%
  mutate(total_trx_amount_received_in_last_sixmonth = map2_dbl(
    from, date,
    ~ sum(amount[to == .x & between(date, .y - 180, .y)])
  ))
However, since my data has more than a million rows, this code takes hours to complete. I searched the internet for ways to speed it up and tried this suggestion on how to make purrr map functions run faster. Following it, I used data.table together with furrr instead of mutate from dplyr, hoping to speed the code up:
library(future)
library(furrr)
library(data.table)

plan(multisession)  # without a plan, future_pmap_dbl() runs sequentially
setDT(data)         # `data` must be a data.table for := to work

# Same per-row computation as above, mapped in parallel.
data[, total_trx_amount_received_in_last_sixmonth := future_pmap_dbl(
  list(from, date),
  ~ sum(amount[to == .x & between(date, .y - 180, .y)])
)]
However, the speed did not improve at all. Does anyone have a suggestion for making this code run faster?
dput() output of the data:
data <- data.frame(
id = c(
18529L, 13742L, 9913L, 956L, 2557L, 1602L, 18669L, 35900L,
48667L, 51341L, 53713L, 60126L, 60545L, 65113L, 66783L, 83324L,
87614L, 88898L, 89874L, 94765L, 100277L, 101587L, 103444L, 108414L,
113319L, 121516L, 126607L, 130170L, 131771L, 135002L, 149431L,
157403L, 157645L, 158831L, 162597L, 162680L, 163901L, 165044L,
167082L, 168562L, 168940L, 172578L, 173031L, 173267L, 177507L,
179167L, 182612L, 183499L, 188171L, 189625L, 193940L, 198764L,
199342L, 200134L, 203328L, 203763L, 204733L, 205651L, 209672L,
210242L, 210979L, 214532L, 214741L, 215738L, 216709L, 220828L,
222140L, 222905L, 226133L, 226527L, 227160L, 228193L, 231782L,
232454L, 233774L, 237836L, 237837L, 238860L, 240223L, 245032L,
246673L, 247561L, 251611L, 251696L, 252663L, 254410L, 255126L,
255230L, 258484L, 258485L, 259309L, 259910L, 260542L, 262091L,
264462L, 264887L, 264888L, 266125L, 268574L, 272959L
),
from = c(
"5370", "5370", "5370", "8605", "5370", "6390", "5370", "5370", "8934",
"5370", "5635", "6046", "5680", "8026", "9037", "5370", "7816", "8046",
"5492", "8756", "5370", "9254", "5370", "5370", "7078", "6615", "5370",
"9817", "8228", "8822", "5735", "7058", "5370", "8667", "9315", "6053",
"7990", "8247", "8165", "5656", "9261", "5929", "8251", "5370", "6725",
"5370", "6004", "7022", "7442", "5370", "8679", "6491", "7078", "5370",
"5370", "5370", "5658", "5370", "9296", "8386", "5370", "5370", "5370",
"9535", "5370", "7541", "5370", "9621", "5370", "7158", "8240", "5370",
"5370", "8025", "5370", "5370", "5370", "6989", "5370", "7059", "5370",
"5370", "5370", "9121", "5608", "5370", "5370", "7551", "5370", "5370",
"5370", "5370", "9163", "9362", "6072", "5370", "5370", "5370", "5370",
"5370"
),
to = c(
"9356", "5605", "8567", "5370", "5636", "5370", "8933", "8483", "5370",
"7626", "5370", "5370", "5370", "5370", "5370", "9676", "5370", "5370",
"5370", "5370", "9105", "5370", "9772", "6979", "5370", "5370", "7564",
"5370", "5370", "5370", "5370", "5370", "8744", "5370", "5370", "5370",
"5370", "5370", "5370", "5370", "5370", "5370", "5370", "7318", "5370",
"8433", "5370", "5370", "5370", "7122", "5370", "5370", "5370", "8566",
"6728", "9689", "5370", "8342", "5370", "5370", "5614", "5596", "5953",
"5370", "7336", "5370", "7247", "5370", "7291", "5370", "5370", "6282",
"7236", "5370", "8866", "8613", "9247", "5370", "6767", "5370", "9273",
"7320", "9533", "5370", "5370", "8930", "9343", "5370", "9499", "7693",
"7830", "5392", "5370", "5370", "5370", "7497", "8516", "9023", "7310",
"8939"
),
date = as.Date(c(
"2005-05-31", "2005-08-05", "2005-09-12", "2005-10-05", "2005-11-12",
"2005-11-26", "2005-11-30", "2006-01-31", "2006-03-31", "2006-04-11",
"2006-04-30", "2006-05-28", "2006-05-31", "2006-06-10", "2006-06-15",
"2006-08-31", "2006-09-09", "2006-09-13", "2006-09-18", "2006-10-07",
"2006-10-31", "2006-10-31", "2006-11-08", "2006-11-30", "2006-12-11",
"2007-01-05", "2007-01-13", "2007-01-24", "2007-01-29", "2007-01-31",
"2007-03-24", "2007-04-13", "2007-04-14", "2007-04-23", "2007-04-30",
"2007-04-30", "2007-05-06", "2007-05-09", "2007-05-13", "2007-05-23",
"2007-05-27", "2007-05-31", "2007-06-03", "2007-06-05", "2007-06-13",
"2007-06-22", "2007-06-30", "2007-06-30", "2007-07-13", "2007-07-22",
"2007-07-31", "2007-08-13", "2007-08-14", "2007-08-21", "2007-08-31",
"2007-08-31", "2007-08-31", "2007-09-05", "2007-09-13", "2007-09-14",
"2007-09-20", "2007-09-30", "2007-09-30", "2007-09-30", "2007-10-05",
"2007-10-13", "2007-10-20", "2007-10-27", "2007-10-31", "2007-10-31",
"2007-10-31", "2007-11-05", "2007-11-12", "2007-11-13", "2007-11-19",
"2007-11-30", "2007-11-30", "2007-11-30", "2007-12-05", "2007-12-13",
"2007-12-19", "2007-12-24", "2007-12-31", "2007-12-31", "2007-12-31",
"2008-01-04", "2008-01-05", "2008-01-05", "2008-01-09", "2008-01-09",
"2008-01-10", "2008-01-11", "2008-01-12", "2008-01-13", "2008-01-17",
"2008-01-18", "2008-01-18", "2008-01-21", "2008-01-27", "2008-01-31"
)),
amount = c(
24.4, 7618, 21971, 5245, 2921, 8000, 169.2, 71.5, 14.6, 4214, 14.6, 13920,
14.6, 24640, 1600, 261.1, 16400, 3500, 2700, 19882, 182, 14.6, 16927, 25653,
3059, 2880, 9658, 4500, 12480, 14.6, 1000, 3679, 34430, 12600, 14.6, 19.2,
4900, 826, 3679, 2100, 38000, 79, 11400, 21495, 3679, 200, 14.6, 100.6, 3679,
5300, 108.9, 3679, 2696, 7500, 171.6, 14.6, 99.2, 2452, 3679, 3218, 700, 69.7,
14.6, 91.5, 2452, 3679, 2900, 17572, 14.6, 14.6, 90.5, 2452, 49752, 3679,
1900, 14.6, 870, 85.2, 2452, 3679, 1600, 540, 14.6, 14.6, 79, 210, 2452,
28400, 720, 180, 420, 44289, 489, 3679, 840, 2900, 150, 870, 420, 14.6
)
)
4 Answers

Answer 1 — eimct9ow
This is just a non-equi join in data.table. You can create a variable date - 180 and restrict the join to dates between that variable and the current date. This should be fast.
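The answer's own code was not captured here; below is a minimal sketch of the non-equi self-join it describes. The helper column window_start and the intermediate table received are my names, not the answer's:

library(data.table)
setDT(data)

# start of the 180-day look-back window for each row
data[, window_start := date - 180]

# Non-equi self-join: for each row i, collect the rows x whose `to`
# equals i's `from` and whose date lies in [i's window_start, i's date],
# then sum their amounts per row of i (by = .EACHI keeps i's row order;
# na.rm = TRUE turns no-match NAs into 0).
received <- data[data,
                 on = .(to = from, date >= window_start, date <= date),
                 .(total = sum(amount, na.rm = TRUE)),
                 by = .EACHI]

data[, total_trx_amount_received_in_last_sixmonth := received$total]

Because the join runs entirely inside data.table's compiled code, it avoids the one-R-call-per-row overhead of the map2_dbl() version.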
Answer 2 — jm81lzqq

Here is an option using data.table:
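The answer's code was not preserved in this capture. One data.table way to get there, sketched here with my own names recv and cum_upto, combines a per-account cumulative sum with rolling joins:

library(data.table)
setDT(data)

# total received per account per day, plus a running cumulative sum
recv <- data[, .(received = sum(amount)), by = .(to, date)]
setkey(recv, to, date)
recv[, cum := cumsum(received), by = to]

# rolling join: cumulative amount received by `acct` up to and including `d`
cum_upto <- function(acct, d) {
  recv[.(acct, d), on = .(to, date), roll = TRUE, x.cum]
}

# receipts in [date - 180, date] = cum(date) - cum(date - 181)
data[, total_trx_amount_received_in_last_sixmonth :=
       fcoalesce(cum_upto(from, date), 0) -
       fcoalesce(cum_upto(from, date - 181), 0)]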
Answer 3 — oxiaedzo

Here is an option using window functions. Note, however, that they require complete daily data to work, so the amount of memory needed may be large (there must be one row per from account per day). Also note that this approach only pays off for large datasets, or when the computation is performed directly in a database: converting the raw data into a gap-free format takes considerable setup time, and joining the results back at the end takes time as well. Still, the sliding function itself runs at a fairly constant speed regardless of data size, whereas subsetting gets slower as the data being subset grows.

Update: from the comments it became clear that you want to total the values for each from account, which I had not understood at first. The updated code changes all the summaries to use to instead of from. Also, you want values even where a full 6 months of data is not available, so you add .partial = TRUE.
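The answer's updated code is not included in this capture. A minimal sketch of the approach it describes — pad each account's receipts out to gap-free daily data, run a sliding window over it, and join back — written here with slider::slide_dbl, whose default .complete = FALSE keeps the partial windows that the .partial = TRUE above refers to; daily and received_6mo are my placeholder names:

library(dplyr)
library(tidyr)
library(slider)

daily <- data %>%
  # total received per account per day
  group_by(to, date) %>%
  summarise(received = sum(amount), .groups = "drop") %>%
  # pad to one row per account per day (the memory-heavy step)
  complete(to, date = seq(min(data$date), max(data$date), by = "day"),
           fill = list(received = 0)) %>%
  arrange(to, date) %>%
  group_by(to) %>%
  # window of 181 rows = the current day plus the 180 days before it
  mutate(received_6mo = slide_dbl(received, sum, .before = 180)) %>%
  ungroup()

# join the rolling totals back onto the original transactions
result <- data %>%
  left_join(select(daily, to, date, received_6mo),
            by = c("from" = "to", "date"))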
Answer 4 — u5rb5r59

A 1M-record dataset is small enough not to need parallelization. There are plenty of ways to do this that "look" right but are not... careful!

First, you may be wondering why your original approach is so slow. R is an interpreted array language: to get acceptable performance out of anything, you must pass vectors to fast functions that are pre-compiled in a low-level language. If you map a function over the elements of the dataset, you throw away most of the benefit of vectorization — purrr::map, base::lapply and the like all perform roughly on par with a pre-allocated for loop, i.e. not well. You are making over a million individual function calls (one per record), and parallelizing that can only improve performance by a factor of your core count, minus some overhead.
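As a quick illustration of that per-call overhead (my example, not the answer's):

x <- runif(1e6)
system.time(sum(x))                   # one call into compiled code
system.time(purrr::map_dbl(x, ~ .x))  # ~1e6 R-level calls, far slower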
To clarify the problem, my approach is: first total by account and day, then compute a rolling total across days, and then merge the result back onto the original rows by day.
So, what about performance? Much better than what you report, at least for me. For a 100k-record dataset (100 accounts, 2 years of information) I get 0.6 seconds on my laptop. For a 1M-record dataset (1,000 accounts, 2 years) I get 7-8 seconds using microbenchmark. Probably not the most efficient way possible, but acceptable considering I did no optimization and did not use data.table, which is usually the key to high-performance two-dimensional operations in R.

Grouping with dplyr still means we make one call per account to the fast pre-compiled function RcppRoll::roll_sum(), which is not ideal from a performance standpoint, but at least it is one function call per account rather than one per record. You may also want to look at the single-pass rolling-window functions implemented in the RollingWindow package, as they may be faster still.
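The answer's code itself is not preserved in this capture; below is a minimal sketch of the pipeline it describes — daily totals per account, one RcppRoll::roll_sum() call per account, then a join back — with rolling, received and received_6mo as my placeholder names:

library(dplyr)
library(tidyr)
library(RcppRoll)

rolling <- data %>%
  # step 1: total received per account per day
  group_by(to, date) %>%
  summarise(received = sum(amount), .groups = "drop") %>%
  # pad with zero rows so a fixed 181-row window spans exactly 181 days
  complete(to, date = seq(min(data$date), max(data$date), by = "day"),
           fill = list(received = 0)) %>%
  arrange(to, date) %>%
  # step 2: one pre-compiled roll_sum() call per account, not per record
  group_by(to) %>%
  mutate(received_6mo = roll_sum(received, n = 181, align = "right",
                                 fill = NA)) %>%
  ungroup() %>%
  select(to, date, received_6mo)

# step 3: join the rolling totals back onto the original transactions
out <- data %>%
  left_join(rolling, by = c("from" = "to", "date"))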