R: total amount received in the 6 months before the transaction date

Asked by o8x7eapl on 2023-05-11
Followers (0) | Answers (4) | Views (107)

Here is my transaction data. Each row records a transaction from the account in the from column to the account in the to column, along with its date and amount:

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1056317     7819    7454        2010-12-31  60.2
1056318     6164    7497        2010-12-31  107.5
1056319     7533    7492        2010-12-31  164.1

For each transaction, I want to compute the total amount that the account in the from column received in the 6 months preceding the date of that transaction, and save this information as a new column.
The code below does this fine on a small dataset of, say, 1000 rows:

library(dplyr)
library(purrr)
data %>% 
  mutate(total_trx_amount_received_in_last_sixmonth = map2_dbl(
    from, date,
    ~ sum(amount[to == .x & between(date, .y - 180, .y)])
  ))

However, since my data has more than 1 million rows, this code takes hours to finish. I searched online for ways to reduce the runtime and tried this suggestion on how to make purrr map functions run faster. So instead of dplyr's mutate, I used data.table together with furrr to speed the code up:

library(future)
library(data.table)
library(furrr)
data[, total_trx_amount_received_in_last_sixmonth := furrr::future_pmap_dbl(list(from, date),
  ~ sum(amount[to == .x & between(date, .y - 180, .y)]))]

However, the speed did not improve at all.
Any suggestions for making this code run faster?
dput() output of the data:

data <- data.frame(
  id = c(
    18529L, 13742L, 9913L, 956L, 2557L, 1602L, 18669L, 35900L,
    48667L, 51341L, 53713L, 60126L, 60545L, 65113L, 66783L, 83324L,
    87614L, 88898L, 89874L, 94765L, 100277L, 101587L, 103444L, 108414L,
    113319L, 121516L, 126607L, 130170L, 131771L, 135002L, 149431L,
    157403L, 157645L, 158831L, 162597L, 162680L, 163901L, 165044L,
    167082L, 168562L, 168940L, 172578L, 173031L, 173267L, 177507L,
    179167L, 182612L, 183499L, 188171L, 189625L, 193940L, 198764L,
    199342L, 200134L, 203328L, 203763L, 204733L, 205651L, 209672L,
    210242L, 210979L, 214532L, 214741L, 215738L, 216709L, 220828L,
    222140L, 222905L, 226133L, 226527L, 227160L, 228193L, 231782L,
    232454L, 233774L, 237836L, 237837L, 238860L, 240223L, 245032L,
    246673L, 247561L, 251611L, 251696L, 252663L, 254410L, 255126L,
    255230L, 258484L, 258485L, 259309L, 259910L, 260542L, 262091L,
    264462L, 264887L, 264888L, 266125L, 268574L, 272959L
  ),
  from = c(
    "5370", "5370", "5370", "8605", "5370", "6390", "5370", "5370", "8934",
    "5370", "5635", "6046", "5680", "8026", "9037", "5370", "7816", "8046",
    "5492", "8756", "5370", "9254", "5370", "5370", "7078", "6615", "5370",
    "9817", "8228", "8822", "5735", "7058", "5370", "8667", "9315", "6053",
    "7990", "8247", "8165", "5656", "9261", "5929", "8251", "5370", "6725",
    "5370", "6004", "7022", "7442", "5370", "8679", "6491", "7078", "5370",
    "5370", "5370", "5658", "5370", "9296", "8386", "5370", "5370", "5370",
    "9535", "5370", "7541", "5370", "9621", "5370", "7158", "8240", "5370",
    "5370", "8025", "5370", "5370", "5370", "6989", "5370", "7059", "5370",
    "5370", "5370", "9121", "5608", "5370", "5370", "7551", "5370", "5370",
    "5370", "5370", "9163", "9362", "6072", "5370", "5370", "5370", "5370",
    "5370"
  ),
  to = c(
    "9356", "5605", "8567", "5370", "5636", "5370", "8933", "8483", "5370",
    "7626", "5370", "5370", "5370", "5370", "5370", "9676", "5370", "5370",
    "5370", "5370", "9105", "5370", "9772", "6979", "5370", "5370", "7564",
    "5370", "5370", "5370", "5370", "5370", "8744", "5370", "5370", "5370",
    "5370", "5370", "5370", "5370", "5370", "5370", "5370", "7318", "5370",
    "8433", "5370", "5370", "5370", "7122", "5370", "5370", "5370", "8566",
    "6728", "9689", "5370", "8342", "5370", "5370", "5614", "5596", "5953",
    "5370", "7336", "5370", "7247", "5370", "7291", "5370", "5370", "6282",
    "7236", "5370", "8866", "8613", "9247", "5370", "6767", "5370", "9273",
    "7320", "9533", "5370", "5370", "8930", "9343", "5370", "9499", "7693",
    "7830", "5392", "5370", "5370", "5370", "7497", "8516", "9023", "7310",
    "8939"
  ),
  date = as.Date(c(
    "2005-05-31", "2005-08-05", "2005-09-12", "2005-10-05", "2005-11-12",
    "2005-11-26", "2005-11-30", "2006-01-31", "2006-03-31", "2006-04-11",
    "2006-04-30", "2006-05-28", "2006-05-31", "2006-06-10", "2006-06-15",
    "2006-08-31", "2006-09-09", "2006-09-13", "2006-09-18", "2006-10-07",
    "2006-10-31", "2006-10-31", "2006-11-08", "2006-11-30", "2006-12-11",
    "2007-01-05", "2007-01-13", "2007-01-24", "2007-01-29", "2007-01-31",
    "2007-03-24", "2007-04-13", "2007-04-14", "2007-04-23", "2007-04-30",
    "2007-04-30", "2007-05-06", "2007-05-09", "2007-05-13", "2007-05-23",
    "2007-05-27", "2007-05-31", "2007-06-03", "2007-06-05", "2007-06-13",
    "2007-06-22", "2007-06-30", "2007-06-30", "2007-07-13", "2007-07-22",
    "2007-07-31", "2007-08-13", "2007-08-14", "2007-08-21", "2007-08-31",
    "2007-08-31", "2007-08-31", "2007-09-05", "2007-09-13", "2007-09-14",
    "2007-09-20", "2007-09-30", "2007-09-30", "2007-09-30", "2007-10-05",
    "2007-10-13", "2007-10-20", "2007-10-27", "2007-10-31", "2007-10-31",
    "2007-10-31", "2007-11-05", "2007-11-12", "2007-11-13", "2007-11-19",
    "2007-11-30", "2007-11-30", "2007-11-30", "2007-12-05", "2007-12-13",
    "2007-12-19", "2007-12-24", "2007-12-31", "2007-12-31", "2007-12-31",
    "2008-01-04", "2008-01-05", "2008-01-05", "2008-01-09", "2008-01-09",
    "2008-01-10", "2008-01-11", "2008-01-12", "2008-01-13", "2008-01-17",
    "2008-01-18", "2008-01-18", "2008-01-21", "2008-01-27", "2008-01-31"
  )),
  amount = c(
    24.4, 7618, 21971, 5245, 2921, 8000, 169.2, 71.5, 14.6, 4214, 14.6, 13920,
    14.6, 24640, 1600, 261.1, 16400, 3500, 2700, 19882, 182, 14.6, 16927, 25653,
    3059, 2880, 9658, 4500, 12480, 14.6, 1000, 3679, 34430, 12600, 14.6, 19.2,
    4900, 826, 3679, 2100, 38000, 79, 11400, 21495, 3679, 200, 14.6, 100.6, 3679,
    5300, 108.9, 3679, 2696, 7500, 171.6, 14.6, 99.2, 2452, 3679, 3218, 700, 69.7,
    14.6, 91.5, 2452, 3679, 2900, 17572, 14.6, 14.6, 90.5, 2452, 49752, 3679,
    1900, 14.6, 870, 85.2, 2452, 3679, 1600, 540, 14.6, 14.6, 79, 210, 2452,
    28400, 720, 180, 420, 44289, 489, 3679, 840, 2900, 150, 870, 420, 14.6
  )
)

eimct9ow1#

This is just a non-equi join in data.table. You can create a variable date - 180 and restrict the join to dates between that variable and the current date. This should be fast:

library(data.table)
setDT(dt)[, date_minus_180 := date - 180]
dt[, amnt_6_m := .SD[dt, sum(amount, na.rm = TRUE), 
     on = .(to = from, date <= date, date >= date_minus_180), by = .EACHI]$V1]
head(dt, 10)
#        id from   to       date  amount date_minus_180 amnt_6_m
#  1: 18529 5370 9356 2005-05-31    24.4     2004-12-02      0.0
#  2: 13742 5370 5605 2005-08-05  7618.0     2005-02-06      0.0
#  3:  9913 5370 8567 2005-09-12 21971.0     2005-03-16      0.0
#  4:   956 8605 5370 2005-10-05  5245.0     2005-04-08      0.0
#  5:  2557 5370 5636 2005-11-12  2921.0     2005-05-16   5245.0
#  6:  1602 6390 5370 2005-11-26  8000.0     2005-05-30      0.0
#  7: 18669 5370 8933 2005-11-30   169.2     2005-06-03  13245.0
#  8: 35900 5370 8483 2006-01-31    71.5     2005-08-04  13245.0
#  9: 48667 8934 5370 2006-03-31    14.6     2005-10-02      0.0
# 10: 51341 5370 7626 2006-04-11  4214.0     2005-10-13   8014.6
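As a quick sanity check, the joined column can be compared against the row-by-row purrr formula from the question on the small sample (a sketch, assuming `dt` already holds the sample data with `amnt_6_m` computed as above; `check` is a hypothetical helper column):

```r
library(data.table)
library(purrr)

# Recompute the 6-month sum per row with the question's original formula
dt[, check := map2_dbl(from, date,
  ~ sum(amount[to == .x & between(date, .y - 180, .y)]))]

# Both approaches should agree on every row
dt[, all.equal(amnt_6_m, check)]
```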

jm81lzqq2#

Here is one option using data.table:

library(data.table)
setDT(df)
setkey(df, to, date)

# Unique combination of from and date
af <- df[, unique(.SD), .SDcols = c("from", "date")]

# For each combination check sum of incoming in the last 6 months
for (i in 1:nrow(af)) {
  set(
    af, i = i, j = "am6m", 
    value = df[(date) %between% (af$date[[i]] - c(180, 0)) & to == af$from[[i]], sum(amount)]
  )
}
# Join the results into the main data.frame
df[, am6m := af[.SD, on = .(from, date), am6m]]


> tail(df)
#        id from   to       date  amount    am6m
# 1:  18529 5370 9356 2005-05-31    24.4     0.0
# 2: 258484 5370 9499 2008-01-09   720.0 74543.5
# 3: 251611 5370 9533 2007-12-31    14.6 46143.5
# 4:  83324 5370 9676 2006-08-31   261.1 40203.8
# 5: 203763 5370 9689 2007-08-31    14.6 92353.1
# 6: 103444 5370 9772 2006-11-08 16927.0 82671.2

oxiaedzo3#

Here is an option using window functions.
However, they need complete daily data to work, so the amount of memory required can be large (there has to be one row per day for every from).
Also note that this approach only pays off for large datasets, or when the calculation can be pushed directly to a database: converting the raw data into a gap-free format takes significant setup time, and the final join takes time as well.
The sliding functions themselves, though, run in roughly constant time regardless of data size, whereas subsetting gets slower as the data being subset grows.

library(tidyverse)
library(tsibble)

# Calculate the 6 month window
six_mo_rollup <- data %>% 
  ## NOTE: You have to deal with duplicates somehow...either remove
  ## false duplicates or make them not duplicates...
  # We can get a unique from/date combo by summing since we need
  # to sum anyway.
  group_by(from,date) %>%
  summarise(amount = sum(amount),
            .groups = "keep") %>%
  ungroup() %>%
  # Now that each from/date is unique
  # convert data to a tsibble object
  as_tsibble(key = c(from),index = date) %>%
  # window functions can't have any missing time periods...so fill gaps
  # window functions grab 180 rows...not 180 days from the date
  group_by_key() %>%
  fill_gaps(.full = TRUE) %>%
  ungroup() %>%
  # arrange data from lowest to highest so slide can work right.
  arrange(date) %>%
  group_by(from) %>%
  mutate(
    six_mo_sum = slide_dbl(
      amount,
      sum,
      na.rm = TRUE, 
      .size = 180, 
      .align = "right"
    )
  ) %>%
  ungroup() %>%
  # any row without amount was created by fill_gaps in the example
  # so we can drop those rows to save space
  filter(!is.na(amount))

six_mo_rollup %>% filter(from == "5370")
# # A tsibble: 41 x 4 [1D]
# # Key:       from [1]
# from  date        amount six_mo_sum
#  <chr>  <date>      <dbl>      <dbl>
# 1 5370  2005-05-31    24.4        NA 
# 2 5370  2005-08-05  7618          NA 
# 3 5370  2005-09-12 21971          NA 
# 4 5370  2005-11-12  2921          NA 
# 5 5370  2005-11-30   169.      32679.
# 6 5370  2006-01-31    71.5     32751.
# 7 5370  2006-04-11  4214        7376.
# 8 5370  2006-08-31   261.       4475.
# 9 5370  2006-10-31   182         443.
# 10 5370  2006-11-08 16927       17370.
# # ... with 31 more rows

# Join the windowed data to the original dataset
data <- data %>%
  left_join(
    six_mo_rollup %>% select(from,date,six_mo_sum),
    by = c("from","date")
  )

Update:
From the comments it became clear that you want to sum the values received by each account, i.e. group by to rather than from; I did not understand that at first. The update below changes all the aggregation to use to instead of from.
Also, you want values even where a full 6 months of data is not available, so .partial = TRUE is added:

# Calculate the 6 month window
six_mo_rollup <- data %>% 
  ## NOTE: You have to deal with duplicates somehow...either remove
  ## false duplicates or make them not duplicates...
  # We can get a unique from/date combo by summing since we need
  # to sum anyway.
  group_by(to,date) %>%
  summarise(amount = sum(amount),
            .groups = "keep") %>%
  ungroup() %>%
  # Now that each from/date is unique
  # convert data to a tsibble object
  as_tsibble(key = c(to),index = date) %>%
  # window functions can't have any missing time periods...so fill gaps
  # window functions grab 180 rows...not 180 days from the date
  group_by_key() %>%
  fill_gaps(.full = TRUE) %>%
  ungroup() %>%
  # arrange data from lowest to highest so slide can work right.
  arrange(date) %>%
  group_by(to) %>%
  mutate(
    six_mo_sum = slide_dbl(
      amount,
      sum,
      na.rm = TRUE, 
      .size = 180, 
      .align = "right",
      .partial = TRUE
    )
  ) %>%
  ungroup() %>%
  # any row without amount was created by fill_gaps in the example
  # so we can drop those rows to save space
  filter(!is.na(amount))

six_mo_rollup %>% filter(to == "5370")
# # A tsibble: 50 x 4 [1D]
# # Key:       to [1]
# to    date        amount six_mo_sum
# <chr> <date>       <dbl>      <dbl>
# 1 5370  2005-10-05  5245        5245 
# 2 5370  2005-11-26  8000       13245 
# 3 5370  2006-03-31    14.6     13260.
# 4 5370  2006-04-30    14.6      8029.
# 5 5370  2006-05-28 13920       13949.
# 6 5370  2006-05-31    14.6     13964.
# 7 5370  2006-06-10 24640       38604.
# 8 5370  2006-06-15  1600       40204.
# 9 5370  2006-09-09 16400       56604.
# 10 5370  2006-09-13  3500       60104.
# # ... with 40 more rows

# Join the windowed data to the original dataset
data <- data %>%
  left_join(
    six_mo_rollup %>% select(to,date,six_mo_sum),
    by = c("from" = "to","date" = "date")
  )

u5rb5r594#

A 1M-record dataset is small enough not to need parallelization. There are plenty of ways to do this that "look" right but aren't... be careful!
First, you may wonder why your original approach is slow. R is an interpreted array language: to do almost anything with acceptable performance, you have to pass vectors to fast functions that were precompiled in a low-level language. If you "map" a function element-wise over a dataset, you lose most of the benefit of vectorization - purrr::map, base::lapply, etc. all have performance roughly comparable to a preallocated for loop, i.e. not great. You are making over a million individual function calls (one per record), and parallelizing can only improve that by a constant factor (the number of cores, minus some overhead).
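A minimal illustration of the gap between one vectorized call and a million interpreted operations (a sketch; timings are machine-dependent):

```r
x <- runif(1e6)

# One call into precompiled C code: effectively instant
system.time(sum(x))

# One interpreted operation per element: dramatically slower
system.time({
  s <- 0
  for (xi in x) s <- s + xi
  s
})
```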
Some clarifying questions:

  • Can each account make only one transaction per day, or can there be several on the same day? I assume multiple transactions per day are possible.
  • "the amount of transactions that the account in the from column received in the last 6 months before the date of the particular transaction" - I assume this means "ignore transactions made on the same date as the transaction that gets the new field", since there is no way to tell at what time of day they were executed.

My approach: first total the amounts per account per day, then compute a rolling total across days, then join that total onto the following day.

install.packages("RcppRoll") # for roll_sum()
install.packages("tidyr")    # for complete()

library(dplyr)
library(tidyr)

start_date <- as.Date("2018-01-01")
end_date <- as.Date("2020-01-01")
window_size <- 180L

# your example dataset is way too small to assess performance.
# Here is a 100k record dataset.

big_data <- tibble(
  from = as.factor(sapply(1:1000L, function(x) sample(1:100L,100, replace = F))),
  to = as.factor(sapply(1:1000L, function(x) sample(1:100L,100, replace = F))),
  amount = sample(1:10000, 100000, replace = TRUE),
  date = sample(seq.Date(from = start_date, to = end_date, by = "days"), 100000, replace = TRUE)
) %>%
  arrange(date) %>%
  mutate(id = row_number()) %>% 
  ungroup()

# calculate daily sum of values from PRECEDING day for join
daily_summary <- big_data %>%
  group_by(to, date) %>%
  summarize(daily_sum = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  # backfill empty records for data going back 6 months from start
  # this is needed because roll_sum() has no partial mode implemented.
  # and populate missing account - date combinations
  complete(date = seq.Date(from = start_date - window_size, to = end_date, by = "days"), to, fill = list(daily_sum = 0)) %>%
  group_by(to) %>%
  arrange(date) %>%
  mutate(
    total_trx_amount_received_in_last_sixmonth = RcppRoll::roll_sum(daily_sum, align = "right", n = window_size, fill = NA),
    date = date + 1
  ) %>%
  filter(date >= start_date) %>%
  select(date = date, from = to, total_trx_amount_received_in_last_sixmonth)

results <- left_join(big_data, daily_summary, by = c("from", "date"))

So how is the performance? Much better than what you reported, at least for me. For a 100k-record dataset (100 accounts, 2 years of data) I got 0.6 seconds on my laptop. For a 1M-record dataset (1000 accounts, 2 years of data) microbenchmark gave 7-8 seconds. Possibly not the most efficient approach, but an acceptable one, considering that I did no optimization and did not use data.table, which is usually the key to high-performance 2d operations in R.
Grouping with dplyr still means we make one call to the fast precompiled function RcppRoll::roll_sum() per account, which is not ideal performance-wise, but at least it is one function call per account rather than one per record. You may also want to look at the single-pass rolling window functions implemented in the RollingWindow package, since they may be even faster.
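Since a fixed-width rolling sum over gap-free daily data is just a difference of cumulative sums, the single-pass idea can be sketched in dependency-free base R (`roll_sum_base` is a hypothetical helper, not part of any package mentioned above):

```r
# Rolling sum of the last `window` values:
# out[i] = sum(x[(i - window + 1):i]), NA while the window is incomplete
roll_sum_base <- function(x, window) {
  cs <- cumsum(x)
  out <- cs - c(rep(0, window), head(cs, -window))
  out[seq_len(window - 1)] <- NA
  out
}

roll_sum_base(c(1, 2, 3, 4, 5), 2)
# NA 3 5 7 9
```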
