R各组的变化率

wz3gfoph  于 2023-04-03  发布在  其他
关注(0)|答案(2)|浏览(124)

我有一个带有两列(股票名称和价格)的dataframe df:
| 名称|价格|
| --------------|--------------|
| A|第二章|
| A|三|
| A|四|
| A|四|
| B|七|
| B|六|
| C|三|
等等
我需要一个函数来添加另一个列,显示在n个以前的记录的变化率,分别为每个NAME组计算。当记录少于n时,则结果应包括给定组中可用数据的最大数量。例如:RateofChange(1)应添加以下内容:
| 名称|价格|ROC1|
| --------------|--------------|--------------|
| A|第二章|0|
| A|三|0.5分|
| A|四|0.33|
| A|四|0|
| B|七|0|
| B|六| -0.14 |
| C|三|0|
而RateofChange(2)应导致:
| 名称|价格|ROC2|
| --------------|--------------|--------------|
| A|第二章|0|
| A|三|0.5分|
| A|四|1|
| A|四|0.33|
| B|七|0|
| B|六| -0.14 |
| C|三|0|
我有下面的函数可以工作,但它非常慢(我需要计算230万条记录,需要几个小时)。可以使用向量/滞后函数等更有效地重写它吗?

RateofChange1<- function(n){
      nmax=nrow(df)
      newvarname<-paste("ROC",n,sep="")
      for (i in (n+1):nmax){
        l=0
        for (k in 1:n){
          if (df$NAME[i]==df$NAME[i-k]){l=k}
        }
        if (l>0){df[i,newvarname]<<-(df$PRICE[i]-df$PRICE[i-l])/df$PRICE[i-l]}
      }
    }
x4shl7ld

x4shl7ld1#

我认为您可以在每组NAME中尝试以下操作

PRICE / PRICE[pmax(1, seq_along(PRICE - n)] - 1

例如,对于data.table

library(data.table)

k <- 1:5
setDT(df)[
  ,
  paste0("ROC", k) := lapply(
    k,
    function(n) PRICE / PRICE[pmax(1, 1:.N - n)] - 1
  ),
  NAME
]

你最终会得到

> df
   NAME PRICE       ROC1       ROC2       ROC3       ROC4       ROC5
1:    A     2  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
2:    A     3  0.5000000  0.5000000  0.5000000  0.5000000  0.5000000
3:    A     4  0.3333333  1.0000000  1.0000000  1.0000000  1.0000000
4:    A     4  0.0000000  0.3333333  1.0000000  1.0000000  1.0000000
5:    B     7  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000
6:    B     6 -0.1428571 -0.1428571 -0.1428571 -0.1428571 -0.1428571
7:    C     3  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000

数据

> dput(df)
structure(list(NAME = c("A", "A", "A", "A", "B", "B", "C"), PRICE = c(2L,
3L, 4L, 4L, 7L, 6L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
jm81lzqq

jm81lzqq2#

你可以利用跑步者和崩溃来获得更快的速度。
如果没有n的限制,你可以简单地使用collapse::flag(),这将比下面的例子快得多。

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(collapse)
#> collapse 1.9.3, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
library(runner)
library(tidyr)
df <- tibble(NAME = c(rep("A", 4), "B", "B", "C"),
             PRICE = c(2, 3, 4, 4, 7, 6, 3))

# Window vector based on
# k - window size
# n length of vector
# partial window or not?
window_seq <- function (k, n, partial = TRUE){
  stopifnot(length(k) == 1L, length(n) == 1L)
  if (n > .Machine[["integer.max"]]) 
    stop("n must not be greater than .Machine$integer.max")
  n <- as.integer(n)
  k <- as.integer(min(k, n))
  k <- max(k, 0L)
  partial_len <- max(min(k - 1L, n), 0L)
  other_len <- max(0L, n - partial_len)
  if (partial) {
    c(seq_len(partial_len), rep_len(k, other_len))
  }
  else {
    c(rep_len(NA_integer_, partial_len), rep_len(k, other_len))
  }
}

df %>%
  mutate(n1 = 1,
         n2 = 2) %>%
  mutate(g = as.integer(collapse::group(NAME))) %>% # GROUP ID
  mutate(grpn = collapse::GRPN(g)) %>% # GROUP SIZES
  # Using runner we utilise partial windows
  mutate(n1 = window_seq(k = n1[[1]] + 1, 
                         n = grpn[[1]],
                          partial = TRUE),
         n2 = window_seq(k = n2[[1]] + 1, 
                         n = grpn[[1]],
                         partial = TRUE),
         roc1 = (PRICE/runner::lag_run(PRICE, lag = n1 - 1)) - 1,
         roc2 = (PRICE/runner::lag_run(PRICE, lag = n2 - 1)) - 1,
         .by = g)
#> # A tibble: 7 × 8
#>   NAME  PRICE    n1    n2     g  grpn   roc1   roc2
#>   <chr> <dbl> <int> <int> <int> <int>  <dbl>  <dbl>
#> 1 A         2     1     1     1     4  0      0    
#> 2 A         3     2     2     1     4  0.5    0.5  
#> 3 A         4     2     3     1     4  0.333  1    
#> 4 A         4     2     3     1     4  0      0.333
#> 5 B         7     1     1     2     2  0      0    
#> 6 B         6     2     2     2     2 -0.143 -0.143
#> 7 C         3     1     1     3     1  0      0

创建于2023-03-28带有reprex v2.0.2

相关问题