使用tidyverse按日期分组的大型R Dataframe 中的线性插值

bkkx9g8r  于 2023-06-03  发布在  其他
关注(0)|答案(3)|浏览(172)

我正在尝试在R Dataframe 中的值之间进行插值(线性或通过其他方法)。这些值在一个非常大的 Dataframe 中按date_index分组,并且插值是基于f_dates每天在值之间进行的。f_dates的数量会发生变化,因此可能需要为每个date_index对插值进行分组,然后将其解分组为更大的 Dataframe ?参见下面的示例。
我的代码尝试如下,但我似乎不能让分组和插值工作。

df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106",  "20230106", "20230106"),
"f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                 "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %>% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %>% 
  group_by(date_index) %>% 
  arrange(f_date)

date_first <- first(df$f_date)
date_last <- last(df$f_date)

ApproxFun <- approxfun(x = df$f_date, y = df$value)

date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)

LinearFit <- ApproxFun(date_seq)

# the interpolated dataframe
dfi <- data.frame(f_date = date_seq,  value = LinearFit)

我正在寻找的输出应该如下所示:

date_index  f_date  value
2023-01-04  2023-02-01  1.50
2023-01-04  2023-02-02  1.75
2023-01-04  2023-02-03  2.00
2023-01-04  2023-02-04  2.18
2023-01-04  2023-02-05  2.36
2023-01-04  2023-02-06  2.54
2023-01-04  2023-02-07  2.71
2023-01-04  2023-02-08  2.89
2023-01-04  2023-02-09  3.07
2023-01-04  2023-02-10  3.25
2023-01-06  2023-02-01  4.00
2023-01-06  2023-02-02  3.50
2023-01-06  2023-02-03  3.00
2023-01-06  2023-02-04  2.93
2023-01-06  2023-02-05  2.86
2023-01-06  2023-02-06  2.79
2023-01-06  2023-02-07  2.71
2023-01-06  2023-02-08  2.64
2023-01-06  2023-02-09  2.57
2023-01-06  2023-02-10  2.50
ukxgm1gy

ukxgm1gy1#

我不知道如何用一个“开箱即用”的功能来实现这一点,尽管我不怀疑这是可能的。首先,让我们做数据,然后我会解释解决方案。

library(dplyr)
library(lubridate)
df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106",  "20230106", "20230106"),
                 "f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                 "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %>% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

我写了一个函数,用一个刚刚确定的分段线性回归模型进行插值-它的参数与value变量的非缺失值一样多。然后,它使用该模型以value的观测值作为节点来生成预测。

interp_fun <- function(x,y){
  pwl <- function(x, k)sapply(k, function(K)ifelse(x >= K, x-K, 0))
  w <- which(!is.na(y))
  if(length(w) > 2){
    w <- w[-c(1, length(w))]
    kts <- x[w]
    mod <- lm(y ~ x + pwl(x, kts), na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }else{
    mod <- lm(y ~ x, na.action=na.exclude)
    ifelse(is.na(y), predict(mod, newdata=data.frame(x=x)), y)
  }
}

使用插值函数,您可以按date_index对数据进行分组,然后生成从f_date的最小值到最大值的新日期序列,然后将其合并回原始数据。您可以按date_index分组,然后将插值函数应用于f_datevalue。结果看起来好像是你想要的。无论在观测端点之间有多少“内部”观测值,这都应该有效。

df %>% 
  group_by(date_index) %>% 
  arrange(f_date, .by_group = TRUE) %>% 
  reframe(f_date = seq(from = min(f_date), 
                       to = max(f_date), 
                       by = "1 day")) %>% 
  left_join(df) %>%
  group_by(date_index) %>% 
  mutate(value = interp_fun(f_date, value))
#> Joining with `by = join_by(date_index, f_date)`
#> # A tibble: 20 × 3
#> # Groups:   date_index [2]
#>    date_index f_date     value
#>    <date>     <date>     <dbl>
#>  1 2023-01-04 2023-02-01  1.5 
#>  2 2023-01-04 2023-02-02  1.75
#>  3 2023-01-04 2023-02-03  2   
#>  4 2023-01-04 2023-02-04  2.18
#>  5 2023-01-04 2023-02-05  2.36
#>  6 2023-01-04 2023-02-06  2.54
#>  7 2023-01-04 2023-02-07  2.71
#>  8 2023-01-04 2023-02-08  2.89
#>  9 2023-01-04 2023-02-09  3.07
#> 10 2023-01-04 2023-02-10  3.25
#> 11 2023-01-06 2023-02-01  4   
#> 12 2023-01-06 2023-02-02  3.50
#> 13 2023-01-06 2023-02-03  3   
#> 14 2023-01-06 2023-02-04  2.93
#> 15 2023-01-06 2023-02-05  2.86
#> 16 2023-01-06 2023-02-06  2.79
#> 17 2023-01-06 2023-02-07  2.71
#> 18 2023-01-06 2023-02-08  2.64
#> 19 2023-01-06 2023-02-09  2.57
#> 20 2023-01-06 2023-02-10  2.5

创建于2023-06-01使用reprex v2.0.2

toe95027

toe950272#

下面是一个使用tidyr::complete()函数和approx()

library(dplyr)
library(tidyr)
library(lubridate)

df %>% group_by(date_index) %>% 
   complete(f_date=full_seq(f_date, 1)) %>% 
   mutate(value=approx(f_date[!is.na(.$value)], y=value[!is.na(.$value)], xout=f_date, method = "linear")$y)

# A tibble: 20 × 3
# Groups:   date_index [2]
  date_index f_date     value
      <date>     <date>     <dbl>
1 2023-01-04 2023-02-01  1.5 
2 2023-01-04 2023-02-02  1.75
3 2023-01-04 2023-02-03  2   
4 2023-01-04 2023-02-04  2.18
5 2023-01-04 2023-02-05  2.36
6 2023-01-04 2023-02-06  2.54
7 2023-01-04 2023-02-07  2.71
8 2023-01-04 2023-02-08  2.89
9 2023-01-04 2023-02-09  3.07
10 2023-01-04 2023-02-10  3.25
11 2023-01-06 2023-02-01  4   
12 2023-01-06 2023-02-02  3.5 
13 2023-01-06 2023-02-03  3   
14 2023-01-06 2023-02-04  2.93
15 2023-01-06 2023-02-05  2.86
16 2023-01-06 2023-02-06  2.79
17 2023-01-06 2023-02-07  2.71
18 2023-01-06 2023-02-08  2.64
19 2023-01-06 2023-02-09  2.57
20 2023-01-06 2023-02-10  2.5

完整的函数是一个方便的拥有,并将填补在给定的序列中的缺失值,在这种情况下,在1天的间隔。
一旦定义了缺失的日期,则只需使用approx函数来插值NA值。

b5lpy0ml

b5lpy0ml3#

以下是我使用summariseunnest的情况:

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
interpolator <- function(x, y) {
  date_first <- first(x)
  date_last <- last(x)
  interpolant <- approxfun(x = x, y = y)
  date_seq <- seq.Date(ymd(date_first), ymd(date_last), by = 1)
  tibble(
    f_date = date_seq, value = interpolant(date_seq)
  )
}

df <- data.frame("date_index"=c("20230104", "20230104", "20230104", "20230106",  "20230106", "20230106"),
                 "f_date"=c("20230201", "20230203", "20230210", "20230201", "20230203", "20230210"),
                 "value" = c(1.50, 2.00, 3.25, 4.00, 3.00, 2.50))  %>% 
  mutate(date_index=ymd(date_index), f_date=ymd(f_date))

df %>% 
  group_by(date_index) %>% 
  arrange(f_date) %>% 
  summarise(
    tmp = list(interpolator(f_date, value))
  ) %>% 
  unnest(tmp)
#> # A tibble: 20 × 3
#>    date_index f_date     value
#>    <date>     <date>     <dbl>
#>  1 2023-01-04 2023-02-01  1.5 
#>  2 2023-01-04 2023-02-02  1.75
#>  3 2023-01-04 2023-02-03  2   
#>  4 2023-01-04 2023-02-04  2.18
#>  5 2023-01-04 2023-02-05  2.36
#>  6 2023-01-04 2023-02-06  2.54
#>  7 2023-01-04 2023-02-07  2.71
#>  8 2023-01-04 2023-02-08  2.89
#>  9 2023-01-04 2023-02-09  3.07
#> 10 2023-01-04 2023-02-10  3.25
#> 11 2023-01-06 2023-02-01  4   
#> 12 2023-01-06 2023-02-02  3.5 
#> 13 2023-01-06 2023-02-03  3   
#> 14 2023-01-06 2023-02-04  2.93
#> 15 2023-01-06 2023-02-05  2.86
#> 16 2023-01-06 2023-02-06  2.79
#> 17 2023-01-06 2023-02-07  2.71
#> 18 2023-01-06 2023-02-08  2.64
#> 19 2023-01-06 2023-02-09  2.57
#> 20 2023-01-06 2023-02-10  2.5

创建于2023-06-01使用reprex v2.0.2

相关问题