R: filtering a time series by multiple date ranges

sg24os4d  posted on 2022-12-25  in Other

Suppose I have a time-series data frame like this:

         date value1
1  2021-10-12  1.015
2  2021-10-13     NA
3  2021-10-14     NA
4  2021-10-15  1.015
5  2021-10-16  1.015
6  2021-10-17  1.015
7  2021-10-18  1.015
8  2021-10-19  1.015
9  2021-10-20  1.015
10 2021-10-21  1.015
11 2021-10-22  1.015
12 2021-10-23  1.015

df1 <- structure(list(date = structure(c(18912, 18913, 18914, 18915, 
                                       18916, 18917, 18918, 18919, 
                                       18920, 18921, 18922, 18923), class = "Date"), 
                       value1 = c(1.015, NA, NA, 1.015, 1.015, 1.015, 1.015, 1.015, 
                                  1.015, 1.015, 1.015, 1.015)), 
                  row.names = c(NA, -12L), class = "data.frame")

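I want to filter this dataset so that it keeps only the rows whose dates fall within the date ranges stored in another data frame, for example: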

Start      End
2021-10-12 2021-10-14
2021-10-16 2021-10-18
2021-10-22 2021-10-23

dtr <- structure(list(Start = structure(c(18912, 18916, 18922), class = "Date"), 
                      End = structure(c(18914, 18918, 18923), class = "Date")), 
                 class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))

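If I wanted to do this by hand, I could write a case_when and use between (or something similar) to test each range, but there should be a way to go through all the ranges and filter with a vectorized solution instead of spelling out each range.
Here is the case_when approach: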

library(dplyr)

df1 %>%
  filter(case_when(
    # between() takes the lower and upper bound as two separate arguments
    between(date, dtr$Start[1], dtr$End[1]) ~ TRUE,
    between(date, dtr$Start[2], dtr$End[2]) ~ TRUE,
    between(date, dtr$Start[3], dtr$End[3]) ~ TRUE,
    TRUE ~ FALSE
  ))

##         date value1
## 1 2021-10-12  1.015
## 2 2021-10-13     NA
## 3 2021-10-14     NA
## 4 2021-10-16  1.015
## 5 2021-10-17  1.015
## 6 2021-10-18  1.015
## 7 2021-10-22  1.015
## 8 2021-10-23  1.015

How can I vectorize this filtering?


rdlzhqv9 1#

We can loop over the rows of dtr, apply filter for each range, and then combine the results with bind_rows:

library(dplyr)

bind_rows(lapply(1:nrow(dtr), 
                 function(i) filter(df1, between(date, dtr$Start[i], dtr$End[i]))))

##         date value1
## 1 2021-10-12  1.015
## 2 2021-10-13     NA
## 3 2021-10-14     NA
## 4 2021-10-16  1.015
## 5 2021-10-17  1.015
## 6 2021-10-18  1.015
## 7 2021-10-22  1.015
## 8 2021-10-23  1.015
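Note that if the ranges in dtr overlapped, the loop above would return a row once for every range it matches. As a minimal base R sketch of a loop-free alternative, one logical mask can be built by comparing every date with every range via outer(). The names in_range, keep and result are illustrative, and the example data is rebuilt here only so that the block runs on its own.

```r
# Rebuild the example data from the question (same values as df1 and dtr).
df1 <- data.frame(
  date   = as.Date("2021-10-12") + 0:11,
  value1 = c(1.015, NA, NA, rep(1.015, 9))
)
dtr <- data.frame(
  Start = as.Date(c("2021-10-12", "2021-10-16", "2021-10-22")),
  End   = as.Date(c("2021-10-14", "2021-10-18", "2021-10-23"))
)

# in_range[i, j] is TRUE when date i lies inside range j (bounds included).
in_range <- outer(df1$date, dtr$Start, `>=`) & outer(df1$date, dtr$End, `<=`)

# Keep a row if it falls inside at least one range; each row appears at most once.
keep <- rowSums(in_range) > 0
result <- df1[keep, ]
print(result)
```

The mask has one cell per (date, range) pair, so this is meant for a moderate number of ranges rather than very large inputs.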

data.table has an %inrange% function, which should be really efficient for this case; see Convenience functions for range subsets.

library(data.table)
setDT(df1)[date %inrange% setDT(dtr)]
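For comparison, a similar range lookup can be sketched in base R with findInterval(). This is not how %inrange% is implemented; it is a swapped-in technique that assumes the ranges are sorted by Start and do not overlap, which holds for the dtr in the question. The names idx, keep and result are illustrative, and the data is rebuilt so that the block is self-contained.

```r
# Rebuild the example data from the question (same values as df1 and dtr).
df1 <- data.frame(
  date   = as.Date("2021-10-12") + 0:11,
  value1 = c(1.015, NA, NA, rep(1.015, 9))
)
dtr <- data.frame(
  Start = as.Date(c("2021-10-12", "2021-10-16", "2021-10-22")),
  End   = as.Date(c("2021-10-14", "2021-10-18", "2021-10-23"))
)

# For each date: index of the last range whose Start is <= that date
# (0 if the date is earlier than every Start).
idx <- findInterval(as.numeric(df1$date), as.numeric(dtr$Start))

# A date is kept when such a range exists and the date does not exceed its End.
# pmax() only guards the subscript against index 0; idx > 0 decides the result.
keep <- idx > 0 & df1$date <= dtr$End[pmax(idx, 1L)]
result <- df1[keep, ]
print(result)
```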

gorkyyrv 2#

In the devel version of dplyr (since released on CRAN as dplyr 1.1.0), we can use join_by:

library(dplyr)
inner_join(df1, dtr, join_by(between(date, Start, End))) %>% 
   select(names(df1))
Output:
        date value1
1 2021-10-12  1.015
2 2021-10-13     NA
3 2021-10-14     NA
4 2021-10-16  1.015
5 2021-10-17  1.015
6 2021-10-18  1.015
7 2021-10-22  1.015
8 2021-10-23  1.015

Or use pmap/map2:

library(purrr)
pmap_dfr(dtr, ~ filter(df1, between(date, .x, .y)))
        date value1
1 2021-10-12  1.015
2 2021-10-13     NA
3 2021-10-14     NA
4 2021-10-16  1.015
5 2021-10-17  1.015
6 2021-10-18  1.015
7 2021-10-22  1.015
8 2021-10-23  1.015

i5desfxk 3#

Here is another option using fuzzyjoin (also a join); what I like about it is the match_fun argument:

library(dplyr)
library(fuzzyjoin)
df1 %>%
  fuzzy_inner_join(y = dtr,
                   by = c("date" = "Start", "date" = "End"),
                   match_fun = list(`>=`, `<=`))
        date value1      Start        End
1 2021-10-12  1.015 2021-10-12 2021-10-14
2 2021-10-13     NA 2021-10-12 2021-10-14
3 2021-10-14     NA 2021-10-12 2021-10-14
4 2021-10-16  1.015 2021-10-16 2021-10-18
5 2021-10-17  1.015 2021-10-16 2021-10-18
6 2021-10-18  1.015 2021-10-16 2021-10-18
7 2021-10-22  1.015 2021-10-22 2021-10-23
8 2021-10-23  1.015 2021-10-22 2021-10-23
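Conceptually, the range joins in this thread keep every pairing of a df1 row with a dtr row for which Start <= date <= End. The following is a minimal base R sketch of that idea and not how dplyr or fuzzyjoin implement it: it builds the full Cartesian product with merge(by = NULL) and then filters it. The names all_pairs, joined and result are illustrative, and the data is rebuilt so that the block runs on its own.

```r
# Rebuild the example data from the question (same values as df1 and dtr).
df1 <- data.frame(
  date   = as.Date("2021-10-12") + 0:11,
  value1 = c(1.015, NA, NA, rep(1.015, 9))
)
dtr <- data.frame(
  Start = as.Date(c("2021-10-12", "2021-10-16", "2021-10-22")),
  End   = as.Date(c("2021-10-14", "2021-10-18", "2021-10-23"))
)

# Cartesian product: every row of df1 paired with every row of dtr.
all_pairs <- merge(df1, dtr, by = NULL)

# Keep only the pairs where the date lies inside the range (bounds included).
joined <- all_pairs[all_pairs$date >= all_pairs$Start &
                      all_pairs$date <= all_pairs$End, ]
joined <- joined[order(joined$date), ]

# Drop the range columns to get back the original columns of df1.
result <- joined[, names(df1)]
print(result)
```

The intermediate product has nrow(df1) * nrow(dtr) rows, so this sketch is for illustration on small data; the join functions shown in the answers are the better fit for larger inputs.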
