在R中如何从一个 Dataframe 的多个列中识别出特定范围内的日期?

ahy6op9u  于 2023-01-06  发布在  其他
关注(0)|答案(2)|浏览(101)

我有一个非常大的 Dataframe ,包含超过300个日期列(且〉100,000行)。每行还包含一个“索引日期”。对于每行,我想确定是否有任何后续日期列包含索引日期6个月内的日期。然后,我需要一个单独列的输出,它告诉我该行的哪些日期列包含索引日期6个月内的日期。恐怕我还没有真正尝试过任何有意义的东西,因为我不知道从哪里开始,但我提供了一个简单的例子,我正在努力实现以下。我将非常感谢任何想法,请!
Dataframe 示例

library(dplyr)

example <- data.frame(
'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>% 

# now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date

 mutate(start_range = (index_date - 182.6),
         end_range = (index_date + 182.6))

这将生成以下 Dataframe

index_date     date_1     date_2     date_3     date_4 start_range  end_range
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01  2005-07-02 2006-07-02
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02

现在,我想要做的是创建另一个名为matching_date的列,对于每一行,它告诉我哪些日期列在日期范围内。(date_3date_4),因此,理想情况下,我还需要合并一个函数来标识最接近索引日期的日期(在本例中,它将是date_3)。
期望输出

index_date     date_1     date_2     date_3     date_4 start_range  end_range matching_date
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2006-03-01  2005-07-02 2006-07-02        date_1
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02        date_2
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01            NA
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02            NA
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02        date_3
e4eetjau

e4eetjau1#

使用dplyr的方法。

  • 查找落入该范围的行。
  • 获取日期到范围中点的最小距离(index_date)。
  • 将匹配范围IsMinindex_date之间的最小距离与关联列名匹配。
library(dplyr)

example %>% 
  rowwise() %>% 
  mutate(Is = any(across(date_1:date_4, ~ .x >= start_range & .x <= end_range)), 
         Min = which.min(across(date_1:date_4, ~ abs(.x - index_date))), 
         matching_date = if_else(Is, 
           unlist(list(colnames(across(date_1:date_4))))[Min], NA_character_), 
         Is = NULL, Min = NULL) %>% 
  ungroup() %>% 
  print(Inf)
# A tibble: 5 × 8
  index_date date_1     date_2     date_3     date_4     start_range end_range 
  <date>     <date>     <date>     <date>     <date>     <date>      <date>    
1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2013-03-01 2005-07-02  2006-07-02
2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01 2006-07-02  2007-07-02
3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2007-07-02  2008-07-01
4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01 2008-07-02  2009-07-02
5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01 2009-07-02  2010-07-02
  matching_date
  <chr>        
1 date_1       
2 date_2       
3 NA           
4 NA           
5 date_3
数据
example <- structure(list(index_date = structure(c(13149, 13514, 13879, 
14245, 14610), class = "Date"), date_1 = structure(c(13180, 12418, 
12418, 12418, 12418), class = "Date"), date_2 = structure(c(15006, 
13483, 12784, 12784, 12784), class = "Date"), date_3 = structure(c(15371, 
15675, 15340, 15340, 14669), class = "Date"), date_4 = structure(c(15765, 
15675, 15340, 15340, 14761), class = "Date"), start_range = structure(c(12966.4, 
13331.4, 13696.4, 14062.4, 14427.4), class = "Date"), end_range = structure(c(13331.6, 
13696.6, 14061.6, 14427.6, 14792.6), class = "Date")), class = "data.frame", row.names = c(NA, 
-5L))
cnjp1d6j

cnjp1d6j2#

假设您的记录是由index_date唯一标识的,您可以这样做:

library(tidyverse)
example <- data.frame(
'index_date' = as.Date(c('2006/01/01', '2007/01/01', '2008/01/01', '2009/01/01', '2010/01/01')),
'date_1' = as.Date(c('2006/02/01', '2004/01/01', '2004/01/01', '2004/01/01', '2004/01/01')),
'date_2' = as.Date(c('2011/02/01', '2006/12/01', '2005/01/01', '2005/01/01', '2005/01/01')),
'date_3' = as.Date(c('2012/02/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/03/01')),
'date_4' = as.Date(c('2013/03/01', '2012/12/01', '2012/01/01', '2012/01/01', '2010/06/01'))) %>% 

# now create the `date_range` columns for each row. `start_range` is 6 months prior to the index date (182.6 days) and `end_range` is 6 months (182.6 days) after the index date

 mutate(start_range = (index_date - 182.6),
         end_range = (index_date + 182.6))

example_long <- 
  example |> 
  pivot_longer(
    cols=starts_with("date"),
    names_to="vars",
    values_to="dates") |> 
  mutate(diff = dates - index_date) |> 
  rowwise() |> 
  mutate(matching_date = ifelse(between(dates, start_range, end_range), vars, NA)) |> 
  filter(!is.na(matching_date)) |> 
  group_by(index_date) |> 
  mutate(tie = which(diff == min(diff))) |> 
  filter(tie == row_number()) |> 
  select(index_date, matching_date)

left_join(example, example_long)
#> Joining, by = "index_date"
#>   index_date     date_1     date_2     date_3     date_4 start_range  end_range
#> 1 2006-01-01 2006-02-01 2011-02-01 2012-02-01 2013-03-01  2005-07-02 2006-07-02
#> 2 2007-01-01 2004-01-01 2006-12-01 2012-12-01 2012-12-01  2006-07-02 2007-07-02
#> 3 2008-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2007-07-02 2008-07-01
#> 4 2009-01-01 2004-01-01 2005-01-01 2012-01-01 2012-01-01  2008-07-02 2009-07-02
#> 5 2010-01-01 2004-01-01 2005-01-01 2010-03-01 2010-06-01  2009-07-02 2010-07-02
#>   matching_date
#> 1        date_1
#> 2        date_2
#> 3          <NA>
#> 4          <NA>
#> 5        date_3

创建于2023年1月5日,使用reprex v2.0.2

相关问题