R: Complete missing, non-overlapping date ranges, with truncation, variable start/end dates, and an indicator variable

Posted by bmp9r5qi on 2023-06-19 in Other

My question is a variant of "Complete missing dates based on start and end" and "Complete missing, non-overlapping date ranges". Here is my starting data:

have <- data.frame(ID = c(1, 1),
                 date1 = as.Date(c("1999-09-01", "2000-01-29")),
                 date2 = as.Date(c("2000-01-15", "2001-04-30")),
                 start = as.Date(c("2000-01-01", "2000-01-01")),
                 end = as.Date(c("2001-07-31", "2001-07-31")),
                 ind = c(1,1))

> have
  ID      date1      date2      start        end ind
1  1 1999-09-01 2000-01-15 2000-01-01 2001-07-31   1
2  1 2000-01-29 2001-04-30 2000-01-01 2001-07-31   1

The output I want looks like this:

want <- data.frame(ID = c(1, 1, 1, 1),
                 date1 = as.Date(c("2000-01-01", "2000-01-16", "2000-01-29", "2001-05-01")),
                 date2 = as.Date(c("2000-01-15", "2000-01-28", "2001-04-30", "2001-07-31")),
                 ind = c(1,0,1,0))

> want
  ID      date1      date2 ind
1  1 2000-01-01 2000-01-15   1
2  1 2000-01-16 2000-01-28   0
3  1 2000-01-29 2001-04-30   1
4  1 2001-05-01 2001-07-31   0

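What I want to do is:
1. Add rows for the date intervals that are missing between the floor date and the ceiling date (my previous post).
2. Use floor and ceiling dates (start and end) that vary by group ID.
3. Exclude the part of any date range that falls before the given floor date.
4. Create an indicator variable ind that tracks whether a date range was present in the starting data (time during which the characteristic was observed = 1) or not (time during which it was not observed = 0).

So far I have this code from the linked post, but the output is not quite what I want: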

library(dplyr)
ex <- have %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have) %>%
  arrange(ID, date1)

> ex
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 1999-09-01 2000-01-15 2000-01-01 2001-07-31     1
2     1 2000-01-01 2000-01-28 NA         NA            NA
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA

Thank you.

Update: recoding the intervals so that they fall within the floor and ceiling dates is fairly simple:
have2 <- have %>%
  group_by(ID) %>%
  mutate(date1new = ifelse(date1<start & date2>start,start,date1),
         date2new = ifelse(date2>end & date1<end, end, date2),
         date1new = as.Date(date1new, origin = "1970-01-01"),
         date2new = as.Date(date2new, origin = "1970-01-01")) %>%
  select(ID, (start:date2new)) %>%
  rename(date1 = date1new, date2 = date2new) %>%
  ungroup()

> have2
# A tibble: 2 × 6
     ID start      end          ind date1      date2     
  <dbl> <date>     <date>     <dbl> <date>     <date>    
1     1 2000-01-01 2001-07-31     1 2000-01-01 2000-01-15
2     1 2000-01-01 2001-07-31     1 2000-01-29 2001-04-30

I still cannot get the desired output with the example script.

ex2 <- have2 %>%
  group_by(ID) %>%
  summarize(newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)) %>%
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 > date1) %>%
  ungroup() %>%
  bind_rows(have2) %>%
  arrange(ID, date1)

> ex2
# A tibble: 5 × 6
     ID date1      date2      start      end          ind
  <dbl> <date>     <date>     <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-28 NA         NA            NA
2     1 2000-01-01 2000-01-15 2000-01-01 2001-07-31     1
3     1 2000-01-16 2001-07-31 NA         NA            NA
4     1 2000-01-29 2001-04-30 2000-01-01 2001-07-31     1
5     1 2001-05-01 2001-07-31 NA         NA            NA
Answer #1 (xlpyo6sf)

You were almost there. The main mistake is passing start and end as whole columns rather than as single values:

newdate1 = c(start, date2+1), newdate2 = c(date1-1, end)

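Written this way, c() prepends the entire start column (one copy per row in the group), not a single value. That is why your output has extra rows. You want something like start[1] instead.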
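To make the difference concrete, here is a small base R sketch using the two rows from the question. The variable names n_column and n_value are only for illustration.

```r
# Inside a group, `start` is a column: one copy of the floor date per row.
start <- as.Date(c("2000-01-01", "2000-01-01"))
date2 <- as.Date(c("2000-01-15", "2001-04-30"))

# Passing the whole column prepends every copy of the floor date.
n_column <- length(c(start, date2 + 1))

# Passing a single value prepends the floor date exactly once.
n_value <- length(c(start[1], date2 + 1))

print(c(whole_column = n_column, single_value = n_value))
```

With two observed intervals, the single-value form yields three candidate gap starts (the floor plus the day after each interval), while the whole-column form yields four, which is where the surplus row comes from.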
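Beyond that, we can make a few small improvements:
1. Create the ind = 0 column inside summarize;
2. Make your mutate more compact;
3. Use a filter to drop rows whose two dates both lie before start or both lie after end;
4. Deselect the start and end columns before bind_rows().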

have2 <- have %>%
  group_by(ID) %>%
  filter(!((date1 < start & date2 < start) | (date1 > end & date2 > end))) %>% # Item 3
  mutate(date1 = ifelse(date1<start,start,date1) %>% as.Date(origin = "1970-01-01"),
         date2 = ifelse(date2>end, end, date2)  %>% as.Date(origin = "1970-01-01")) # Item 2
# You don't necessarily need to ungroup, as we want grouped data for the summarize
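A note on negating a compound condition in a filter like the one for Item 3: in R, `!` binds more tightly than `&` and `|`, so the whole disjunction needs its own parentheses for the negation to cover both halves. The vectors a and b below are made-up stand-ins for the two conditions.

```r
a <- c(TRUE, FALSE, FALSE)   # stand-in for "interval lies entirely before the floor"
b <- c(FALSE, TRUE, FALSE)   # stand-in for "interval lies entirely after the ceiling"

unparenthesised <- !a | b    # parsed as (!a) | b
intended        <- !(a | b)  # TRUE only for rows that are neither a nor b

print(rbind(unparenthesised, intended))
```

The two results differ for the second element: without the extra parentheses, a row that lies entirely after the ceiling would be kept rather than dropped.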

have2 %>%
  summarize(newdate1 = c(start[1], date2+1),
            newdate2 = c(date1-1, end[1]),
            ind = 0) %>% # Item 1
  rename(date1 = newdate1, date2 = newdate2) %>%
  filter(date2 >= date1) %>% # >= keeps gaps that are a single day long
  ungroup() %>%
  bind_rows(select(have2, -c(start, end))) %>% # Item 4
  arrange(ID, date1)
Result:
# A tibble: 4 × 4
     ID date1      date2        ind
  <dbl> <date>     <date>     <dbl>
1     1 2000-01-01 2000-01-15     1
2     1 2000-01-16 2000-01-28     0
3     1 2000-01-29 2001-04-30     1
4     1 2001-05-01 2001-07-31     0
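As a cross-check on the logic, the same steps can be sketched in base R for a single group. This is not the answer's code: fill_gaps is a hypothetical helper, it assumes the intervals within a group do not overlap, and it uses pmax/pmin for the clipping instead of ifelse.

```r
# Hypothetical helper: fill the gaps for ONE group, base R only.
# Assumes non-overlapping intervals and scalar start/end dates.
fill_gaps <- function(date1, date2, start, end) {
  # Drop intervals lying entirely outside [start, end], then clip the rest.
  keep  <- date2 >= start & date1 <= end
  date1 <- pmax(date1[keep], start)
  date2 <- pmin(date2[keep], end)
  ord   <- order(date1)
  date1 <- date1[ord]
  date2 <- date2[ord]

  # Candidate gaps run from the day after one interval ends
  # to the day before the next one begins.
  gap1   <- c(start, date2 + 1)
  gap2   <- c(date1 - 1, end)
  is_gap <- gap2 >= gap1   # discard empty (negative-length) candidates

  out <- data.frame(
    date1 = c(date1, gap1[is_gap]),
    date2 = c(date2, gap2[is_gap]),
    ind   = c(rep(1, length(date1)), rep(0, sum(is_gap)))
  )
  out <- out[order(out$date1), ]
  rownames(out) <- NULL
  out
}

# Usage with the data from the question (ID 1 only).
res <- fill_gaps(
  date1 = as.Date(c("1999-09-01", "2000-01-29")),
  date2 = as.Date(c("2000-01-15", "2001-04-30")),
  start = as.Date("2000-01-01"),
  end   = as.Date("2001-07-31")
)
print(res)
```

For several IDs, one option would be to split the data by ID and call the helper once per group. The dplyr pipeline above does that grouping for you.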
Alternative mutate:

If you know your data is always sorted by increasing date, you only need to change the first value of date1 and the last value of date2:

mutate(date1 = c(if(date1[1] < start[1]) start[1] else date1[1], date1[-1]),
       date2 = c(date2[-n()], if(date2[n()] > end[1]) end[1] else date2[n()]))

This avoids unnecessary checks and does not lose the Date class, so no as.Date() conversion is needed afterwards.
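The point about the date format can be seen in a short base R sketch. floor_date and d are made-up values, and pmax is shown only as one possible class-preserving alternative, not something the answer uses.

```r
floor_date <- as.Date("2000-01-01")                  # made-up floor date
d          <- as.Date(c("1999-09-01", "2000-01-29"))

# ifelse() strips attributes, so the result is plain numbers (days since 1970-01-01).
via_ifelse <- ifelse(d < floor_date, floor_date, d)

# pmax() clips to the floor and keeps the Date class.
via_pmax <- pmax(d, floor_date)

print(class(via_ifelse))
print(via_pmax)
```

This is why the ifelse-based mutate has to pipe its result back through as.Date(origin = "1970-01-01").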
