Group and merge rows by ID in R when one row's end date equals another row's start date

epfja78i, posted on 2023-06-19 in Other

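I have a dataset with ID, Start_date, and End_date fields. For some records in this dataset, the end date matches the start date of another row with the same ID. For example, one record for an ID might run from 2002-01-20 to 2002-01-22 and another from 2002-01-22 to 2002-01-23. I would like to merge such records so that the ID has a single record whose start date comes from the first record (2002-01-20) and whose end date comes from the second (2002-01-23).
Example of the original data: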

structure(list(
  ID = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
  Start_date = structure(c(12846, 19172, 19341, 19495, 19497, 19495, 14194, 14204, 18786), class = "Date"),
  End_date = structure(c(12847, 19174, 19347, 19497, 19499, 19497, 14203, 14206, 18798), class = "Date")),
  row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))

Example of the desired result:

structure(list(
  ID = c(565, 898, 521, 522, 323, 887, 887, 522),
  Start_date = structure(c(12846, 19172, 19341, 19495, 19495, 14194, 14204, 18786), class = "Date"),
  End_date = structure(c(12847, 19174, 19347, 19499, 19497, 14203, 14206, 18798), class = "Date")),
  row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

There seem to be plenty of SQL solutions to this problem on Stack Overflow, but I have not been able to find one that uses R. Thank you.


Answer #1 (wmomyfyw)

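First, we can use inner_join to join the dataset to itself, keeping only the rows that share an ID and where one row's End_date equals the other row's Start_date: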

library(tidyverse)

# Original data from the question
dat <- structure(list(
  ID = c(565, 898, 521, 522, 522, 323, 887, 887, 522),
  Start_date = structure(c(12846, 19172, 19341, 19495, 19497, 19495, 14194, 14204, 18786), class = "Date"),
  End_date = structure(c(12847, 19174, 19347, 19497, 19499, 19497, 14203, 14206, 18798), class = "Date")),
  row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
# Desired result from the question
processed <- structure(list(
  ID = c(565, 898, 521, 522, 323, 887, 887, 522),
  Start_date = structure(c(12846, 19172, 19341, 19495, 19495, 14194, 14204, 18786), class = "Date"),
  End_date = structure(c(12847, 19174, 19347, 19499, 19497, 14203, 14206, 18798), class = "Date")),
  row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))

dat |>
  inner_join(dat,
            by = c("ID", "End_date" = "Start_date"))
#> # A tibble: 1 × 4
#>      ID Start_date End_date   End_date.y
#>   <dbl> <date>     <date>     <date>    
#> 1   522 2023-05-18 2023-05-20 2023-05-22

We then rename the columns so that only the correct End_date (the one from the second record) is kept:

dat |>
  inner_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)

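Next, we do the same thing with anti_join to remove those rows from the original data. Because two rows have to go (one matched on its end date, the other on its start date), we need to apply it twice: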

dat |>
  anti_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date"))
#> # A tibble: 7 × 3
#>      ID Start_date End_date  
#>   <dbl> <date>     <date>    
#> 1   565 2005-03-04 2005-03-05
#> 2   898 2022-06-29 2022-07-01
#> 3   521 2022-12-15 2022-12-21
#> 4   323 2023-05-18 2023-05-20
#> 5   887 2008-11-11 2008-11-20
#> 6   887 2008-11-21 2008-11-23
#> 7   522 2021-06-08 2021-06-20
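
To see why two passes are needed, here is a minimal sketch that applies each anti_join separately. The tibble `toy` and the names `after_first` and `after_second` are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Hypothetical toy data: ID 522 has a chained pair, ID 323 stands alone
toy <- tibble(
  ID         = c(522, 522, 323),
  Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-18")),
  End_date   = as.Date(c("2023-05-20", "2023-05-22", "2023-05-20"))
)

# Pass 1 drops the earlier half of the pair: rows whose End_date is
# the Start_date of another row with the same ID
after_first <- toy |>
  anti_join(toy, by = c("ID", "End_date" = "Start_date"))

# Pass 2 drops the later half: rows whose Start_date is the End_date
# of another row with the same ID
after_second <- after_first |>
  anti_join(toy, by = c("ID", "Start_date" = "End_date"))

nrow(after_first)   # 2: the second 522 row and the 323 row remain
nrow(after_second)  # 1: only the 323 row remains
```

Note that the 323 row survives even though its dates coincide with the first 522 row, because ID is part of the join key.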

That is all we need; we can now combine the two results:

combined <- dat |>
  inner_join(dat,
            by = c("ID", "End_date" = "Start_date")) |>
  select(ID, Start_date, End_date = End_date.y)

res <- dat |>
  anti_join(dat,
             by = c("ID", "End_date" = "Start_date")) |>
  anti_join(dat,
            by = c("ID", "Start_date" = "End_date")) |>
  bind_rows(combined)

all.equal(arrange(res, ID, Start_date),
          arrange(processed, ID, Start_date))
#> [1] TRUE

Answer #2 (ijxebb2r)

library(dplyr)

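# df is the data frame from the question
df |> 
  # ifelse() drops the Date class, hence the as.Date() wrapper
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, max(End_date), End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)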

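How it works

1. Within each ID (via .by = ID), we find the rows whose End_date is also the Start_date of another row.
2. Where that condition holds, we set that row's End_date to the maximum End_date within the ID. max() on dates returns the chronologically latest one. After this step, the same End_date appears in two rows for that ID.
3. Finally, distinct() drops the duplicate, keeping the first row for each unique combination of ID and End_date.
Note: .by is experimental and new as of dplyr 1.1.0 (check with packageVersion("dplyr")). If you are on an older version of dplyr, you can do the following instead: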

df |>
  group_by(ID) |>
  mutate(...) |> # ... same code as above
  ungroup() |>
  distinct(...) # ... same code as above

Output

     ID Start_date End_date  
  <dbl> <date>     <date>    
1   565 2005-03-04 2005-03-05
2   898 2022-06-29 2022-07-01
3   521 2022-12-15 2022-12-21
4   522 2023-05-18 2023-05-22
5   323 2023-05-18 2023-05-20
6   887 2008-11-11 2008-11-20
7   887 2008-11-21 2008-11-23
8   522 2021-06-08 2021-06-20
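
The three steps under "How it works" can also be traced one at a time. The sketch below is only an illustration: the tibble `toy` and the helper column `chained` are hypothetical, it uses group_by() so that it also runs on dplyr versions before 1.1.0, and it swaps in dplyr's if_else(), which keeps the Date class, for as.Date(ifelse(...)).

```r
library(dplyr)

# Hypothetical toy data: one ID with a chained pair and one unrelated record
toy <- tibble(
  ID         = c(1, 1, 1),
  Start_date = as.Date(c("2023-05-18", "2023-05-20", "2021-06-08")),
  End_date   = as.Date(c("2023-05-20", "2023-05-22", "2021-06-20"))
)

# Step 1: flag rows whose End_date is also a Start_date within the same ID
flagged <- toy |>
  group_by(ID) |>
  mutate(chained = End_date %in% Start_date)

# Step 2: flagged rows take the latest End_date of their ID
extended <- flagged |>
  mutate(End_date = if_else(chained, max(End_date), End_date)) |>
  ungroup()

# Step 3: keep the first row for each (ID, End_date) combination
merged <- extended |>
  distinct(ID, End_date, .keep_all = TRUE) |>
  select(-chained)

merged  # two rows: 2023-05-18 to 2023-05-22, and 2021-06-08 to 2021-06-20
```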

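One caveat is the following scenario, where an ID has a matching Start_date/End_date pair but also an additional, non-matching row with more recent dates (the third row):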

df2 <- data.frame(ID = '522',
           Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-25")),
           End_date = as.Date(c("2023-05-20", "2023-05-22", "2023-05-28")))

In that case, you can do the following instead:

df2 |> 
  # Take the End_date of the linked row rather than the overall maximum
  mutate(End_date = as.Date(ifelse(End_date %in% Start_date, End_date[which(Start_date %in% End_date)], End_date)),
         .by = ID) |>
  distinct(ID, End_date, .keep_all = TRUE)
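
To make the caveat concrete, the sketch below runs both variants on df2 side by side. The result names `with_max` and `with_link` are made up here. As an assumption for portability, it uses group_by() instead of .by and dplyr's if_else() instead of as.Date(ifelse(...)); the logic is otherwise the same as in the two snippets above.

```r
library(dplyr)

df2 <- data.frame(
  ID         = "522",
  Start_date = as.Date(c("2023-05-18", "2023-05-20", "2023-05-25")),
  End_date   = as.Date(c("2023-05-20", "2023-05-22", "2023-05-28"))
)

# First variant: the chained row jumps to the overall maximum (2023-05-28),
# so distinct() then discards the unrelated third record
with_max <- df2 |>
  group_by(ID) |>
  mutate(End_date = if_else(End_date %in% Start_date, max(End_date), End_date)) |>
  ungroup() |>
  distinct(ID, End_date, .keep_all = TRUE)

# Revised variant: the chained row takes the End_date of the row it links to
with_link <- df2 |>
  group_by(ID) |>
  mutate(End_date = if_else(End_date %in% Start_date,
                            End_date[which(Start_date %in% End_date)],
                            End_date)) |>
  ungroup() |>
  distinct(ID, End_date, .keep_all = TRUE)

with_max   # 2023-05-18 to 2023-05-28 and 2023-05-20 to 2023-05-22 (not wanted)
with_link  # 2023-05-18 to 2023-05-22 and 2023-05-25 to 2023-05-28
```

The revised variant relies on there being exactly one linked row per ID; with several links in one ID, the indexing expression returns more than one date.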

Answer #3 (xqnpmsa8)

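I ended up solving this with the code below, which accounts for some nuances in my data: (1) I wanted to keep the other variables from the original dataset that belong to each record; (2) there can be three or four records that I want to chain together; and (3) IDs can and do repeat in my dataset, where some repeats should be merged and others should stay separate because they do not meet my criterion (identical end/start dates).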

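library(data.table) # for setDT(), used in the deduplication step
library(dplyr)      # loaded second so that dplyr's first() and last() are used

# Sort so that lag() compares each record with the previous one of the same ID
raw_data <- raw_data %>%
  arrange(ID, START_DATE)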

raw_data <- raw_data %>%
  group_by(ID) %>%
  # TRUE when this record starts on the day the previous record of the ID ended
  mutate(Combined = !is.na(lag(END_DATE)) & START_DATE == lag(END_DATE)) %>%
  ungroup()

multiples <- raw_data %>%  
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() > 1) %>%
  ungroup()

singles <- raw_data %>%  
  mutate(Group = cumsum(!Combined)) %>%
  group_by(Group) %>%
  filter(n() == 1) %>%
  ungroup() %>%
  select(-Combined)

multiples <- multiples %>%  
  group_by(ID, Group) %>%
  summarize(START_DATE = first(START_DATE),
            END_DATE = last(END_DATE)) %>%
  ungroup()

raw_data <- raw_data %>%
  select(-START_DATE, -END_DATE, -Combined)

#This step is so I can reintroduce the additional variables of interest back into my dataset, after I combined the records I wanted combined.
multiples <- multiples %>%
  left_join(raw_data, by = "ID")

#This deduplicates my records after the left join, leaving only a single record for each
multiples <- setDT(multiples)[order(ID, Group), .SD[1], by=list(ID, Group)]

final_dataset <- rbind(multiples, singles)
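
For readers who only need the core idea, here is a condensed sketch of the same lag/cumsum technique on toy data. It is not the author's exact pipeline: the tibble `toy`, the column `note` (standing in for the extra variables), and the helper columns `continues` and `chain` are all hypothetical, and the extra variables are simply taken from the first record of each chain.

```r
library(dplyr)

# Hypothetical toy data: ID 522 has a chain of three records plus a separate
# earlier record; ID 887 has a single record
toy <- tibble(
  ID         = c(522, 522, 522, 522, 887),
  START_DATE = as.Date(c("2023-05-18", "2023-05-20", "2023-05-22",
                         "2021-06-08", "2008-11-11")),
  END_DATE   = as.Date(c("2023-05-20", "2023-05-22", "2023-05-25",
                         "2021-06-20", "2008-11-20")),
  note       = c("a", "b", "c", "d", "e")
)

merged <- toy |>
  arrange(ID, START_DATE) |>
  group_by(ID) |>
  mutate(
    # TRUE when this row starts exactly where the previous row of the ID ended
    continues = coalesce(START_DATE == lag(END_DATE), FALSE),
    # a new chain begins at every row that does not continue the previous one
    chain = cumsum(!continues)
  ) |>
  group_by(ID, chain) |>
  summarise(
    START_DATE = first(START_DATE),
    END_DATE   = last(END_DATE),
    note       = first(note),  # keep the first record's extra variables
    .groups    = "drop"
  ) |>
  select(-chain)

merged  # three rows; the 522 chain becomes 2023-05-18 to 2023-05-25
```

Because the chain counter increments only when a record fails to continue the previous one, chains of any length collapse into one row, while repeated IDs with unrelated dates stay separate.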
