R: Count the number of overlapping date intervals per group

Posted by xcitsw88 on 2023-04-03 in Other

I have the following data frame df (dput below):

> df
   group       from         to
1      A 2023-03-01 2023-03-02
2      A 2023-03-01 2023-03-03
3      A 2023-03-03 2023-03-07
4      A 2023-03-05 2023-03-08
5      A 2023-03-09 2023-03-10
6      A 2023-03-11 2023-03-11
7      B 2023-03-01 2023-03-02
8      B 2023-03-04 2023-03-06
9      B 2023-03-07 2023-03-07
10     B 2023-03-08 2023-03-11
11     B 2023-03-10 2023-03-12
12     B 2023-03-15 2023-03-16

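I would like to count the number of overlapping date intervals per group, based on the from and to columns. In group A, rows 1 and 2 overlap, and row 3 overlaps with both row 2 and row 4, so group A has 3 overlapping pairs in total. In group B, only rows 10 and 11 overlap. So I would like to get the following output: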

  group overlaying_intervals
1     A                    3
2     B                    1

So I was wondering if anyone knows how to count the number of overlapping date intervals per group?
dput of df:

df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B", "B"), from = c("2023-03-01", "2023-03-01", "2023-03-03", 
"2023-03-05", "2023-03-09", "2023-03-11", "2023-03-01", "2023-03-04", 
"2023-03-07", "2023-03-08", "2023-03-10", "2023-03-15"), to = c("2023-03-02", 
"2023-03-03", "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-11", 
"2023-03-02", "2023-03-06", "2023-03-07", "2023-03-11", "2023-03-12", 
"2023-03-16")), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

8cdiaqws #1

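It feels like there should be a more elegant way to do this, but my first inclination is to count, for each interval, how many intervals in its group it overlaps, subtract 1 for the overlap with itself, and then halve the group total, because every pairwise overlap is counted twice.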

library(lubridate)
library(dplyr)
library(purrr)

df %>%
  group_by(group) %>%
  mutate(int = interval(from, to),
         # count overlapping intervals, subtracting overlap with self
         overlays = (map_int(int, ~sum(int_overlaps(.x, int))))-1) %>%
  # divide total by 2 since each pairwise overlap is counted twice
  summarize(overlaying_intervals = sum(overlays)/2)
#> # A tibble: 2 × 2
#>   group overlaying_intervals
#>   <chr>                <dbl>
#> 1 A                        3
#> 2 B                        1

Created on 2023-03-31 with reprex v2.0.2
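
The same pairwise idea can be checked without any packages. This is a minimal sketch, assuming both end dates are inclusive: two closed intervals overlap exactly when each one starts on or before the other ends. The helper name `count_overlapping_pairs` and the small `toy` data frame are made up for this illustration (the rows are copied from the question).

```r
# Hypothetical helper: count overlapping pairs among closed date intervals.
count_overlapping_pairs <- function(from, to) {
  from <- as.numeric(as.Date(from))
  to <- as.numeric(as.Date(to))
  # starts_before_end[i, j] is TRUE when interval i starts on or before interval j ends
  starts_before_end <- outer(from, to, `<=`)
  # intervals i and j overlap when the condition holds in both directions
  ov <- starts_before_end & t(starts_before_end)
  # keep each unordered pair once (i < j), which also drops self-overlaps
  sum(ov[upper.tri(ov)])
}

toy <- data.frame(
  group = rep(c("A", "B"), each = 6),
  from = c("2023-03-01", "2023-03-01", "2023-03-03", "2023-03-05", "2023-03-09", "2023-03-11",
           "2023-03-01", "2023-03-04", "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-15"),
  to = c("2023-03-02", "2023-03-03", "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-11",
         "2023-03-02", "2023-03-06", "2023-03-07", "2023-03-11", "2023-03-12", "2023-03-16")
)

# apply the helper to each group separately
res <- sapply(split(toy, toy$group), function(g) count_overlapping_pairs(g$from, g$to))
```

Counting only the pairs above the diagonal avoids the "subtract 1, divide by 2" bookkeeping, at the cost of building an n-by-n matrix per group.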


oalqel3c #2

A base R approach.

by(df, df$group, \(x){
  dc <- c("from", "to")
  x[dc] <- lapply(x[dc], \(x) as.numeric(as.Date(x)))
  U <- apply(x[dc], 1, \(z) z[1]:z[2])
  outer(U, U, Vectorize(\(x, y) length(intersect(x, y)) > 0)) |> `diag<-`(0) |> sum() |> base::`/`(2)
}) |> as.table() |> as.data.frame()
#   df.group Freq
# 1        A    3
# 2        B    1

Created by hand on 2023-03-31
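
One thing to watch in the code above: `apply()` simplifies its result to a matrix or vector when every interval in a group spans the same number of days, so `U` is then no longer a list of day sequences. A hedged variant of the same day-enumeration idea that always builds a list with `Map()`; the function name `count_by_days` is hypothetical.

```r
# Hypothetical variant of the day-enumeration idea that always yields a list.
count_by_days <- function(from, to) {
  from <- as.numeric(as.Date(from))
  to <- as.numeric(as.Date(to))
  n <- length(from)
  if (n < 2L) return(0L)
  # one day sequence per interval; Map() never simplifies to a matrix
  days <- Map(seq, from, to)
  total <- 0L
  # visit each unordered pair (i, j) with i < j exactly once
  for (i in seq_len(n - 1L)) {
    for (j in (i + 1L):n) {
      if (length(intersect(days[[i]], days[[j]])) > 0) total <- total + 1L
    }
  }
  total
}

# three intervals of equal length (two days each); only the first two share a day
equal_len <- count_by_days(
  c("2023-03-01", "2023-03-02", "2023-03-05"),
  c("2023-03-02", "2023-03-03", "2023-03-06")
)
```

Enumerating every day is simple but grows with the length of the intervals, so this is best suited to short date ranges like the ones in the question.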


to94eoyn #3

Here is a data.table option using foverlaps:

library(data.table)

# convert df to a data.table by reference
setDT(df)
rev(
  stack(
    lapply(
      split(
        setkey(df[, lapply(.SD, as.IDate), group], from, to),
        by = "group"
      ),
      function(x) {
        foverlaps(x, x, which = TRUE)[xid < yid, .N]
      }
    )
  )
)

which gives

  ind values
1   A      3
2   B      1

iugsix8n #4

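I think @Seth has the right idea, but you can build on it by computing all of the overlaps with ivs::iv_count_overlaps(), which will be more efficient than iterating row by row.
ivs is a package purpose-built for working with intervals, so it is a good fit here.
The main thing to know about ivs is that its intervals are half-open, i.e. [ ), so you need to add 1 to the to date.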

library(dplyr, warn.conflicts = FALSE)
library(ivs)

df <- tibble::tribble(
  ~group, ~from, ~to,
  "A", "2023-03-01", "2023-03-02",
  "A", "2023-03-01", "2023-03-03",
  "A", "2023-03-03", "2023-03-07",
  "A", "2023-03-05", "2023-03-08",
  "A", "2023-03-09", "2023-03-10",
  "A", "2023-03-11", "2023-03-11",
  "B", "2023-03-01", "2023-03-02",
  "B", "2023-03-04", "2023-03-06",
  "B", "2023-03-07", "2023-03-07",
  "B", "2023-03-08", "2023-03-11",
  "B", "2023-03-10", "2023-03-12",
  "B", "2023-03-15", "2023-03-16"
)

df <- df %>%
  mutate(from = as.Date(from), to = as.Date(to)) %>%
  mutate(range = iv(from, to + 1L), .keep = "unused")

df
#> # A tibble: 12 × 2
#>    group                    range
#>    <chr>               <iv<date>>
#>  1 A     [2023-03-01, 2023-03-03)
#>  2 A     [2023-03-01, 2023-03-04)
#>  3 A     [2023-03-03, 2023-03-08)
#>  4 A     [2023-03-05, 2023-03-09)
#>  5 A     [2023-03-09, 2023-03-11)
#>  6 A     [2023-03-11, 2023-03-12)
#>  7 B     [2023-03-01, 2023-03-03)
#>  8 B     [2023-03-04, 2023-03-07)
#>  9 B     [2023-03-07, 2023-03-08)
#> 10 B     [2023-03-08, 2023-03-12)
#> 11 B     [2023-03-10, 2023-03-13)
#> 12 B     [2023-03-15, 2023-03-17)

# Count all overlaps, then:
# - Subtract 1 for self-overlaps
# - Divide by 2 to get rid of doubly counted pairwise overlaps
df %>%
  mutate(count = iv_count_overlaps(range, range), .by = group) %>%
  mutate(count = count - 1L) %>%
  summarise(count = sum(count) / 2, .by = group)
#> # A tibble: 2 × 2
#>   group count
#>   <chr> <dbl>
#> 1 A         3
#> 2 B         1
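
To see why the + 1 on `to` matters, consider rows 2 and 3 of group A, which share only the day 2023-03-03. Below is a small base R sketch of the half-open overlap test; it is an illustration of the convention, not the ivs implementation, and `overlaps_half_open` is a made-up name.

```r
# Hypothetical helper: do the half-open intervals [s1, e1) and [s2, e2) overlap?
overlaps_half_open <- function(s1, e1, s2, e2) {
  s1 < e2 & s2 < e1
}

a_from <- as.Date("2023-03-01"); a_to <- as.Date("2023-03-03")
b_from <- as.Date("2023-03-03"); b_to <- as.Date("2023-03-07")

# Treating the inclusive end dates as exclusive misses the shared day
without_shift <- overlaps_half_open(a_from, a_to, b_from, b_to)

# Shifting each end by one day restores the inclusive meaning
with_shift <- overlaps_half_open(a_from, a_to + 1L, b_from, b_to + 1L)
```

Here `without_shift` is `FALSE` and `with_shift` is `TRUE`, which is why the answer builds its ranges with `iv(from, to + 1L)`.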
