Creating time-interval periods within each group from a numeric time format in R

Posted by 8fq7wneg on 2022-12-24 in Other

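I want to create time-interval periods for each group; my time is stored in a numeric format. Say I want 1-hour intervals starting from the first record: every record within the first hour is interval 1, any record at least 1 hour but less than 2 hours after the first record is interval 2, and so on (for each user group).
Technically, I am looking to create one-hour bins starting from the first record.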

df<-read.table(text="
user     timestart
1        1421286975
1        1421287343
1        1421470513
1        1421470513
1        1421471816
1        1421806839
2        1424217068
2        1424217150
2        1424218395",header=TRUE,stringsAsFactors = FALSE)

# Expected result (might not be 100% accurate, but you get the point):
user    timestart    interval_1h
1       1421286975     1
1       1421287343     1
1       1421470513     2
1       1421470513     2
1       1421471816     2
1       1421806839     3
2       1424217068     1
2       1424217150     1
2       1424218395     1

9udxz4iz1#

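To me, this question can be read in two ways. Below are solutions for both readings; we use dplyr to get the desired output:
1. The first interpretation produces output similar to the one shown, but it contradicts your actual question: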

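library(dplyr)

df %>% 
  # numeric epoch seconds -> date-time (POSIXct is the class suited to data frame columns)
  mutate(time = as.POSIXct(timestart, origin = "1970-01-01")) %>% 
  group_by(user) %>% 
  # start a new group whenever the gap to the previous record is >= 1 hour
  mutate(grp = cumsum(coalesce(difftime(time, lag(time), units = "hours") >= 1, TRUE))) %>% 
  group_by(user, grp) %>% 
  # flag records that lie >= 1 hour after the first record of their group
  mutate(grp2 = difftime(time, first(time), units = "hours") >= 1) %>% 
  group_by(user) %>% 
  # shift the group number up by one from each flagged record onward;
  # .keep = "unused" drops the helper column grp2
  mutate(grp = grp + cumsum(grp2), .keep = "unused") %>% 
  ungroup()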

This returns

# A tibble: 10 x 4
    user  timestart time                  grp
   <int>      <int> <dttm>              <int>
 1     1 1421286975 2015-01-15 02:56:15     1
 2     1 1421287343 2015-01-15 03:02:23     1
 3     1 1421470513 2015-01-17 05:55:13     2
 4     1 1421470513 2015-01-17 05:55:13     2
 5     1 1421471816 2015-01-17 06:16:56     2
 6     1 1421475400 2015-01-17 07:16:40     3
 7     1 1421806839 2015-01-21 03:20:39     4
 8     2 1424217068 2015-02-18 00:51:08     1
 9     2 1424217150 2015-02-18 00:52:30     1
10     2 1424218395 2015-02-18 01:13:15     1

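2. The second approach takes each user's first timestart and creates 1-hour slots from it. Every subsequent timestamp is assigned to one of these slots, and the groups are built from the slots.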

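df %>% 
  group_by(user) %>% 
  mutate(time = as.POSIXct(timestart, origin = "1970-01-01"),
         # index of the 1-hour slot, counted from the user's first timestart
         helper = (timestart - first(timestart)) %/% 3600,
         # start a new group each time the slot index increases
         grp = cumsum(helper - lag(helper, default = 0) > 0) + 1) %>% 
  ungroup() %>% 
  select(-helper)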

This returns

# A tibble: 10 x 4
    user  timestart time                  grp
   <int>      <int> <dttm>              <dbl>
 1     1 1421286975 2015-01-15 02:56:15     1
 2     1 1421287343 2015-01-15 03:02:23     1
 3     1 1421470513 2015-01-17 05:55:13     2
 4     1 1421470513 2015-01-17 05:55:13     2
 5     1 1421471816 2015-01-17 06:16:56     3
 6     1 1421475400 2015-01-17 07:16:40     4
 7     1 1421806839 2015-01-21 03:20:39     5
 8     2 1424217068 2015-02-18 00:51:08     1
 9     2 1424217150 2015-02-18 00:52:30     1
10     2 1424218395 2015-02-18 01:13:15     1
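
The question's wording ("interval 2" means between 1 and 2 hours after the first record) and its expected output (consecutive interval numbers) differ slightly. As a minimal base-R sketch of that distinction, the following computes both the literal bin number and a consecutive renumbering of the non-empty bins. The column names `bin_abs` and `bin_dense` are hypothetical, and the sketch assumes the records are sorted by time within each user.

```r
# Sample data from this answer (user 1 includes the extra 1421475400 record)
df <- data.frame(
  user = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  timestart = c(1421286975L, 1421287343L, 1421470513L, 1421470513L,
                1421471816L, 1421475400L, 1421806839L,
                1424217068L, 1424217150L, 1424218395L)
)

# Seconds elapsed since each user's earliest record
elapsed <- df$timestart - ave(df$timestart, df$user, FUN = min)

# Literal bin number: 1 for [0 h, 1 h), 2 for [1 h, 2 h), ...
df$bin_abs <- elapsed %/% 3600 + 1

# Consecutive numbering of the non-empty bins, per user (order of appearance)
df$bin_dense <- ave(df$bin_abs, df$user,
                    FUN = function(b) match(b, unique(b)))

df$bin_abs    # 1 1 51 51 52 53 145 1 1 1
df$bin_dense  # 1 1 2 2 3 4 5 1 1 1
```

Here `bin_dense` reproduces the `grp` column shown above, while `bin_abs` is what a strictly literal reading of the question would give.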

Data

I added one data point to make the sample data more illustrative:

df <- structure(list(user = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L
), timestart = c(1421286975L, 1421287343L, 1421470513L, 1421470513L, 
1421471816L, 1421475400L, 1421806839L, 1424217068L, 1424217150L, 
1424218395L)), class = "data.frame", row.names = c(NA, -10L))
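
A note on the `time` column: `timestart` holds seconds since 1970-01-01 00:00:00 UTC, and the conversion prints in the session's local time zone unless `tz` is given. The sketch below pins the zone to UTC for the first record. The printed times in the outputs above are one hour later than this, which suggests (this is an inference, not something stated in the answer) that they were rendered in a UTC+1 zone. The grouping itself depends only on differences in seconds, so it is not affected.

```r
ts <- 1421286975  # first record of user 1, in seconds since the Unix epoch

# Pin the time zone so the printed value does not depend on the machine's settings
t_utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
format(t_utc, "%Y-%m-%d %H:%M:%S")  # "2015-01-15 01:56:15"

# The underlying number of seconds is unchanged by the conversion
as.numeric(t_utc) == ts  # TRUE
```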

t9aqgxwy2#

Consider a few helper columns built with multiple calls to ave:

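output <- within(
  df, {
    # numeric epoch seconds -> date-time
    timedt <- as.POSIXct(timestart, origin="1970-01-01")
    # each user's earliest record
    first <- ave(timedt, user, FUN=min)
    # hours since the user's first record, rounded to the nearest whole hour
    # (floor() would give strict one-hour bins instead)
    hour_diff <- round(as.numeric(difftime(timedt, first, units="hours")))

    # flag the first row of each (user, hour_diff) pair, then count the flags per user
    interval_1h <- ave(
      ifelse(ave(hour_diff, user, hour_diff, FUN=seq_along) == 1, 1, 0),
      user,
      FUN=cumsum
    )
    rm(timedt, first, hour_diff)
  }
)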

output
  user  timestart interval_1h
1    1 1421286975           1
2    1 1421287343           1
3    1 1421470513           2
4    1 1421470513           2
5    1 1421471816           2
6    1 1421806839           3
7    2 1424217068           1
8    2 1424217150           1
9    2 1424218395           1
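
Because this answer uses `round()`, a record is assigned to the nearest whole hour rather than to the hour bin it falls in. A small sketch with two of user 1's timestamps illustrates the difference; the variable names `first_ts`, `ts`, and `elapsed_hours` are made up for this illustration.

```r
first_ts <- 1421286975           # user 1's first record
ts <- c(1421470513, 1421471816)  # user 1's third and fifth records

# Hours elapsed since the first record: about 50.98 and 51.34
elapsed_hours <- (ts - first_ts) / 3600

round(elapsed_hours)  # 51 51 -> both records share one interval (nearest hour)
floor(elapsed_hours)  # 50 51 -> separate intervals (strict one-hour bins)
```

With `round()` the two records land in the same interval, which matches the expected output in the question. With `floor()` they would be split, as in the second approach of the first answer.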
