R: Restructuring data for survival analysis

tzcvj98z  posted on 2023-02-20  in Other

I am working with the R programming language.
I have the following data on medical patients:

my_data = data.frame(id = c(1,2,3), status_2017 = c("alive", "alive", "alive"), status_2018 = c("alive", "dead", "alive"), status_2019 = c("alive", "dead", "dead"), height_2017 = rnorm(3,3,3), height_2018 = rnorm(3,3,3), 
                     height_2019 = rnorm(3,3,3) , weight_2017  = rnorm(3,3,3), weight_2018 = rnorm(3,3,3), weight_2019 = rnorm(3,3,3))

cols <- colnames(my_data)
ix <- my_data[, startsWith(cols, "status")] == "dead"

my_data[, startsWith(cols, "height")][ ix ] <- NA
my_data[, startsWith(cols, "weight")][ ix ] <- NA

It looks like this:

id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
1  1       alive       alive       alive   3.7276706    4.524869   -1.648458   -1.702781    7.755581    3.369895
2  2       alive        dead        dead   0.7539518          NA          NA    1.060408          NA          NA
3  3       alive       alive        dead   6.6213771    2.122374          NA    5.114120    1.851467          NA
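**My question:** I would like to restructure this data so that:
  • Each patient has their own row for each year
  • There is a "year" column
  • status_2017, status_2018, and status_2019 are combined into a single column (i.e. "status")
  • height_2017, height_2018, and height_2019 are combined into a single column (i.e. "height")
  • weight_2017, weight_2018, and weight_2019 are combined into a single column (i.e. "weight")
  • A new variable ("new_var") is created: if a patient id has a row for 2019, new_var is always 0 for that patient; for every other patient id, new_var is 0 until that patient's maximum year, where new_var is 1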

I tried to do this with the following code:

library(dplyr)
library(tidyr)

my_data_long <- na.omit(my_data %>%
    pivot_longer(cols = -c(id, status_2017),
                 names_to = c(".value", "year"),
                 names_pattern = "(height|weight)_(\\d{4})") %>%
    arrange(id, year))

final = my_data_long  %>%
  group_by(id) %>%
  mutate(
    new_var = ifelse(any(year == "2019"), 0, 1),
    max_year = max(year)
  ) %>%
  ungroup() %>%
  mutate(
    new_var = ifelse(year == max_year & new_var == 1, 1, 0),
    max_year = NULL
  )

The final result looks like this:

> final
# A tibble: 6 x 6
     id status_2017 year  height weight new_var
  <dbl> <chr>       <chr>  <dbl>  <dbl>   <dbl>
1     1 alive       2017   2.39    2.27       0
2     1 alive       2018  -0.541   1.63       0
3     1 alive       2019  -1.93   10.1        0
4     2 alive       2017   4.18   -3.35       1
5     3 alive       2017  -1.35    7.12       0
6     3 alive       2018   1.42    1.70       1

My ultimate goal is to restructure this dataset so that I can fit a time-varying survival model (e.g. Cox PH) to it (see e.g. https://atm.amegroups.com/article/view/18820/html and https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf).

**Can someone tell me whether I have done this correctly?**

Thanks!

Note: I also tried adding time intervals for each ID.

It looks like this:

library(stringr)

final %>%
  group_by(id) %>%
  mutate(start = 0:(n() - 1),
         end = 1:n()) %>%
  ungroup()

# A tibble: 6 x 8
     id status_2017 year  height weight new_var start   end
  <dbl> <chr>       <chr>  <dbl>  <dbl>   <dbl> <int> <int>
1     1 alive       2017   2.39    2.27       0     0     1
2     1 alive       2018  -0.541   1.63       0     1     2
3     1 alive       2019  -1.93   10.1        0     2     3
4     2 alive       2017   4.18   -3.35       1     0     1
5     3 alive       2017  -1.35    7.12       0     0     1
6     3 alive       2018   1.42    1.70       1     1     2
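For context, the start/end layout above is the counting-process form that `survival::coxph()` accepts through `Surv(start, end, event)`. The following is only a rough sketch of such a fit: the data frame `sim` is simulated and hypothetical (three patients are far too few to fit anything), and it assumes that `new_var` is meant to serve as the event indicator, which the post does not state explicitly.

```r
library(survival)

set.seed(1)
n <- 200
# Hypothetical simulated cohort: each patient contributes 1 to 3 yearly intervals.
n_int <- sample(1:3, n, replace = TRUE)
sim <- data.frame(id = rep(seq_len(n), times = n_int))

# Counting-process intervals (0,1], (1,2], (2,3] within each patient.
sim$start <- ave(sim$id, sim$id, FUN = seq_along) - 1
sim$end   <- sim$start + 1

# Time-varying covariates, drawn the same way as in the question.
sim$height <- rnorm(nrow(sim), 3, 3)
sim$weight <- rnorm(nrow(sim), 3, 3)

# Assumed event indicator: 1 only on the last interval of patients
# who are not observed through the third interval.
last_row    <- sim$end == n_int[sim$id]
sim$new_var <- as.integer(last_row & n_int[sim$id] < 3)

# Cox model with time-varying covariates in (start, end] form.
fit <- coxph(Surv(start, end, new_var) ~ height + weight, data = sim)
summary(fit)
```

Because the covariates here are pure noise, the estimated coefficients carry no meaning; the point is only the shape of the data and of the `Surv()` call.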
wydwbb8l


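If we need the status column, the status columns have to be included in the pivot to long format as well; `cols = -c(id, status_2017)` removes "status_2017" from the reshaping. In addition, `names_pattern` needs to include `status` alongside `height` and `weight`.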
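To see what the two capture groups pick up, here is a small base-R sketch that applies both patterns to a few of the column names. It only illustrates the regular expression itself, not tidyr's internals; the variable names `old_pattern`, `new_pattern`, `value_part`, and `year_part` are made up for this example.

```r
# A few column names as they appear in my_data (excluding id).
cols <- c("status_2017", "height_2018", "weight_2019")

# Pattern from the question: only height and weight are recognised.
old_pattern <- "(height|weight)_(\\d{4})"
# Pattern with status added.
new_pattern <- "(height|weight|status)_(\\d{4})"

grepl(old_pattern, cols)  # the status column does not match
grepl(new_pattern, cols)  # all three columns match

# The first capture group is what ".value" uses as the output column name,
# the second capture group is what ends up in "year".
value_part <- sub(new_pattern, "\\1", cols)
year_part  <- sub(new_pattern, "\\2", cols)
```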

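library(dplyr) # version >= 1.1.0 (needed for the .by argument)
library(tidyr)

my_data %>%
  # reshape status, height, and weight together; the year comes from the column suffix
  pivot_longer(cols = -id,
               names_to = c(".value", "year"),
               names_pattern = "(height|weight|status)_(\\d{4})") %>%
  # drop rows with missing values (the rows where the patient is recorded as dead)
  drop_na() %>%
  # per patient: is there a 2019 row, and what is the last observed year?
  mutate(new_var = +(2019 %in% year), max_year = max(year), .by = "id") %>%
  # keep the flag only on the last observed year, then drop the helper column
  mutate(new_var = +(year == max_year & new_var), max_year = NULL)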
Output:
# A tibble: 6 × 6
     id year  status height weight new_var
  <dbl> <chr> <chr>   <dbl>  <dbl>   <int>
1     1 2017  alive   9.54   7.47        0
2     1 2018  alive   6.49   5.23        0
3     1 2019  alive   3.75   1.93        1
4     2 2017  alive   4.21   0.619       0
5     3 2017  alive   1.97   5.32        0
6     3 2018  alive  -0.406  8.00        0
