I am working with the R programming language.
I have the following data on medical patients:
my_data = data.frame(
  id = c(1, 2, 3),
  status_2017 = c("alive", "alive", "alive"),
  status_2018 = c("alive", "dead", "alive"),
  status_2019 = c("alive", "dead", "dead"),
  height_2017 = rnorm(3, 3, 3),
  height_2018 = rnorm(3, 3, 3),
  height_2019 = rnorm(3, 3, 3),
  weight_2017 = rnorm(3, 3, 3),
  weight_2018 = rnorm(3, 3, 3),
  weight_2019 = rnorm(3, 3, 3)
)
cols <- colnames(my_data)
ix <- my_data[, startsWith(cols, "status")] == "dead"
my_data[, startsWith(cols, "height")][ ix ] <- NA
my_data[, startsWith(cols, "weight")][ ix ] <- NA
The data looks like this:
id status_2017 status_2018 status_2019 height_2017 height_2018 height_2019 weight_2017 weight_2018 weight_2019
1 1 alive alive alive 3.7276706 4.524869 -1.648458 -1.702781 7.755581 3.369895
2 2 alive dead dead 0.7539518 NA NA 1.060408 NA NA
3 3 alive alive dead 6.6213771 2.122374 NA 5.114120 1.851467 NA
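**My question:** I would like to restructure this data so that:
- Each patient has one row per year
- There is a "year" column
- status_2017, status_2018 and status_2019 are combined into a single column (i.e. "status")
- height_2017, height_2018 and height_2019 are combined into a single column (i.e. "height")
- weight_2017, weight_2018 and weight_2019 are combined into a single column (i.e. "weight")
- A new variable ("new_var") is created: if a patient id has a row for 2019, new_var is always 0; for every other patient id, new_var is 0 until that patient's maximum year, where new_var is 1
I tried to do this with the following code: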
library(dplyr)
library(tidyr)
my_data_long <- na.omit(my_data %>%
  pivot_longer(cols = -c(id, status_2017),
               names_to = c(".value", "year"),
               names_pattern = "(height|weight)_(\\d{4})") %>%
  arrange(id, year))
final = my_data_long %>%
  group_by(id) %>%
  mutate(
    new_var = ifelse(any(year == "2019"), 0, 1),
    max_year = max(year)
  ) %>%
  ungroup() %>%
  mutate(
    new_var = ifelse(year == max_year & new_var == 1, 1, 0),
    max_year = NULL
  )
The final result looks like this:
> final
# A tibble: 6 x 6
id status_2017 year height weight new_var
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 alive 2017 2.39 2.27 0
2 1 alive 2018 -0.541 1.63 0
3 1 alive 2019 -1.93 10.1 0
4 2 alive 2017 4.18 -3.35 1
5 3 alive 2017 -1.35 7.12 0
6 3 alive 2018 1.42 1.70 1
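My end goal is to restructure this dataset so that I can fit a time-varying survival analysis model (e.g. Cox PH) to it (e.g. https://atm.amegroups.com/article/view/18820/html, https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf)
**Can someone tell me whether I have done this correctly?**
Thanks!
Note: I also tried adding a time interval (start and end) for each id.
It looks like this: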
library(stringr)
final %>%
  group_by(id) %>%
  mutate(start = 0:(n() - 1),
         end = 1:n()) %>%
  ungroup()
# A tibble: 6 x 8
id status_2017 year height weight new_var start end
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <int> <int>
1 1 alive 2017 2.39 2.27 0 0 1
2 1 alive 2018 -0.541 1.63 0 1 2
3 1 alive 2019 -1.93 10.1 0 2 3
4 2 alive 2017 4.18 -3.35 1 0 1
5 3 alive 2017 -1.35 7.12 0 0 1
6 3 alive 2018 1.42 1.70 1 1 2
1 Answer
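If we need the status column, the status columns must also be included in the pivot to long format. As written, cols = -c(id, status_2017) removes "status_2017" from the reshaping, so it is carried along as an id column instead of being stacked. In addition, names_pattern needs to include status alongside height and weight; otherwise status_2018 and status_2019 do not match the pattern at all.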
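A minimal sketch of that change, assuming every column except id should be pivoted, might look like the following. The fixed numeric values stand in for the rnorm() draws in the question and are purely illustrative. Using names_transform to turn year into an integer is also an assumption, not something the question requires. The filter() step plays the role of na.omit() in the original attempt, and the new_var step simply restates the rule from the question rather than endorsing it as the right event coding for a survival model.

```r
library(dplyr)
library(tidyr)

# Illustrative data with fixed values (the question uses rnorm()).
# Measurements are already NA in the years where the patient is dead.
my_data <- data.frame(
  id          = c(1, 2, 3),
  status_2017 = c("alive", "alive", "alive"),
  status_2018 = c("alive", "dead",  "alive"),
  status_2019 = c("alive", "dead",  "dead"),
  height_2017 = c(3.7, 0.8, 6.6),
  height_2018 = c(4.5, NA,  2.1),
  height_2019 = c(-1.6, NA, NA),
  weight_2017 = c(-1.7, 1.1, 5.1),
  weight_2018 = c(7.8, NA,  1.9),
  weight_2019 = c(3.4, NA,  NA),
  stringsAsFactors = FALSE
)

# Pivot everything except id; "status" is now part of names_pattern,
# so status, height and weight each become one column with one row per year.
my_data_long <- my_data %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "year"),
    names_pattern = "(status|height|weight)_(\\d{4})",
    names_transform = list(year = as.integer)
  ) %>%
  arrange(id, year)

# Keep only the years in which the patient is alive, then apply the
# new_var rule from the question: 1 on the last observed year of a
# patient who has no 2019 row, 0 everywhere else.
final <- my_data_long %>%
  filter(status == "alive") %>%
  group_by(id) %>%
  mutate(new_var = as.integer((!any(year == 2019)) & (year == max(year)))) %>%
  ungroup()

print(final)
```

With this layout, my_data_long has one row per patient per year, and final keeps the same rows as the original attempt while also carrying a single status column.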