I am working with the R programming language.
I have the following dataset of patients' medical characteristics and disease prevalence:
set.seed(123)
library(dplyr)
Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)
status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status <- as.factor(status )
Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
################
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)
  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes
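Based on this dataset, I am trying to calculate the disease proportion within "nested groups", that is:
- First, select all males
- Then, select all male citizens
- Then, from the set of all male citizens, find the group with the shortest 20% of heights
- Then, within the male citizens in the shortest 20% of heights, further isolate the group with the lowest 20% of weights
- Finally, within the male citizens who are in the shortest 20% of heights and, among those, in the lowest 20% of weights, further isolate the group with the lowest 20% of hospital visits: this will be the first group
- Repeat this process for all possible combinations of groups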
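The steps above, for the first group only, can be sketched as a chain of filters. This is a minimal illustration rather than the full solution: it rebuilds `my_data` with the same column names as above, and the names `first_group` and `first_group_summary` are made up for this example. Each `ntile()` call is evaluated only on the rows that survived the previous filter, which is what makes the groups nested.

```r
library(dplyr)

# Rebuild the example data (same columns and distributions as above)
set.seed(123)
n <- 5000
my_data <- data.frame(
  Patient_ID = 1:n,
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  Status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  Height = rnorm(n, 150, 10),
  Weight = rnorm(n, 90, 10),
  Hospital_Visits = sample.int(20, n, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

# Each filter() ranks only the rows left over from the previous step
first_group <- my_data %>%
  filter(Gender == "Male", Status == "Citizen") %>%  # male citizens
  filter(ntile(Height, 5) == 1) %>%                  # shortest 20% of male citizens
  filter(ntile(Weight, 5) == 1) %>%                  # lightest 20% of those
  filter(ntile(Hospital_Visits, 5) == 1)             # fewest visits among those

# Disease proportion in this one nested group
first_group_summary <- first_group %>%
  summarise(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))

print(first_group_summary)
```

Repeating this by hand for every combination would be tedious, which is why a grouped pipeline is the more practical route.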
**Part 1:** Using the `.add = TRUE` argument in dplyr, I think I can accomplish this as follows:
nested_combinations <- my_data %>%
  group_by(Gender) %>%
  group_by(Status, .add = TRUE) %>%
  # quintiles of height within each Gender/Status group
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  # quintiles of weight within each height group
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  # quintiles of hospital visits within each weight group
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarize(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))
# results
   Gender Status  height_group weight_group visits_group total_count disease_count disease_proportion
   <fct>  <fct>          <int>        <int>        <int>       <int>         <int>              <dbl>
 1 Female Citizen            1            1            1          16             5              0.312
 2 Female Citizen            1            1            2          16             4              0.25
 3 Female Citizen            1            1            3          16             7              0.438
 4 Female Citizen            1            1            4          15             4              0.267
 5 Female Citizen            1            1            5          15             8              0.533
 6 Female Citizen            1            2            1          16             5              0.312
 7 Female Citizen            1            2            2          16             4              0.25
 8 Female Citizen            1            2            3          16             8              0.5
 9 Female Citizen            1            2            4          15             6              0.4
10 Female Citizen            1            2            5          15             6              0.4
**Part 2:** Next, I calculated the "ranges" (i.e. the min and max) of each group:
table_data <- data.frame(
  Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
  Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)
# results
   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          4
2 Group 2   141.4965   147.4422   81.85064   87.45406          4          8
3 Group 3   147.4487   152.3924   87.45935   92.72041          8         12
4 Group 4   152.4016   158.5178   92.72941   98.54624         12         17
5 Group 5   158.5187   188.4777   98.55533  121.02420         17         20
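**My question:** Is there a way to insert the min/max ranges of the different variables from Part 2 as new columns into Part 1 (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?
Currently, I am using a series of `ifelse` statements to do this, but that does not seem very efficient. Can anyone show me a better way?
Thanks!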
1 Answer
Something like this?
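One possible approach, sketched under the assumption that the ranges should describe the nested groups from Part 1 rather than the whole-dataset quintiles from Part 2 (the two generally differ, because each nested `ntile()` only ranks rows within its parent group). The idea is to record the min and max at each nesting level with `mutate()` right after that level becomes a grouping variable, then carry those values through `summarise()`. The column names follow the ones requested in the question; the data is rebuilt here only so the example runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(123)
n <- 5000
my_data <- data.frame(
  Patient_ID = 1:n,
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  Status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  Height = rnorm(n, 150, 10),
  Weight = rnorm(n, 90, 10),
  Hospital_Visits = sample.int(20, n, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

nested_combinations <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  # now grouped by Gender/Status/height_group: record that group's height range
  mutate(min_height = min(Height),
         max_height = max(Height),
         weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  # now also grouped by weight_group: record that group's weight range
  mutate(min_weight = min(Weight),
         max_weight = max(Weight),
         visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarise(
    total_count = n(),
    disease_count = sum(Disease == "Yes"),
    disease_proportion = mean(Disease == "Yes"),
    # the height/weight ranges are constant within each final group
    min_height = first(min_height),
    max_height = first(max_height),
    min_weight = first(min_weight),
    max_weight = first(max_weight),
    # the innermost level can be summarised directly
    min_visits = min(Hospital_Visits),
    max_visits = max(Hospital_Visits),
    .groups = "drop"
  )

print(head(nested_combinations))
```

This avoids the `ifelse` chain entirely, since every range is computed in the same grouped context that defines the group. Note that `ntile()` splits purely by rank, so tied values (common for `Hospital_Visits`) can land in adjacent groups and the visit ranges of neighbouring groups may overlap.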