R: Merging a lookup table with a data frame

Posted by cvxl0en2 on 2023-06-27 in Other

I am working with the R programming language.
I have the following dataset of patients' medical characteristics and disease prevalence:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  <- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes

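Using this dataset, I am trying to calculate the proportion of disease within "nested groups", i.e.:

  • First, select all males
  • Then, select all male citizens
  • Then, from all male citizens, find the shortest 20% by height
  • Then, among the male citizens in the shortest 20% by height, further isolate the lightest 20% by weight
  • Finally, among the male citizens who are in the shortest 20% by height and, within that, the lightest 20% by weight, further isolate the 20% with the fewest hospital visits: this is the first group
  • Repeat this process for all possible combinations of groups

**Part 1:** Using the ".add = TRUE" argument in dplyr, I think I was able to do this as follows: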
nested_combinations <- my_data %>%
  group_by(Gender) %>%
  # .add = TRUE keeps the existing grouping and adds to it
  # (the older spelling `add = TRUE` is deprecated in dplyr >= 1.0.0)
  group_by(Status, .add = TRUE) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarize(total_count = n(),
            disease_count = sum(Disease == "Yes"),
            disease_proportion = mean(Disease == "Yes"))

# results 

  Gender Status  height_group weight_group visits_group total_count disease_count disease_proportion
   <fct>  <fct>          <int>        <int>        <int>       <int>         <int>              <dbl>
 1 Female Citizen            1            1            1          16             5              0.312
 2 Female Citizen            1            1            2          16             4              0.25 
 3 Female Citizen            1            1            3          16             7              0.438
 4 Female Citizen            1            1            4          15             4              0.267
 5 Female Citizen            1            1            5          15             8              0.533
 6 Female Citizen            1            2            1          16             5              0.312
 7 Female Citizen            1            2            2          16             4              0.25 
 8 Female Citizen            1            2            3          16             8              0.5  
 9 Female Citizen            1            2            4          15             6              0.4  
10 Female Citizen            1            2            5          15             6              0.4

**Part 2:** Next, I calculated the "ranges" (i.e. min and max) of each of these ntile groups:

table_data <- data.frame(
  Groups = paste0("Group ", 1:5),
  Min_Height = tapply(my_data$Height, ntile(my_data$Height, 5), min),
  Max_Height = tapply(my_data$Height, ntile(my_data$Height, 5), max),
  Min_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), min),
  Max_Weight = tapply(my_data$Weight, ntile(my_data$Weight, 5), max),
  Min_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), min),
  Max_Visits = tapply(my_data$Hospital_Visits, ntile(my_data$Hospital_Visits, 5), max)
)

# results

   Groups Min_Height Max_Height Min_Weight Max_Weight Min_Visits Max_Visits
1 Group 1   111.5468   141.4839   56.53098   81.83402          1          4
2 Group 2   141.4965   147.4422   81.85064   87.45406          4          8
3 Group 3   147.4487   152.3924   87.45935   92.72041          8         12
4 Group 4   152.4016   158.5178   92.72941   98.54624         12         17
5 Group 5   158.5187   188.4777   98.55533  121.02420         17         20

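**My question:** Is there a way to insert the min/max ranges of the different variables from Part 2 as new columns into the result of Part 1 (e.g. min_height, max_height, min_weight, max_weight, min_visits, max_visits)?

Currently, I am doing this with a series of "ifelse" statements, but that does not seem very efficient. Can someone show me a better way?
Thanks!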
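
For reference, the literal merge described above can be sketched as three joins, one per grouping variable, instead of nested "ifelse" calls. This is only a minimal sketch: `range_lookup()` is a hypothetical helper (not part of dplyr), the data is rebuilt the same way as in the question, and the `min_*`/`max_*` column names are illustrative. Note that ranges built this way come from ntiles of the whole dataset, as in Part 2, whereas the groups in Part 1 are ntiles within each nested subgroup, so the joined ranges will not exactly bound the rows in each nested group.

```r
library(dplyr)

# Rebuild the example data as in the question
set.seed(123)
n <- 5000
my_data <- data.frame(
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  Status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  Height = rnorm(n, 150, 10),
  Weight = rnorm(n, 90, 10),
  Hospital_Visits = sample.int(20, n, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

# Part 1: disease proportion within nested ntile groups
nested_combinations <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarise(total_count = n(),
            disease_proportion = mean(Disease == "Yes"),
            .groups = "drop")

# Hypothetical helper: a lookup table for one variable, keyed by its
# ntile index (1..5) over the WHOLE dataset, as in Part 2
range_lookup <- function(x, key, label) {
  out <- data.frame(
    grp = 1:5,
    lo  = as.numeric(tapply(x, ntile(x, 5), min)),
    hi  = as.numeric(tapply(x, ntile(x, 5), max))
  )
  names(out) <- c(key, paste0("min_", label), paste0("max_", label))
  out
}

# Attach each lookup table by joining on the matching group column
with_ranges <- nested_combinations %>%
  left_join(range_lookup(my_data$Height, "height_group", "height"),
            by = "height_group") %>%
  left_join(range_lookup(my_data$Weight, "weight_group", "weight"),
            by = "weight_group") %>%
  left_join(range_lookup(my_data$Hospital_Visits, "visits_group", "visits"),
            by = "visits_group")
```

Each `left_join()` adds two columns and leaves the number of rows unchanged, because every lookup table has exactly one row per group index.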

ozxc1zmp #1

Something along these lines?

## example data
library(dplyr)

d <- data.frame(gender = gl(2, 10),  # length 20, recycled to 40 rows
           height = 160 + sample(1:40, 40),
           weight = 50 + sample(1:50, 40),
           disease = sample(c(TRUE, FALSE), 40, replace = TRUE)
           )

## flag the lowest 20% at each nesting level, then summarise per group
d |>
  group_by(gender) |>
  mutate(low_height = height < quantile(height, .2)) |>
  group_by(gender, low_height) |>
  mutate(low_weight = weight < quantile(weight, .2)) |>
  group_by(gender, low_height, low_weight) |>
  summarise(across(c(height, weight),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                        ),
                   .names = "{.col}_{.fn}"
                   ),
            prop_disease = sum(disease) / n()
            ## add further summary columns here
            )
# A tibble: 8 x 8
# Groups:   gender, low_height [4]
  gender low_height low_weight height_min height_max weight_min weight_max
  <fct>  <lgl>      <lgl>           <dbl>      <dbl>      <dbl>      <dbl>
1 1      FALSE      FALSE             172        199         67        100
2 1      FALSE      TRUE              173        190         52         65
3 1      TRUE       FALSE             161        169         74         94
4 1      TRUE       TRUE              165        165         61         61
5 2      FALSE      FALSE             168        200         56         96
6 2      FALSE      TRUE              170        192         51         54
7 2      TRUE       FALSE             164        167         68         93
8 2      TRUE       TRUE              163        163         55         55
# i 1 more variable: prop_disease <dbl>
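
Applied to the data in the question, the same idea might look like the sketch below. It assumes `my_data` is rebuilt as in the question and swaps the logical `low_*` flags for the `ntile()` groups from Part 1. The `min_*`/`max_*` column names are an illustrative choice made through `.names`. Because the ranges are computed inside `summarise()`, they describe the rows that actually fall into each nested group, so no separate lookup table is needed.

```r
library(dplyr)

# Rebuild the example data as in the question
set.seed(123)
n <- 5000
my_data <- data.frame(
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE, prob = c(0.45, 0.55))),
  Status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE, prob = c(0.3, 0.7))),
  Height = rnorm(n, 150, 10),
  Weight = rnorm(n, 90, 10),
  Hospital_Visits = sample.int(20, n, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

nested_ranges <- my_data %>%
  group_by(Gender, Status) %>%
  mutate(height_group = ntile(Height, 5)) %>%
  group_by(height_group, .add = TRUE) %>%
  mutate(weight_group = ntile(Weight, 5)) %>%
  group_by(weight_group, .add = TRUE) %>%
  mutate(visits_group = ntile(Hospital_Visits, 5)) %>%
  group_by(visits_group, .add = TRUE) %>%
  summarise(
    total_count = n(),
    disease_count = sum(Disease == "Yes"),
    disease_proportion = mean(Disease == "Yes"),
    # min and max of each variable within the nested group
    across(c(Height, Weight, Hospital_Visits),
           list(min = min, max = max),
           .names = "{.fn}_{.col}"),
    .groups = "drop"
  )

head(nested_ranges)
```

The result has one row per nested group, with columns such as `min_Height` and `max_Height` alongside the disease counts and proportions.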
