R:使用(group_by .add = TRUE)计算分位数

brgchamk  于 2023-01-10  发布在  其他
关注(0)|答案(1)|浏览(136)

我正在使用R编程语言。
我有以下数据集:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
Gender <- as.factor(gender)

status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
Status  <- as.factor(status )

Height = rnorm(5000, 150, 10)
Weight = rnorm(5000, 90, 10)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)

################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

###################
my_data = data.frame(Patient_ID, Gender, Status, Height, Weight, Hospital_Visits, Disease)

  Patient_ID Gender    Status   Height    Weight Hospital_Visits Disease
1          1 Female   Citizen 145.0583 113.70725               1      No
2          2   Male Immigrant 161.2759  88.33188              18      No
3          3 Female Immigrant 138.5305  99.26961               6     Yes
4          4   Male   Citizen 164.8102  84.31848              12      No
5          5   Male   Citizen 159.1619  92.25090              12     Yes
6          6 Female   Citizen 153.3513 101.31986              11     Yes

在上一个问题(R: Calculating Proportions Based on Nested Groups)中,我学习了如何基于ntiles计算"嵌套比例"(例如,为一个变量计算3个ntiles,按这3个ntiles分组,然后基于这些ntiles为第二个变量计算3个ntiles,等等):

# e.g. using 3 ntiles

my_data %>% 
  group_by(Gender, Status) %>%
  mutate(Height_ntile = ntile(Height, 3),
         Height_range = paste(min(Height), max(Height), sep = "-")) %>%
  group_by(Height_ntile, Height_range, .add = TRUE) %>%
  mutate(Weight_ntile = ntile(Weight, 3),
         Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
  group_by(Weight_ntile, Weight_range, .add = TRUE) %>%
  mutate(Hospital_Visits_ntile = ntile(Hospital_Visits, 3),
         Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
  group_by(Hospital_Visits_ntile, Hospital_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"), 
            count = n(),
            .groups = "drop")
    • 现在,我尝试重复这个完全相同的函数,但使用"分位数"代替:**

我试图修改上面的代码-以下是我的尝试:

my_data %>% 
  group_by(Gender, Status) %>%
  mutate(Height_group = cut(Height, breaks = c(-Inf, 
                                               quantile(Height, c(0.33, 0.67)), 
                                               Inf)),
         Height_range = paste(min(Height), max(Height), sep = "-")) %>%
  group_by(Height_group, Height_range, .add = TRUE) %>%
  mutate(Weight_group = cut(Weight, breaks = c(-Inf, 
                                               quantile(Weight, c(0.33, 0.67)), 
                                               Inf)),
         Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
  group_by(Weight_group, Weight_range, .add = TRUE) %>%
  mutate(Hospital_Visits_group = cut(Hospital_Visits, breaks = c(-Inf, 
                                                                quantile(Hospital_Visits, c(0.33, 0.67)), 
                                                                Inf)),
         Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
  group_by(Hospital_Visits_group, Hospital_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"), 
            count = n(),
            .groups = "drop")

这段代码运行了,但我不确定我是否正确地执行了这一操作(例如,出现了"无限"值):

A tibble: 108 x 10
   Gender Status  Height_~1 Heigh~2 Weigh~3 Weigh~4 Hospi~5 Hospi~6 perce~7
   <fct>  <fct>   <fct>     <chr>   <fct>   <chr>   <fct>   <chr>     <dbl>
 1 Female Citizen (-Inf,14~ 115.86~ (-Inf,~ 58.991~ (-Inf,~ 1-20      0.314
 2 Female Citizen (-Inf,14~ 115.86~ (-Inf,~ 58.991~ (7,14]  1-20      0.458
    • 有人能告诉我我是否做对了吗**

谢谢!

zte4gxcn

zte4gxcn1#

答案基于@akrun提供的见解:

my_data %>% 
  group_by(Gender, Status) %>%
  mutate(Height_group = as.integer(cut(Height, breaks = c(-Inf, 
                                               quantile(Height, c(0.33, 0.67)), 
                                               Inf))),
         Height_range = paste(min(Height), max(Height), sep = "-")) %>%
  group_by(Height_group, Height_range, .add = TRUE) %>%
  mutate(Weight_group = as.integer(cut(Weight, breaks = c(-Inf, 
                                               quantile(Weight, c(0.33, 0.67)), 
                                               Inf))),
         Weight_range = paste(min(Weight), max(Weight), sep = "-")) %>%
  group_by(Weight_group, Weight_range, .add = TRUE) %>%
  mutate(Hospital_Visits_group = as.integer(cut(Hospital_Visits, breaks = c(-Inf, 
                                                                quantile(Hospital_Visits, c(0.33, 0.67)), 
                                                                Inf))),
         Hospital_range = paste(min(Hospital_Visits), max(Hospital_Visits), sep = "-")) %>%
  group_by(Hospital_Visits_group, Hospital_range, .add = TRUE) %>%
  summarize(percent_disease = mean(Disease == "Yes"), 
            count = n(),
            .groups = "drop")

我理解得对吗?

相关问题