R:比较来自不同数据集的亚组

vcirk6k6  于 2023-01-28  发布在  其他
关注(0)|答案(1)|浏览(98)

我正在使用R编程语言。
我有以下包含加拿大人身高和体重的数据集-使用身高(cm)值,将体重(kg)拆分为基于ntiles的bin,并计算每个ntile bin内var2的平均值:

library(dplyr)
library(gtools)
set.seed(123)
canada = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

Part_1 = canada %>% 
  mutate(quants = quantcut(weight, 100),
         rank = as.numeric(quants)) %>%
  group_by(quants) %>% 
  mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))

Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)

> Part_1
# A tibble: 100 x 8
# Groups:   quants [100]
   height weight quants         rank   min   max count avg_height
    <dbl>  <dbl> <fct>         <dbl> <dbl> <dbl> <int>      <dbl>
 1   144.  114.  (110.2,113.9]    99 110.  114.    100       150.
 2   148.   88.3 (88.12,88.38]    44  88.1  88.4   100       149.
 3   166.   99.3 (99.1,99.52]     83  99.1  99.5   100       152.
 4   151.   84.3 (84.14,84.44]    29  84.1  84.4   100       150.

例如,我看到有100人的体重在100.2 - 113.9公斤之间,这些人的平均身高是150厘米
现在,假设我有一个类似的美国人数据集:

set.seed(124)
usa = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))
    • 我的问题:根据我使用加拿大数据集计算的体重范围-我想找出有多少美国人属于这些加拿大范围,以及在这些加拿大范围内的美国人的平均体重是多少**

例如:

  • 在加拿大数据集中,我看到有100人的体重在100.2 - 113.9 kg之间,这些人的平均身高为150 cm
  • 有多少美国人的体重在100. 2 - 113. 9公斤之间,这些美国人的平均身高是多少?

我知道我可以为每个等级手动执行此操作:

americans_in_canadian_rank99 = usa %>% 
  filter(weight > 110.2 & weight < 113.9) %>% 
  group_by() %>% 
  summarize(count = n(), avg_height = mean(height))

   americans_in_canadian_rank44 = usa %>% 
      filter(weight > 88.1 & weight < 88.4) %>% 
      group_by() %>% 
      summarize(count = n(), avg_height = mean(height))

最后,我会寻找这样的东西一个理想的输出:

canadian_rank min_weight max_weight canadian_count canadian_avg_height american_count american_avg_height
1            99      110.2      113.9            100                 150            116                 150
2            44       88.1       88.4            100                 149            154                 150
    • 有人能帮我找个更好的方法吗**

谢谢!

vs91vp4v

vs91vp4v1#

使用data.table,您可以执行以下操作:

library(data.table)
library(stringr)

dt1 <- as.data.table(usa)
dt1 <- dt1[, c("min", "max") := weight]

dt2 <- as.data.table(Part_1 %>% select("quants", "rank"))
dt2 <- cbind(dt2[,.(rank)], 
             setDT(tstrsplit(str_sub(dt2$quants, 2, -2), ",", fixed = TRUE, names = c("min", "max"))))
dt2 <- dt2[, lapply(.SD, as.numeric)]
setkey(dt2, min, max)

dt1 <- dt1[, rank := dt2$rank[foverlaps(dt1, dt2, by.x = c("min", "max"), by.y = c("min", "max"), which = TRUE)$yid]] %>% 
  select(-c("min", "max"))

编辑
完全忽略了最后一部分,但是如果你想这样做,最后一点应该相对简单(如果你愿意,你可以使用dplyr):

dt3 <- rbind(canada %>% 
               mutate(quants = quantcut(weight, 100),
                      rank = as.numeric(quants),
                      country = "Canada") %>%
               as.data.table(),
             copy(dt1)[, country := "USA"], fill = TRUE)
dt3 <- dt3[,.(count = .N, avg_height = mean(height)), by = c("rank", "country")] %>% 
  dcast(rank ~ country, value.var = c("count", "avg_height")) %>% 
  merge(dt2 %>% rename("min_weight" = "min", "max_weight" = "max"), by = c("rank"), all.x = TRUE)

编辑2
或者,您可以尝试使用cut函数执行类似的操作,而无需学习data.table

rank_breaks <- Part_1 %>% 
  mutate(breaks = sub(",.*", "", str_sub(quants, 2)) %>% as.numeric()) %>%
  arrange(rank) %>% 
  pull(breaks)

# Here I change minimum and maximum of groups 1 and 100 to -Inf and Inf respectively. 
# If you do not wish to do so, you can disregard it and run `rank_breaks <- c(rank_breaks, max(canada$weight))` instead  
rank_breaks[1] <- -Inf
rank_breaks <- c(rank_breaks, Inf)

usa <- usa %>% 
  mutate(rank = cut(weight, breaks = rank_breaks, labels = c(1:100)))

相关问题