I am working with the R programming language.
I have the following dataset:
set.seed(123)
library(dplyr)
var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
var5 <- factor(sample(c("Yes", "No"), 1000, replace=TRUE, prob=c(0.4, 0.6)))
var6 <- factor(sample(c("Yes", "No"), 1000, replace=TRUE, prob=c(0.4, 0.6)))
my_data = data.frame( var1, var2, var3, var4, var5, var6)
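I want to calculate "group-wise percentiles" (e.g. at arbitrary levels) for different columns in this dataset, grouped by the categorical variables.
Originally, I tried to do this with a function, but that kept giving me a lot of trouble (e.g. R: Difficulty Calculating Percentiles?).
**So, in the meantime, I am trying to do this "manually".** As an example, suppose:
- The data is grouped by var5 and var6
- I want to create a variable "class3" that splits var3 into groups of 10 percentile points (deciles)
- I want to create a variable "class4" that splits var4 into groups of 20 percentile points (quintiles)
Here are the two different approaches I tried:
**Method 1:** Produces some NAs?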
library(dplyr)
final = my_data %>% group_by(var5, var6) %>%
  mutate(class3 = case_when(ntile(var3, 10) == 1 ~ paste0(round(min(var3), 2), " to ", round(quantile(var3, 0.1), 2), " decile 1"),
                            ntile(var3, 10) == 2 ~ paste0(round(quantile(var3, 0.1), 2), " to ", round(quantile(var3, 0.2), 2), " decile 2"),
                            ntile(var3, 10) == 3 ~ paste0(round(quantile(var3, 0.2), 2), " to ", round(quantile(var3, 0.3), 2), " decile 3"),
                            ntile(var3, 10) == 4 ~ paste0(round(quantile(var3, 0.3), 2), " to ", round(quantile(var3, 0.4), 2), " decile 4"),
                            ntile(var3, 10) == 5 ~ paste0(round(quantile(var3, 0.4), 2), " to ", round(quantile(var3, 0.5), 2), " decile 5"),
                            ntile(var3, 10) == 6 ~ paste0(round(quantile(var3, 0.5), 2), " to ", round(quantile(var3, 0.6), 2), " decile 6"),
                            ntile(var3, 10) == 7 ~ paste0(round(quantile(var3, 0.6), 2), " to ", round(quantile(var3, 0.7), 2), " decile 7"),
                            ntile(var3, 10) == 8 ~ paste0(round(quantile(var3, 0.7), 2), " to ", round(quantile(var3, 0.8), 2), " decile 8"),
                            ntile(var3, 10) == 9 ~ paste0(round(quantile(var3, 0.8), 2), " to ", round(quantile(var3, 0.9), 2), " decile 9"),
                            ntile(var3, 10) == 10 ~ paste0(round(quantile(var3, 0.9), 2), " to ", round(max(var3), 2), " decile 10"))) %>%
  mutate(class4 = case_when(ntile(var4, 20) == 1 ~ paste0(round(min(var4), 2), " to ", round(quantile(var4, 0.1), 2), " pcile 1"),
                            ntile(var4, 20) == 2 ~ paste0(round(quantile(var4, 0.1), 2), " to ", round(quantile(var4, 0.2), 2), " pcile 2"),
                            ntile(var4, 20) == 3 ~ paste0(round(quantile(var4, 0.2), 2), " to ", round(quantile(var4, 0.3), 2), " pcile 3"),
                            ntile(var4, 20) == 4 ~ paste0(round(quantile(var4, 0.3), 2), " to ", round(quantile(var4, 0.4), 2), " pcile 4"),
                            ntile(var4, 20) == 5 ~ paste0(round(quantile(var4, 0.4), 2), " to ", round(quantile(var4, 0.5), 2), " pcile 5")))
**Method 2:** Produces fewer NAs?
final = my_data %>%
  group_by(var5, var6) %>%
  mutate(class3 = paste0(cut(var3, breaks = c(-Inf, quantile(var3, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)), Inf),
                             labels = c("ptile 1", "ptile 2", "ptile 3", "ptile 4", "ptile 5", "ptile 6", "ptile 7", "ptile 8", "ptile 9", "ptile 10")),
                         " (", round(min(var3), 2), " to ", round(max(var3), 2), ")")) %>%
  mutate(class4 = paste0(cut(var4, breaks = c(-Inf, quantile(var4, c(0.2, 0.4, 0.6, 0.8)), Inf),
                             labels = c("ptile 1", "ptile 2", "ptile 3", "ptile 4", "ptile 5")),
                         " (", round(min(var4), 2), " to ", round(max(var4), 2), ")"))
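I think Method 2 might be more correct because it produces fewer NA values, but can someone help me verify whether the approach in Method 2 is correct? If it is not, how can I correct it?
Thanks!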
2 answers
#1 (ddrv8njm)
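For Method 1, I believe you mostly get NA values because you need to use ntile(..., 5) instead of ntile(..., 20). The class4 case_when only defines cases for tiles 1 to 5, so every row that falls into tiles 6 to 20 matches no case and becomes NA.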
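To illustrate, here is a minimal sketch of that fix, assuming the goal is quintiles of var4 within each var5/var6 group. Note that with 5 tiles the label boundaries in the original case_when would also have to move to the 0.2, 0.4, 0.6 and 0.8 quantiles. To keep the sketch short it labels each tile with the observed minimum and maximum inside that tile instead of writing out one case per tile. That is a swapped-in simplification, not the asker's exact labelling. The smaller sample size and the helper column tile4 are assumptions made for illustration.

```r
library(dplyr)

set.seed(123)

# Small stand-in for my_data (the size is an assumption, chosen for speed)
n <- 2000
my_data <- data.frame(
  var4 = rnorm(n, 100, 100),
  var5 = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6))),
  var6 = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

final <- my_data %>%
  group_by(var5, var6) %>%
  # 5 tiles per group, i.e. bins of 20 percentile points each
  mutate(tile4 = ntile(var4, 5)) %>%
  # regroup by tile so min() and max() refer to the tile, not the whole group
  group_by(var5, var6, tile4) %>%
  mutate(class4 = paste0(round(min(var4), 2), " to ", round(max(var4), 2),
                         " pcile ", tile4)) %>%
  ungroup()

# Every row now has a tile between 1 and 5, so no row is left unlabelled
table(final$tile4, useNA = "ifany")
```

Because every tile value produced by ntile(var4, 5) gets a label, no rows are left without a class.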
#2 (kpbpu008)
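We can easily calculate percentiles in R with the quantile() function, which uses the following syntax:
quantile(x, probs = seq(0, 1, 0.25))
- x: a numeric vector whose percentiles we wish to find
- probs: a numeric vector of probabilities in [0, 1] giving the percentiles we wish to find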
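As a rough sketch of how this could be applied to the question: in Method 2 the cut() breaks do assign every row to a bin, but the label pastes min() and max() of the whole group, so every row in a group shows the same range. The example below labels each bin with its own quantile boundaries instead. bin_by_quantile is a hypothetical helper written for this illustration (it is not part of dplyr or base R), the smaller data set is an assumption, and the helper assumes the quantile breaks are unique, which holds for continuous data like this but can fail when there are many ties.

```r
library(dplyr)

# Basic usage: the deciles of a numeric vector
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# Hypothetical helper: bin a vector at its own quantiles and label each bin
# with the boundaries that define it. Assumes the breaks are unique.
bin_by_quantile <- function(v, n_bins) {
  probs  <- seq(0, 1, length.out = n_bins + 1)
  breaks <- quantile(v, probs = probs, names = FALSE)
  lower  <- breaks[-length(breaks)]
  upper  <- breaks[-1]
  labels <- paste0(round(lower, 2), " to ", round(upper, 2),
                   " ptile ", seq_len(n_bins))
  # include.lowest = TRUE keeps the minimum value inside the first bin
  as.character(cut(v, breaks = breaks, labels = labels, include.lowest = TRUE))
}

set.seed(123)

# Small stand-in for my_data (the size is an assumption, chosen for speed)
n <- 2000
my_data <- data.frame(
  var3 = rnorm(n, 100, 100),
  var4 = rnorm(n, 100, 100),
  var5 = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6))),
  var6 = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.4, 0.6)))
)

final <- my_data %>%
  group_by(var5, var6) %>%
  mutate(class3 = bin_by_quantile(var3, 10),   # 10-percentile-point bins
         class4 = bin_by_quantile(var4, 5)) %>% # 20-percentile-point bins
  ungroup()

# Check for unlabelled rows
sum(is.na(final$class3))
sum(is.na(final$class4))
```

Since the breaks run from the group minimum to the group maximum and the lowest value is included, every row falls into some bin, and the number of bins can be changed through the single n_bins argument.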