R: Ntiles over overlapping groups?

qfe3c7zg  posted on 2023-01-10 in Other

I am working with the R programming language.
I have the following data set:

set.seed(123)
library(dplyr)

var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
var5 <- factor(sample(c("A","B", "C", "D", "E"), 1000, replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2)))
var6 <- factor(sample(c("A","B", "C", "D", "E"), 1000, replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2)))

my_data = data.frame( var1, var2, var3, var4, var5, var6)

I then used the code below to find the ntile ranges of "var1" (e.g. ntile = 4) for each combination of "var5" and "var6":

test = data.frame(my_data %>%
    group_by(var5, var6) %>%
    mutate(group = ntile(var1, 4)) %>%
    group_by(var5, group) %>%
    mutate(min = min(var1),
           max = max(var1)) %>%
    mutate(range = paste(min, max, sep = "-")) %>%
    mutate(count = n()) %>%
    ungroup())

I started inspecting these ranges, for example by looking at the first group:

t = test[test$var5 == "A" & test$var6 == "A",]
table(t$range)

-284.532016946004-41.1223359161037  155.096551597729-439.037082127154  31.8870220096767-101.689288211385  94.4497431366975-175.804225191541 
                               123                                122                                123                                122 
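I noticed a few problems:
  • Some of the ranges overlap (e.g. 31 - 101 and 94 - 175)
  • The upper bound of one range is greater than the maximum of var1 (the same problem occurs with the lower bound and the minimum of var1)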

For example:

> min(t$var1)
[1] -184.3018
> max(t$var1)
[1] 352.2398
Can someone please show me how to fix my code so that these problems are resolved?

Thanks!

lf5gs5x2  #1

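Answer provided by @akrun: the second group_by() has to include "var6" as well as "var5", so that min and max are computed within the same groups in which ntile() assigned the quartiles: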

set.seed(123)
library(dplyr)

var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
var5 <- factor(sample(c("A","B", "C", "D", "E"), 1000, replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2)))
var6 <- factor(sample(c("A","B", "C", "D", "E"), 1000, replace=TRUE, prob=c(0.2, 0.2, 0.2, 0.2, 0.2)))


my_data = data.frame( var1, var2, var3, var4, var5, var6)

test <- my_data %>%
    group_by(var5, var6) %>%
    mutate(group = ntile(var1, 4)) %>%
    group_by(var5, var6, group) %>%
    mutate(min = min(var1),
           max = max(var1)) %>%
    mutate(range = paste(min, max, sep = "-")) %>%
    mutate(count = n()) %>%
    ungroup()

t = test[test$var5 == "A" & test$var6 == "A",]
table(t$range)


-184.301789503408-41.1223359161037  164.320351401576-352.239807042822  41.6359330862004-96.5932746261536  97.4673839365592-162.203323587106 
                               123                                122                                123                                122
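
The ranges above no longer overlap, and the lowest and highest bounds now match min(t$var1) and max(t$var1) from the question. As a minimal sketch of how this could be checked for every "var5"/"var6" combination at once, the snippet below builds a small illustrative data set (not the data from the question), summarises one row per quartile, and compares each quartile's minimum with the previous quartile's maximum. The names toy, ranges, range_min, range_max, prev_max and overlaps are introduced here for illustration only and are not part of the original answer.

```r
library(dplyr)

set.seed(123)

# Small illustrative data set (hypothetical; not the data from the question)
toy <- data.frame(
  var1 = rnorm(2000, 100, 100),
  var5 = factor(sample(c("A", "B"), 2000, replace = TRUE)),
  var6 = factor(sample(c("A", "B"), 2000, replace = TRUE))
)

# One row per var5/var6/quartile; the bounds are computed in the same
# groups in which ntile() assigned the quartiles
ranges <- toy %>%
  group_by(var5, var6) %>%
  mutate(group = ntile(var1, 4)) %>%
  group_by(var5, var6, group) %>%
  summarise(range_min = min(var1),
            range_max = max(var1),
            count = n(),
            .groups = "drop")

# Within each var5/var6 combination, flag any quartile that starts
# below the point where the previous quartile ended
overlap_check <- ranges %>%
  group_by(var5, var6) %>%
  arrange(group, .by_group = TRUE) %>%
  mutate(prev_max = lag(range_max),
         overlaps = !is.na(prev_max) & range_min < prev_max) %>%
  ungroup()

# Should be FALSE when the grouping is consistent
any(overlap_check$overlaps)
```

Using summarise() instead of mutate() here is only a convenience: it returns a compact table of ranges rather than repeating the bounds on every row, which makes the comparison between neighbouring quartiles easier to read.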
