R: Splitting data into three sets with balanced data

r1zhe5dt  posted on 2023-02-27 in Other

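Edit: OK, I now have training, validation, and test sets in which all rows belonging to the same patient are kept together. However, when I check the result with a plot, I see that the class imbalance of the original dataset (outcome LesionResponse, 1: 70% and 0: 30%) is not really preserved. In fact, in the training data I get a distribution close to 55/45, which is not what I want. What can I do to correct this?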

summary(train$LesionResponse)
#   0   1
# 159 487
summary(validation$LesionResponse)
#  0   1
# 33 170
summary(test$LesionResponse)
#  0   1
# 77 126

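Hello everyone, I have a dataset (an example is shown here) and I have to build a predictive model for the outcome "LesionResponse". So I first split my data into a training set (60%), a validation set, and a test set (20% each). I have a big problem: many rows in my table belong to the same patient, so to avoid bias I have to split my data while taking the PatientID into account. I am stuck here because I don't know how to split my data into three parts while keeping the rows that belong to the same patient together.
Here is my data: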

structure(list(PatientID = c("P1", "P1", "P1", 
"P2", "P3", "P3", "P4", "P5", 
"P5", "P6"), LesionResponse = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 
    2L, 1L, 2L), .Label = c("0", 
    "1"), class = "factor"), pyrad_tum_original_shape_LeastAxisLength = c(19.7842995242803, 
    15.0703960571122, 21.0652247652897, 11.804125918871, 27.3980336338908, 
    17.0584330264122, 4.90406343942677, 4.78480430022189, 6.2170232078547, 
    5.96309532740722, 5.30141540007441), pyrad_tum_original_shape_Sphericity = c(0.652056853392657, 
    0.773719977240238, 0.723869070051882, 0.715122964970338, 
    0.70796498824535, 0.811937882810929, 0.836458991713367, 0.863337931630415, 
    0.851654860256904, 0.746212862162174), pyrad_tum_log.sigma.5.0.mm.3D_firstorder_Skewness = c(0.367453961973625, 
    0.117673346718817, 0.0992025164349288, -0.174029385779302, 
    -0.863570016875989, -0.8482193060411, -0.425424618080682, 
    -0.492420174157913, 0.0105111292451967, 0.249865833210199), pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast = c(0.376932105256115, 
    0.54885738172596, 0.267158344601612, 2.90094719958076, 0.322424096161189, 
    0.221356030145403, 1.90012334870722, 0.971638740404501, 0.31547550396399, 
    0.653999340294952), pyrad_tum_wavelet.LHH_glszm_GrayLevelNonUniformityNormalized = c(0.154973213866752, 
    0.176128379241556, 0.171129002059539, 0.218343919352019, 
    0.345985943932352, 0.164905080489496, 0.104536489151874, 
    0.1280276816609, 0.137912385073012, 0.133420904484894), pyrad_tum_wavelet.LHH_glszm_LargeAreaEmphasis = c(27390.2818110851, 
    11327.7931034483, 51566.7948885976, 7261.68702290076, 340383.536555142, 
    22724.7792207792, 45.974358974359, 142.588235294118, 266.744186046512, 
    1073.45205479452), pyrad_tum_wavelet.LHH_glszm_LargeAreaLowGrayLevelEmphasis = c(677.011907073653, 
    275.281153810458, 582.131636238695, 173.747506476692, 6140.73990175018, 
    558.277670638306, 1.81042257642817, 4.55724031114589, 6.51794350173746, 
    19.144924585586), pyrad_tum_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized = c(0.411899490603372, 
    0.339216399209913, 0.425584323452468, 0.355165782879786, 
    0.294934042125209, 0.339208410636982, 0.351742274819198, 
    0.394463667820069, 0.360735532720389, 0.36911240382811)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

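I was thinking of a loop that splits unique(PatientID) into three parts, with 60% in the training set, and repeats the split for as long as the outcome is not balanced in the training set. I was also thinking of accepting an interval around the target balance rather than an exact value. How would you do it?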
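
The loop idea described above could be sketched in base R as follows. This is only a minimal sketch under stated assumptions, not a method taken from the thread: the helper name `split_by_patient`, its arguments, and the tolerance `tol` are hypothetical, and the toy data is simulated because the real dataset is not available.

```r
# Hypothetical helper (not from the original post): draw patient-level
# 60/20/20 splits at random and retry until the share of LesionResponse == "1"
# in every set is within `tol` of the overall share.
split_by_patient <- function(df, id = "PatientID", y = "LesionResponse",
                             props = c(train = 0.6, validation = 0.2, test = 0.2),
                             tol = 0.05, max_tries = 1000) {
  ids <- unique(df[[id]])
  n_ids <- length(ids)
  # fixed number of patients per set; the first set takes the remainder
  sizes <- round(n_ids * props[-1])
  sizes <- c(n_ids - sum(sizes), sizes)
  names(sizes) <- names(props)
  target <- mean(df[[y]] == "1")
  for (i in seq_len(max_tries)) {
    # shuffle the patients and label them with a set name
    set_of <- setNames(rep(names(props), sizes), sample(ids))
    f <- factor(unname(set_of[as.character(df[[id]])]), levels = names(props))
    sets <- split(df, f)
    shares <- vapply(sets, function(s) mean(s[[y]] == "1"), numeric(1))
    if (isTRUE(all(abs(shares - target) <= tol))) return(sets)
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

# Simulated toy data (assumption: about 70% of rows have outcome 1)
set.seed(1)
n <- 1000
toy <- data.frame(
  PatientID = sample(sprintf("P%d", 1:100), n, replace = TRUE),
  LesionResponse = factor(rbinom(n, 1, 0.7))
)
sets <- split_by_patient(toy, tol = 0.05)
sapply(sets, nrow)
sapply(sets, function(s) mean(s$LesionResponse == "1"))
```

With few patients, or with outcomes that cluster strongly within patients, a tight `tol` may never be met; in that case the function stops with an error and the tolerance has to be widened.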

c9x0cxw0  #1

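***EDIT*** I misunderstood how you wanted to handle the patient ID. The original answer is at the bottom, but note that stratifying aims to put an equal proportion of each patient ID into each split. You should use the group_ splitting functions, as indicated by @Rui Barradas.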

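library(tidymodels)

set.seed(217)
# 80:20 split that keeps all rows of a patient in the same subset
df_split <- group_initial_split(df, PatientID, prop = 4/5)
df_training <- training(df_split)
df_testing <- testing(df_split)
# 75:25 split of the training patients; note that this returns an rset
# holding one training/validation split, not a data frame
df_validation <- group_validation_split(df_training, PatientID, prop = 3/4)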

***Original answer***

In the tidymodels framework, you can choose to stratify the sampling by the PatientID variable.

To create the desired split, you can first split the data 80:20 into training:test, and then split the training set 75:25 into training:validation.

library(tidymodels)

set.seed(217)
df_split <- initial_split(df, prop = 4/5, strata = PatientID)
df_training <- training(df_split)
df_testing <- testing(df_split)
df_validation <- validation_split(df_training, prop = 3/4, strata = PatientID)
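
The two-stage arithmetic above (80:20, then 75:25 of the larger part) can be checked with a short calculation. The variable names are illustrative only.

```r
# Stage 1: 80% kept for training + validation, 20% held out as test
first_split <- c(rest = 4/5, test = 1/5)
# Stage 2: the 80% is split again, 75% training and 25% validation
second_split <- c(train = 3/4, validation = 1/4)

# Overall shares of the full dataset
overall <- c(first_split[["rest"]] * second_split, test = first_split[["test"]])
round(overall, 2)
```

So the nested 4/5 and 3/4 proportions give the 60/20/20 split asked for in the question.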

osh3o9ms  #2

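Here is a way using the package rsample.
First split the data into test and the rest (named train in the code below), keeping all rows of each PatientID in the same subset, then split train again.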

library(rsample)

set.seed(2023)
g <- group_initial_split(df1, group = PatientID, prop = 0.8)
train <- training(g)
test <- testing(g)
g <- group_initial_split(train, group = PatientID, prop = 3/4)
train <- training(g)
validation <- testing(g)

# check data split proportions
df_list <- list(train = train, validation = validation, test = test)
sapply(df_list, nrow)
#>      train validation       test 
#>        600        199        201

# this shows that all groups belong to one subset only
lapply(df_list, \(x) unique(x[[1]]))
#> $train
#> [1] "P5"  "P9"  "P8"  "P3"  "P10" "P4" 
#> 
#> $validation
#> [1] "P2" "P7"
#> 
#> $test
#> [1] "P1" "P6"

Created on 2023-02-17 with reprex v2.0.2

Test data

set.seed(2023)
p <- sprintf("P%d", 1:10)
n <- 1e3
df1 <- data.frame(
  PatientID = sample(p, n, TRUE),
  x = rnorm(n)
)


ijnw1ujt  #3

You could use a one-liner that samples one of 1:3 for each unique patient ID and then splits df by the result.

set.seed(42)
res <- split(df, with(df, ave(id, id, FUN=\(x) sample.int(3, 1, prob=c(.6, .2, .2)))))

Tests:

## test proportions (should approx. be [.6, .2, .2])
proportions(sapply(res, \(x) length(unique(x$id)))) |> round(2)
#    1    2    3 
# 0.53 0.25 0.22 

## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)

Update

To get more stable proportions, we could use fixed group sizes by repeating 1:3 according to a vector p.

len <- length(u <- unique(df$id))
p1 <- c(.2, .2)
rlp <- round(len*p1)
p <- c(len - sum(rlp), rlp)
set.seed(42)
a <- setNames(rep.int(1:3, p), sample(u))

res <- split(df, a[match(df$id, names(a))])  ## this line splits the df

proportions(sapply(res, \(x) length(unique(x$id))))
#   1   2   3 
# 0.6 0.2 0.2 

## test uniqueness
stopifnot(length(Reduce(intersect, lapply(res, `[[`, 'id'))) == 0)
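
Note that the proportions checked above are counted in patients, not in rows. Since patients contribute different numbers of rows, the row shares and the outcome shares per subset can drift from 60/20/20 and from the overall outcome rate. A minimal sketch of such a check, on simulated data shaped like the example data (columns `id` and `les`; the names `sizes`, `patient_share`, `row_share`, and `outcome_share` are introduced here for illustration):

```r
set.seed(42)
n <- 200; np <- 100
df <- data.frame(id = paste0("P", sample.int(np, n, replace = TRUE)),
                 les = sample(0:1, n, replace = TRUE))

# fixed number of patients per subset, as in the update above
u <- unique(df$id)
sizes <- round(length(u) * c(.2, .2))
sizes <- c(length(u) - sum(sizes), sizes)
a <- setNames(rep.int(1:3, sizes), sample(u))
res <- split(df, a[match(df$id, names(a))])

# share of patients, share of rows, and share of les == 1 in each subset
patient_share <- proportions(sapply(res, function(x) length(unique(x$id))))
row_share     <- proportions(sapply(res, nrow))
outcome_share <- sapply(res, function(x) mean(x$les == 1))
round(rbind(patient_share, row_share, outcome_share), 2)
```

If the outcome shares differ too much between subsets, the split can be redrawn with another seed, or repeated until the shares fall within an accepted interval.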

Data:

set.seed(42)
n <- 200; np <- 100
df <- data.frame(id=paste0('P', as.integer(as.factor(sort(sample.int(np, n, replace=TRUE))))),
                 les=sample(0:1, n, replace=TRUE),
                 pyr=runif(n))
