numpy: How to perform a stratified train_test_split without shuffle?

Posted by dwthyt8l on 2023-10-19 in Other


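While exploring different use cases, I got an error when doing a stratified train_test_split without shuffling. This would be useful for time series data, but for demonstration purposes I have provided a simple dataset.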

Code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Sample DataFrame, replace this with your actual DataFrame
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
})

# Splitting the DataFrame 80/20 while stratifying on the 'target' column
# (with shuffle=False this call raises the ValueError shown below)
train_df, test_df = train_test_split(data, test_size=0.2, shuffle=False, stratify=data['target'], random_state=42)
train_df, test_df

Error:

ValueError: Stratified train/test split is not implemented for shuffle=False

Is there a better way to split a DataFrame by test_size while preserving the order (ascending or descending)?

numpy

Source: https://stackoverflow.com/questions/76978197/how-to-perform-a-stratified-train-test-split-without-shuffle

2 Answers


krcsximq1#

You can add a dummy order column, run train_test_split with shuffle=True, and then sort the train and test sets on that dummy column.
The idea looks like this:

# Record the original row position so it can be restored after the shuffle
data['DUMMY_IND'] = range(data.shape[0])
train_df, test_df = train_test_split(data, test_size=0.2, shuffle=True, stratify=data['target'], random_state=42)

train_df = train_df.sort_values('DUMMY_IND')
test_df = test_df.sort_values('DUMMY_IND')
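
As a check on the idea above, here is a minimal self-contained sketch using the dataset from the question. It sorts on the index instead of a helper column, which is an assumption that only holds when the DataFrame index already reflects the original row order (as the default RangeIndex does).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Stratification requires shuffle=True, so shuffle first and restore order afterwards.
train_df, test_df = train_test_split(
    data, test_size=0.2, shuffle=True, stratify=data["target"], random_state=42
)

# Assumption: the index encodes the original row order (default RangeIndex),
# so sort_index() plays the role of sorting on the dummy column.
train_df = train_df.sort_index()
test_df = test_df.sort_index()

# Both parts keep the class ratio of the full data and are in ascending order.
print(train_df["target"].value_counts().to_dict())
print(test_df["target"].value_counts().to_dict())
```

Note that the rows inside each part are ordered, but the test rows are not necessarily later than the training rows, which matters for time series.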

2023-10-19

irtuqstp2#

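According to the documentation, if shuffle=False then stratify must be None.
If you split the data so that the first 80% of the rows go to train and the last 20% go to test, you cannot guarantee that the samples will be stratified. If you want them stratified, you need to accept that some of the first or middle rows end up in test.
Imagine your dataset is sorted by target. Stratified means that train and test must have similar distributions. If you put the tail of the frame in test, the test set would contain only the last target values, so the split could not be stratified.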
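
One way to make this trade-off concrete is to take the last share of rows within each class rather than the tail of the whole frame. The sketch below is not a scikit-learn feature: `ordered_stratified_split` is a hypothetical helper written for illustration, and it assumes the frame is already in the order you want to preserve.

```python
import pandas as pd

def ordered_stratified_split(df, target_col, test_size=0.2):
    """Hypothetical helper: send the last `test_size` share of each class to test."""
    # Position of each row within its class, counted from the end (0 = last row).
    rank_from_end = df.groupby(target_col).cumcount(ascending=False)
    # Number of rows in the class that each row belongs to.
    class_size = df[target_col].map(df[target_col].value_counts())
    # Per-class number of test rows, rounded to the nearest integer.
    n_test = (class_size * test_size).round().astype(int)
    is_test = rank_from_end < n_test
    # Boolean masks keep the original row order in both parts.
    return df[~is_test], df[is_test]

# Dataset sorted by target, as in the scenario described above.
sorted_data = pd.DataFrame({
    "feature1": range(1, 11),
    "target": [0] * 5 + [1] * 5,
})

train_df, test_df = ordered_stratified_split(sorted_data, "target", test_size=0.2)
print(test_df)
```

Here the test set holds the last row of each class, so a row from the middle of the frame ends up in test, which is exactly the compromise described above.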

2023-10-19
