**What is the correct way to implement `SMOTE()` when modeling a classifier?** I'm really confused about where to apply `SMOTE()`. Say, as a beginner, I split my dataset into train and test like this:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Some dataset initialization
X = df.drop(['things'], axis=1)
y = df['things']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SMOTE() on the train dataset (random_state belongs to the SMOTE
# constructor, not to fit_resample):
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
```
Having applied `SMOTE()` to the training dataset of the classification problem above, my questions are:

1. Should I apply `SMOTE()` inside the pipeline after splitting the dataset, like this?:
```python
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('over', SMOTE(random_state=42)),
                              # liblinear supports both the 'l1' and 'l2'
                              # penalties searched in the grid below
                              ('model', LogisticRegression(solver='liblinear',
                                                           random_state=42))])

# Then do model evaluation with Repeated Stratified KFold,
# then Grid Search for hyperparameter tuning,
# then the actual model testing with unseen X_test (like this):
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
params = {'model__penalty': ['l1', 'l2'],
          'model__C': [0.001, 0.01, 0.1, 5, 10, 100]}
grid = GridSearchCV(estimator=pipeline,
                    param_grid=params,
                    scoring='roc_auc',
                    cv=cv,
                    n_jobs=-1)
grid.fit(X_train_smote, y_train_smote)
cv_score = grid.best_score_
test_score = grid.score(X_test, y_test)
print(f"Cross-validation score: {cv_score} \n Test Score: {test_score}")
```
2. Or should I apply the pipeline **without** `SMOTE()` in it, like this?:
```python
# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('model', LogisticRegression(random_state=42))])
# Same process as above for modeling, evaluation, etc...
```
3. Or should I put `SMOTE()` inside the pipeline but fit on the data that has *not* been SMOTE'd, like this?:
```python
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline for scaling and initializing the model
pipeline = imbpipeline(steps=[('scale', StandardScaler()),
                              ('over', SMOTE(random_state=42)),
                              ('model', LogisticRegression(random_state=42))])
# Same process as above for modeling, evaluation, etc...
# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train, y_train)
```
4. Or should I feed the SMOTE'd training data into sklearn's own `Pipeline`, like this?:
```python
# random_state belongs to the SMOTE constructor, not to fit_resample
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
pipeline = Pipeline(steps=[('scale', StandardScaler()),
                           ('model', LogisticRegression(random_state=42))])
# Same process as above for modeling, evaluation, etc...
# BUT!, when fitting grid.fit(), we do this?:
grid.fit(X_train_smote, y_train_smote)
```
**1 Answer**
In general, you want to SMOTE the training data only, never the validation or test data. Consequently, if you want k-fold cross-validation, you must not SMOTE the data before handing it to that process.
1. No. You would run SMOTE twice (once before the pipeline and again inside it), and on top of that the validation folds would contain SMOTE'd points, which you don't want.
2. No. The validation folds would contain SMOTE'd points.
3. Yes, this is the way to do it.
4. No. The validation folds would contain SMOTE'd points.
I'd suggest looking at `sklearn.metrics.roc_auc_score()`, along with whatever other metrics you use, since it can expose the problems caused by resampling data before splitting it. (SMOTE'd points can be highly predictable without improving AUC.)