Python: undersampling/oversampling problem with one-hot-encoded categorical data

nbnkbykc  posted on 2023-02-07 in Python

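I am trying to fit a classification problem with a 40,000 vs. 400 split between classes 0 and 1. I have been experimenting with oversampling and undersampling (the latter is not preferred), but I keep running into the same error.
Error: Shape of passed values is (34372, 1), indices imply (34372, 36)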

258   print("Before undersampling X_train:",X_train.shape[0])
    259 
--> 260   X_train,y_train=ros(X_train,y_train) #change this to ro_smote for oversampling
    261   print("After undersampling/oversampling X_train:",X_train.shape[0])
    262   X_train[label_fg] = y_train

/tmp/tmpta5bmz69.py in ros(X_train, y_train)
    131 def ros(X_train,y_train):
    132     ros = RandomOverSampler(random_state=1,sampling_strategy = 0.25) #sampling-stragey- 0.25,0.5,1,0.75
--> 133     X_train_on, y_train_on = ros.fit_resample(X_train, y_train)
    134 
    135     return X_train_on,y_train_on

/databricks/python/lib/python3.8/site-packages/imblearn/base.py in fit_resample(self, X, y)
     87         )
     88 
---> 89         X_, y_ = arrays_transformer.transform(output[0], y_)
     90         return (X_, y_) if len(output) == 2 else (X_, y_, output[2])
     91 

/databricks/python/lib/python3.8/site-packages/imblearn/utils/_validation.py in transform(self, X, y)
     38 
     39     def transform(self, X, y):
---> 40         X = self._transfrom_one(X, self.x_props)
     41         y = self._transfrom_one(y, self.y_props)
     42         return X, y

/databricks/python/lib/python3.8/site-packages/imblearn/utils/_validation.py in _transfrom_one(self, array, props)
     57             import pandas as pd
     58 
---> 59             ret = pd.DataFrame(array, columns=props["columns"])
     60             ret = ret.astype(props["dtypes"])
     61         elif type_ == "series":

/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    582                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    583                 else:
--> 584                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    585             else:
    586                 mgr = init_dict({}, index, columns, dtype=dtype)

/databricks/python/lib/python3.8/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    236         block_values = [values]
    237 
--> 238     return create_block_manager_from_blocks(block_values, [columns, index])
    239 
    240 

/databricks/python/lib/python3.8/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1685         blocks = [getattr(b, "values", b) for b in blocks]
   1686         tot_items = sum(b.shape[0] for b in blocks)
-> 1687         raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
   1688 
   1689 

ValueError: Shape of passed values is (34372, 1), indices imply (34372, 36)
Thu Aug 25 14:52:24 2022 Python shell started with PID  4674  and guid  b28118c68bbf497ea6029cc003bff481
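
The last frames of the traceback come down to pandas receiving a value array with a single column together with 36 column labels. A minimal sketch of that mismatch, using hypothetical toy shapes (5 rows, 3 labels) rather than the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 5 rows, but only 1 column of values
values = np.zeros((5, 1))
# 3 column labels, playing the role of the 36 labels in the traceback
columns = ["a", "b", "c"]

try:
    pd.DataFrame(values, columns=columns)
    message = None
except ValueError as exc:
    # pandas refuses to build a frame whose values and labels disagree
    message = str(exc)

print(message)
```

This only illustrates what the final error message means. It does not by itself show why the resampled array ended up with one column.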

Note that I have one-hot encoded my categorical dataset, which resulted in 36 features, and I have converted them to 'int'.
Am I missing something?

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

preped_data = feature_engg(preped_data)
preped_data = preped_data.astype(int)

def ros(X_train, y_train):
    # sampling_strategy=0.25: target minority/majority ratio after resampling
    sampler = RandomOverSampler(random_state=1, sampling_strategy=0.25)
    X_train_on, y_train_on = sampler.fit_resample(X_train, y_train)
    return X_train_on, y_train_on

label_fg = 'churn_fg'

X_train, X_test, y_train, y_test = train_test_split(
    preped_data.drop(label_fg, axis=1), preped_data[label_fg],
    stratify=preped_data[label_fg],
    shuffle=True, test_size=0.3, random_state=42)

print("Before undersampling X_train columns:", X_train.columns)
print("Before undersampling X_train:", X_train.shape[0])

X_train, y_train = ros(X_train, y_train)
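
For reference, a float sampling_strategy such as 0.25 asks for a minority-to-majority ratio of roughly 0.25 after resampling. The sketch below imitates that idea with plain pandas and NumPy on hypothetical toy data. It is not imblearn's implementation; the helper name random_oversample is made up for illustration, and it assumes a binary label.

```python
import numpy as np
import pandas as pd

def random_oversample(X, y, ratio, seed=1):
    """Duplicate minority rows until n_minority / n_majority is about `ratio`.

    Hypothetical helper for illustration; assumes a binary label.
    """
    counts = y.value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    target_minority = int(ratio * counts.loc[majority_label])
    n_extra = target_minority - counts.loc[minority_label]
    if n_extra <= 0:
        # Already at or above the requested ratio: nothing to add
        return X.copy(), y.copy()
    rng = np.random.default_rng(seed)
    minority_idx = y.index[y == minority_label].to_numpy()
    # Sample minority rows with replacement
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_out = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_out = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_out, y_out

# Toy data: 1000 rows of class 0 and 10 rows of class 1
X = pd.DataFrame({"f1": range(1010), "f2": [i % 2 for i in range(1010)]})
y = pd.Series([0] * 1000 + [1] * 10, name="churn_fg")

X_res, y_res = random_oversample(X, y, ratio=0.25)
print(X_res.shape, y_res.value_counts().to_dict())
```

The resampled X keeps all of its columns here, which is what the asker's call was expected to do as well.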

Answer #1 (zujrkrfu)

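I ran into the same problem after using a one-hot encoder. In my case, the cause was that the one-hot encoder returned a sparse matrix (run df.info() to check). To fix it, I tried the following after one-hot encoding: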

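import pandas as pd

# Coerce every column to a numeric dtype; values that cannot be parsed become NaN
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

# Convert the sparse one-hot columns to ordinary dense columns
X_train[oh_cols] = X_train[oh_cols].sparse.to_dense()
X_test[oh_cols] = X_test[oh_cols].sparse.to_dense()

where oh_cols is the list of one-hot-encoded columns.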
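
To check whether this answer applies to a given frame, the sparse columns can be found by dtype and converted one at a time. The following is a minimal sketch on a hypothetical toy frame; the column names plan and age are made up, and pd.get_dummies(..., sparse=True) is used only as a convenient way to produce sparse-backed columns, not necessarily how the asker's data was encoded.

```python
import pandas as pd

# Hypothetical toy frame; sparse=True makes the dummy columns sparse-backed
raw = pd.DataFrame({"plan": ["a", "b", "a", "c"], "age": [30, 41, 25, 52]})
encoded = pd.get_dummies(raw, columns=["plan"], sparse=True)

# Detect sparse columns by inspecting each column's dtype
sparse_cols = [
    c for c in encoded.columns
    if isinstance(encoded[c].dtype, pd.SparseDtype)
]
print("sparse columns:", sparse_cols)

# Replace each sparse column with its dense equivalent
dense = encoded.copy()
for col in sparse_cols:
    dense[col] = dense[col].sparse.to_dense()

# Cast to int, mirroring the astype(int) step from the question
dense = dense.astype(int)
print(dense.dtypes)
```

Once no column reports a sparse dtype, the frame can be passed to the resampler as in the question. Whether that resolves the shape error depends on the installed pandas and imblearn versions.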
