pandas machine learning: getting a DataFrame back after OneHotEncoder

Posted by 4xrmg8kj on 2023-02-02 in Other

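I have been trying to work out how to convert the output of OneHotEncoder back into a DataFrame. My idea was to separate the numeric columns from the categorical ones, as shown below: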

feats = df.drop(["Transported"], axis=1)  
target = df["Transported"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size = 0.2, 
 random_state=42)

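After the split, I needed to separate the numeric columns from the categorical ones in the training set (and in the test set), which I did like this: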

num_train = X_train.select_dtypes(include=['float64', 'int64'])
cat_train = X_train.select_dtypes(include=['object'])
num_test = X_test.select_dtypes(include=['float64', 'int64'])
cat_test = X_test.select_dtypes(include=['object'])

After that I applied SimpleImputer, and it worked.

import numpy as np
from sklearn.impute import SimpleImputer

imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

num = ["Age", "RoomService", "FoodCourt", "ShoppingMall","Spa","VRDeck"]
num_train.loc[:,num] = imputer_median.fit_transform(num_train[num])
num_test.loc[:,num] = imputer_median.transform(num_test[num])

cat = ["HomePlanet", "CryoSleep", "Destination","VIP"]
cat_train.loc[:,cat] = imputer_most_frequent.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = imputer_most_frequent.transform(cat_test[cat])

Here is the head of cat_train:

cat_train.head()
     HomePlanet CryoSleep   Destination VIP
2333    Earth   False   TRAPPIST-1e False
2589    Earth   False   TRAPPIST-1e False
8302    Europa  True    55 Cancri e False
8177    Mars    False   TRAPPIST-1e False
 500    Europa  True    55 Cancri e False

After this, however, I need to apply OneHotEncoder, like so:

from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder( drop='first',sparse=False)

cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])

I got this error:

shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result 
of shape (6954,4)

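I tried several approaches, but every time I failed to get a DataFrame back after OneHotEncoder. Please help; I am stuck at this point and cannot continue with the rest of the work. Thanks in advance.
Here is the full traceback: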

ValueError                                Traceback (most recent 
call last)
~\AppData\Local\Temp\ipykernel_16200\2252764984.py in <module>
  3 oneh = OneHotEncoder( drop='first',sparse=False)
  4 
----> 5 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
  6 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])

~\anaconda3\lib\site-packages\pandas\core\indexing.py in 
__setitem__(self, key, value)
714 
715         iloc = self if self.name == "iloc" else self.obj.iloc
--> 716         iloc._setitem_with_indexer(indexer, value, 
self.name)
717 
718     def _validate_key(self, key, axis: int):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in 
_setitem_with_indexer(self, indexer, value, name)

1691             self._setitem_with_indexer_split_path(indexer, value, name)
1692         else:
-> 1693             self._setitem_single_block(indexer, value, name)
1694 
1695     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in 
_setitem_single_block(self, indexer, value, name)
1941 
1942         # actually do the set
-> 1943         self.obj._mgr = 
self.obj._mgr.setitem(indexer=indexer, value=value)
 1944         self.obj._maybe_update_cacher(clear=True, 
inplace=True)
 1945 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in 
setitem(self, indexer, value)
335         For SingleBlockManager, this backs s[indexer] = value
336         """
--> 337         return self.apply("setitem", indexer=indexer, 
value=value)
338 
339     def putmask(self, mask, new, align: bool = True):

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in 
apply(self, f, align_keys, ignore_failures, **kwargs)
302                     applied = b.apply(f, **kwargs)
303                 else:
--> 304                     applied = getattr(b, f)(**kwargs)
305             except (TypeError, NotImplementedError):
306                 if not ignore_failures:

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in 
setitem(self, indexer, value)
957         else:
958             value = setitem_datetimelike_compat(values, 
len(values[indexer]), value)
--> 959             values[indexer] = value
960 
961         return self

ValueError: shape mismatch: value array of shape (6954,6) could not 
be broadcast to indexing result of shape (6954,4)

Next I tried this:

from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder(handle_unknown='ignore')

cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = oneh.transform(cat_test)

I got this DataFrame, but it is not what I want:

HomePlanet  CryoSleep   Destination VIP
2333    (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 
7)\t1.0\n ...
2589    (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ...   (0, 0)\t1.0\n (0, 
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 
7)\t1.0\n ...

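I also tried ColumnTransformer, but it did not give me back the DataFrame I want (I mean a DataFrame with the original columns that existed before OneHotEncoder; see cat_train above). These are the steps I took: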

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
    transformers=[("OneHotEncoder", OneHotEncoder(drop='first', 
sparse=False), cat)],
    remainder='passthrough'
)

cat_train = ct.fit_transform(cat_train)
cat_test = ct.transform(cat_test)

cat_train = pd.DataFrame(cat_train, columns=ct.get_feature_names())
cat_test = pd.DataFrame(cat_test, columns=ct.get_feature_names())

cat_train

The cat_train.head() I get is:

OneHotEncoder__x0_Europa    OneHotEncoder__x0_Mars  OneHotEncoder__x1_True  OneHotEncoder__x2_PSO J318.5-22 OneHotEncoder__x2_TRAPPIST-1e   OneHotEncoder__x3_True

0 0.0 0.0 0.0 0.0 1.0 0.0 1 0.0 0.0 0.0 1.0 0.0 2 1.0 0.0 1.0 0.0 0.0
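This is strange, because next I need to concatenate cat_train with num_train (and likewise for the test set). When I do that, a lot of NaN values appear, even though I had already imputed every NaN. Any ideas?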

0s7z1bwu (answer #1)

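The first error occurs because you are trying to assign the one-hot encoded data, which has more columns than the original, back into the same original columns. You need to add these dummy columns and drop the original ones. In any case, applying fit_transform to both train and test (assuming the repeated train line is a typo) is a bad idea.
The second error appears to be because the one-hot encoded data is sparse. You can pass sparse=False to OneHotEncoder to fix that, but you will probably then run into the same problem as above.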
The best approach is to use ColumnTransformer; it handles all of the concatenation for you.

i2byvkas (answer #2)

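I managed to find a solution. I had actually been trying to get back the original columns as they were before OneHotEncoder (I had 4 columns, so I assumed those same 4 should come back), which is generally not possible. In my case every cat_train column has several distinct categories (more than one), so the output of OneHotEncoder necessarily has more columns than the input. Based on that, I rewrote the code as follows: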
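To make the column count concrete, here is a small sketch on a toy frame that uses the category values visible in the question's output. The variable names full, reduced, and n_categories are illustrative. It counts the categories per column and compares the encoded widths with and without drop='first'.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame: 3, 2, 3 and 2 distinct categories in the four columns.
cat_train = pd.DataFrame(
    {
        "HomePlanet": ["Earth", "Europa", "Mars"],
        "CryoSleep": ["False", "True", "False"],
        "Destination": ["TRAPPIST-1e", "55 Cancri e", "PSO J318.5-22"],
        "VIP": ["False", "False", "True"],
    }
)

full = OneHotEncoder().fit(cat_train)
reduced = OneHotEncoder(drop="first").fit(cat_train)

# categories_ holds one array of learned categories per input column.
n_categories = [len(c) for c in full.categories_]
width_full = full.transform(cat_train).shape[1]        # one column per category
width_reduced = reduced.transform(cat_train).shape[1]  # one fewer per input column

print(n_categories, width_full, width_reduced)
```

Here the 4 input columns become 10 encoded columns, or 6 with drop='first', which matches the (6954, 6) array in the error message and the ten one-hot columns in the result further down.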

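import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

feats = df.drop(["Transported"], axis=1)
target = df["Transported"]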

X_train, X_test, y_train, y_test = train_test_split(feats, target, 
test_size = 0.2, random_state=42)

Separate the numeric columns from the categorical ones

import numpy as np
num_train = X_train.select_dtypes(include=[np.number])
cat_train = X_train.select_dtypes(exclude=[np.number])
num_test = X_test.select_dtypes(include=[np.number])
cat_test = X_test.select_dtypes(exclude=[np.number])

Fill in the missing values

num_imp = SimpleImputer(strategy='median')
num_train = num_imp.fit_transform(num_train)
num_test = num_imp.transform(num_test)
cat_imp = SimpleImputer(strategy='most_frequent')
cat_train = cat_imp.fit_transform(cat_train)
cat_test = cat_imp.transform(cat_test)

Encode the categorical variables

cat_enc = OneHotEncoder(handle_unknown='ignore')
cat_train = cat_enc.fit_transform(cat_train)
cat_test = cat_enc.transform(cat_test)

Now the magic part: rebuilding the training and test sets

X_train = pd.concat([pd.DataFrame(num_train), 
pd.DataFrame(cat_train.toarray())], axis=1)

X_test = pd.concat([pd.DataFrame(num_test), 
pd.DataFrame(cat_test.toarray())], axis=1)

The DataFrame now looks the way it should

X_train.head()

    0   1   2   3   4   5   0   1   2   3   4   5   6   7   8   9
0   28.0    0.0 55.0    0.0 656.0   0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0  0.0
1   17.0    0.0 1195.0  31.0    0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
2   28.0    0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3   20.0    0.0 2.0 289.0   976.0   0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
4   36.0    0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
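The concatenation above works because both parts are rebuilt from arrays and therefore share the same default 0..n-1 index. The NaN values described in the question are likely what happens when only one part is rebuilt this way while the other keeps the shuffled index from train_test_split, since pd.concat with axis=1 aligns rows by index label. A minimal sketch with made-up values and hypothetical column names:

```python
import pandas as pd

# Numeric part still carries the shuffled index produced by train_test_split.
num_train = pd.DataFrame({"Age": [28.0, 17.0]}, index=[2333, 2589])

# Encoded part rebuilt from a plain array gets a fresh 0..n-1 index.
encoded = [[1.0, 0.0], [0.0, 1.0]]
cat_bad = pd.DataFrame(encoded, columns=["HomePlanet_Earth", "HomePlanet_Europa"])

# Index labels do not match, so concat takes their union and pads with NaN.
misaligned = pd.concat([num_train, cat_bad], axis=1)

# Reusing the original index lines the rows up again.
cat_ok = pd.DataFrame(encoded, columns=cat_bad.columns, index=num_train.index)
aligned = pd.concat([num_train, cat_ok], axis=1)

print(misaligned)
print(aligned)
```

Passing index=... when rebuilding a frame, or calling reset_index(drop=True) on both parts before concatenating, keeps the row count unchanged and avoids the spurious NaN rows.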
