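I've been trying to work out how to turn the result of OneHotEncoder back into a DataFrame. My approach to separating the numeric columns from the categorical ones is shown below: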
feats = df.drop(["Transported"], axis=1)
target = df["Transported"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size = 0.2,
random_state=42)
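After the split, I needed to separate the numeric columns from the categorical ones for the training set (and likewise for the test set), which I did like this: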
num_train = X_train.select_dtypes(include=['float64', 'int64'])
cat_train = X_train.select_dtypes(include=['object'])
num_test = X_test.select_dtypes(include=['float64', 'int64'])
cat_test = X_test.select_dtypes(include=['object'])
After that, I applied SimpleImputer, and it worked:
import numpy as np
from sklearn.impute import SimpleImputer
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
num = ["Age", "RoomService", "FoodCourt", "ShoppingMall","Spa","VRDeck"]
num_train.loc[:,num] = imputer_median.fit_transform(num_train[num])
num_test.loc[:,num] = imputer_median.transform(num_test[num])
cat = ["HomePlanet", "CryoSleep", "Destination","VIP"]
cat_train.loc[:,cat] = imputer_most_frequent.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = imputer_most_frequent.transform(cat_test[cat])
Here is the head of cat_train:
cat_train.head()
HomePlanet CryoSleep Destination VIP
2333 Earth False TRAPPIST-1e False
2589 Earth False TRAPPIST-1e False
8302 Europa True 55 Cancri e False
8177 Mars False TRAPPIST-1e False
500 Europa True 55 Cancri e False
After this, however, I need to apply OneHotEncoder, like this:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder( drop='first',sparse=False)
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
I got this error:
shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result
of shape (6954,4)
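I have tried several approaches, but each time I fail to get a DataFrame back after OneHotEncoder. Please help: I am stuck at this point and cannot continue with the rest of the work. Thanks in advance.
Here is the full traceback: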
ValueError Traceback (most recent
call last)
~\AppData\Local\Temp\ipykernel_16200\2252764984.py in <module>
3 oneh = OneHotEncoder( drop='first',sparse=False)
4
----> 5 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
6 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
__setitem__(self, key, value)
714
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value,
self.name)
717
718 def _validate_key(self, key, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_with_indexer(self, indexer, value, name)
1691 self._setitem_with_indexer_split_path(indexer, value, name)
1692 else:
-> 1693 self._setitem_single_block(indexer, value, name)
1694
1695 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_single_block(self, indexer, value, name)
1941
1942 # actually do the set
-> 1943 self.obj._mgr =
self.obj._mgr.setitem(indexer=indexer, value=value)
1944 self.obj._maybe_update_cacher(clear=True,
inplace=True)
1945
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
setitem(self, indexer, value)
335 For SingleBlockManager, this backs s[indexer] = value
336 """
--> 337 return self.apply("setitem", indexer=indexer,
value=value)
338
339 def putmask(self, mask, new, align: bool = True):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
apply(self, f, align_keys, ignore_failures, **kwargs)
302 applied = b.apply(f, **kwargs)
303 else:
--> 304 applied = getattr(b, f)(**kwargs)
305 except (TypeError, NotImplementedError):
306 if not ignore_failures:
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in
setitem(self, indexer, value)
957 else:
958 value = setitem_datetimelike_compat(values,
len(values[indexer]), value)
--> 959 values[indexer] = value
960
961 return self
ValueError: shape mismatch: value array of shape (6954,6) could not
be broadcast to indexing result of shape (6954,4)
Next, I tried this:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder(handle_unknown='ignore')
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = oneh.transform(cat_test)
I got this DataFrame, but it is not what I wanted:
HomePlanet CryoSleep Destination VIP
2333 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
2589 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
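I also tried ColumnTransformer, but it did not give me back the DataFrame I wanted (that is, a DataFrame with the original columns as they were before OneHotEncoder; see cat_train above). These are the steps I took: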
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
transformers=[("OneHotEncoder", OneHotEncoder(drop='first',
sparse=False), cat)],
remainder='passthrough'
)
cat_train = ct.fit_transform(cat_train)
cat_test = ct.transform(cat_test)
cat_train = pd.DataFrame(cat_train, columns=ct.get_feature_names())
cat_test = pd.DataFrame(cat_test, columns=ct.get_feature_names())
cat_train
The cat_train.head() I got was:
OneHotEncoder__x0_Europa OneHotEncoder__x0_Mars OneHotEncoder__x1_True OneHotEncoder__x2_PSO J318.5-22 OneHotEncoder__x2_TRAPPIST-1e OneHotEncoder__x3_True
0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 1.0 0.0 0.0
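This is odd, because next I need to concatenate cat_train with num_train (and the same for the test set). When I did that, a lot of NaN values appeared, even though I had already imputed every NaN beforehand. Any ideas?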
2 Answers
0s7z1bwu #1
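The first error occurs because you are trying to assign the one-hot encoded data, which has more columns than the original, back to the same original columns. You need to add these dummy columns and drop the original ones. In any case, applying fit_transform to both train and test (assuming the duplicated train line is a typo) is a bad idea: the encoder should be fitted on the training set only and then applied to the test set with transform.
The second error appears to be because the one-hot encoded data is sparse. You can fix that by specifying sparse=False in OneHotEncoder, but you would probably then run into the same problem as above. The best approach is to use ColumnTransformer; it will handle all the concatenation for you.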
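i2byvkas #2
I managed to find a solution. I had been trying to get back the original columns as they were before OneHotEncoder (I had 4 columns, so I assumed those same 4 should come back), which is generally not possible. In my case every cat_train column has several distinct categories, so the result after OneHotEncoder necessarily has more columns than before (6 instead of 4 here). Based on that, I rewrote the code as follows: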
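Separate the numeric columns from the categorical ones
Fill in the missing values
Encode the categorical variables
Now the magic part: rebuild the training and test sets
The DataFrame now looks the way it should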