XGBoost model prediction error: Input numpy.ndarray must be 2 dimensional

knpiaxh1, posted on 2023-03-12 in Other

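I have a model that was trained locally and deployed to an engine so that I can run inference by invoking the endpoint. When I try to make a prediction, I get the following exception.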

raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional

My model is an xgboost model with some preprocessing (encoding of categorical variables) and hyper-parameter tuning.

import pandas as pd
import pickle
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder 

# split df into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.1)

X_train.shape
(1000,21)

# Encode categorical variables  
cat_vars = ['cat1','cat2','cat3']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')

encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

X_train.shape
(1000,420)

# Define a xgboost regression model
model = XGBRegressor()

# Do hyper-parameter tuning
.....

# Fit model
model.fit(X_train, y_train)

The model object looks like this:

XGBRegressor(colsample_bytree=xxx, gamma=xxx,
             learning_rate=xxx, max_depth=x, n_estimators=xxx,
             subsample=xxx)

My test data is a string of float values that is converted into an array, because the data has to be passed as a numpy array.

testdata = [........., 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2000, 200, 85, 412412, 123, 41, 552, 50000, 512, 0.1, 10.0, 2.0, 0.05]

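I have tried reshaping the numpy array from 1-D to 2-D, but that does not work because the number of features in the test data then no longer matches the trained model.
My question is: how do I pass a numpy array whose length equals the number of features in the trained model? Is there a workaround? Locally, I can make predictions by passing the test data as a list.
Details of the inference script are here: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py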

Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
return fn(*args, **kwargs)
File "/opt/ml/code/inference.py", line 75, in predict_fn
prediction = model.predict(input_data)
File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 448, in predict
test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 404, in __init__
self._init_from_npy2d(data, missing, nthread)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 474, in _init_from_npy2d
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional

When I try to reshape the test data into a 2-D numpy array with testdata.reshape(-1,1), I get a feature_names mismatch exception.

File "/opt/ml/code/inference.py", line 75, in predict_fn
3n0u6hucsr-algo-1-qbiyg  |     prediction = model.predict(input_data)
3n0u6hucsr-algo-1-qbiyg  |   File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 456, in predict
3n0u6hucsr-algo-1-qbiyg  |     validate_features=validate_features)
3n0u6hucsr-algo-1-qbiyg  |   File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1284, in predict
3n0u6hucsr-algo-1-qbiyg  |     self._validate_features(data)
3n0u6hucsr-algo-1-qbiyg  |   File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1690, in _validate_features
3n0u6hucsr-algo-1-qbiyg  |     data.feature_names))
3n0u6hucsr-algo-1-qbiyg  | ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',

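Update: I can retrieve the model's feature names by running model.get_booster().feature_names. Is there a way to take these names and assign them to the test data point so that the two are consistent?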

['f0', 'f1', 'f2', 'f3', 'f4', 'f5',......'f417','f418','f419']
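
A note on the update: names of the form 'f0' ... 'f419' appear to be the defaults XGBoost generates when a model is fitted on an unnamed array or sparse matrix, so what has to match is the column count (420 here), not a set of assigned names. Below is a minimal sketch of a guard for the inference side, assuming the payload arrives as a flat list of already-encoded values. The helper `to_model_input` is hypothetical and is not part of the linked inference script.

```python
import numpy as np

def to_model_input(values, expected_features):
    """Turn a flat sequence of encoded values into a (1, n_features) array.

    Hypothetical helper: raises ValueError if the length does not match
    the number of features the model was trained on.
    """
    arr = np.asarray(values, dtype=float)
    if arr.ndim != 1:
        raise ValueError(f"expected a flat sequence, got shape {arr.shape}")
    if arr.shape[0] != expected_features:
        raise ValueError(
            f"expected {expected_features} features, got {arr.shape[0]}"
        )
    return arr.reshape(1, -1)  # one row, expected_features columns

# Default-style names as shown above; in practice this list would come from
# model.get_booster().feature_names.
feature_names = [f"f{i}" for i in range(420)]

encoded_point = [0.0] * len(feature_names)  # placeholder values
row = to_model_input(encoded_point, len(feature_names))
print(row.shape)  # (1, 420)
```

Failing early with an explicit length check gives a clearer message than the feature_names mismatch raised inside xgboost.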

Answer #1 (yhived7q)

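I think the solution is to provide the test data in the same data type as the training data.
Thanks for the comment. As additional information: after encoding, X_train is a scipy.sparse.csr.csr_matrix and y_train is a pandas Series. If memory is not a constraint, both can be converted to numpy arrays with: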

model.fit(X_train.toarray(), y_train.to_numpy())
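
Here is a sketch of the same idea at prediction time, assuming the raw (unencoded) observation is available as a one-row DataFrame. Passing it through the fitted encoder and then densifying gives it the same columns and type as the training matrix. The toy columns and the `to_dense` helper are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy training frame (hypothetical data standing in for X_train).
train = pd.DataFrame({
    "cat1": ["a", "b", "c", "a"],
    "num1": [1.0, 2.0, 3.0, 4.0],
})

encoder = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["cat1"])],
    remainder="passthrough",
).fit(train)

def to_dense(matrix):
    """Return a dense 2-D numpy array whether the input is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

X_train_dense = to_dense(encoder.transform(train))

# A single raw observation goes through the same fitted encoder,
# so it ends up with the same number of columns as the training matrix.
raw_row = pd.DataFrame([{"cat1": "b", "num1": 5.0}])
X_row_dense = to_dense(encoder.transform(raw_row))

print(X_train_dense.shape, X_row_dense.shape)
```

With this approach the endpoint would receive the raw feature values and the encoder would have to be saved and loaded alongside the model. Whether that fits the deployed inference script is an assumption.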

Answer #2 (s6fujrry)

Try this:

import numpy as np
model.predict(np.array([[x1, x2, x3]]))

where x1, x2, x3 are your features and model is your xgboost model.
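
To make the shape requirement concrete, here is a small sketch using only numpy. The zero-filled vector is a placeholder for the encoded test point, and 420 is the column count reported in the question. `reshape(-1, 1)` produces 420 rows with one feature each, which would explain the feature_names mismatch. Wrapping the vector in an outer list or calling `reshape(1, -1)` produces one row with 420 features.

```python
import numpy as np

n_features = 420             # columns after encoding, per the question
flat = np.zeros(n_features)  # placeholder standing in for testdata

as_column = flat.reshape(-1, 1)  # 420 samples, 1 feature each
as_row = flat.reshape(1, -1)     # 1 sample, 420 features
wrapped = np.array([flat])       # same shape as as_row

print(as_column.shape)  # (420, 1)
print(as_row.shape)     # (1, 420)
print(wrapped.shape)    # (1, 420)
```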
