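I have a model that was trained locally and deployed to the engine so that I can run inference / invoke the endpoint. When I try to make a prediction, I get the following exception.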
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
My model is an xgboost model with some preprocessing (variable encoding) and hyperparameter tuning.
import pandas as pd
import pickle
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# split df into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.1)
X_train.shape
(1000,21)
# Encode categorical variables
cat_vars = ['cat1','cat2','cat3']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')
encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
X_train.shape
(1000,420)
# Define a xgboost regression model
model = XGBRegressor()
# Do hyper-parameter tuning
.....
# Fit model
model.fit(X_train, y_train)
The model object looks like this:
XGBRegressor(colsample_bytree=xxx, gamma=xxx,
learning_rate=xxx, max_depth=x, n_estimators=xxx,
subsample=xxx)
My test data is a string of float values that is converted into an array, because the data must be passed as a numpy array.
testdata = [........., 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2000, 200, 85, 412412, 123, 41, 552, 50000, 512, 0.1, 10.0, 2.0, 0.05]
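I tried reshaping the numpy array from 1D to 2D, but that does not work because the number of features in the test data and in the trained model do not match.
My question is: how do I pass a numpy array whose length equals the number of features in the trained model? Is there a workaround? I am able to make predictions locally by passing the test data as a list.
For details of the inference script, see: https://github.com/aws-samples/amazon-sagemaker-local-mode/blob/main/xgboost_script_mode_local_training_and_serving/code/inference.py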
Traceback (most recent call last):
File "/miniconda3/lib/python3.6/site-packages/sagemaker_containers/_functions.py", line 93, in wrapper
return fn(*args, **kwargs)
File "/opt/ml/code/inference.py", line 75, in predict_fn
prediction = model.predict(input_data)
File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 448, in predict
test_dmatrix = DMatrix(data, missing=self.missing, nthread=self.n_jobs)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 404, in __init__
self._init_from_npy2d(data, missing, nthread)
File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 474, in _init_from_npy2d
raise ValueError('Input numpy.ndarray must be 2 dimensional')
ValueError: Input numpy.ndarray must be 2 dimensional
When I try to reshape the test data into a 2D numpy array with testdata.reshape(-1,1), I get a feature_names mismatch exception.
File "/opt/ml/code/inference.py", line 75, in predict_fn
3n0u6hucsr-algo-1-qbiyg | prediction = model.predict(input_data)
3n0u6hucsr-algo-1-qbiyg | File "/miniconda3/lib/python3.6/site-packages/xgboost/sklearn.py", line 456, in predict
3n0u6hucsr-algo-1-qbiyg | validate_features=validate_features)
3n0u6hucsr-algo-1-qbiyg | File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1284, in predict
3n0u6hucsr-algo-1-qbiyg | self._validate_features(data)
3n0u6hucsr-algo-1-qbiyg | File "/miniconda3/lib/python3.6/site-packages/xgboost/core.py", line 1690, in _validate_features
3n0u6hucsr-algo-1-qbiyg | data.feature_names))
3n0u6hucsr-algo-1-qbiyg | ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',
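A note on the reshape call used above: `reshape(-1, 1)` produces one row per value, so a 420-value vector becomes 420 samples with a single feature each, which would be consistent with the mismatch shown in the traceback. A single sample with 420 features needs `reshape(1, -1)` instead. A minimal sketch, assuming the test vector holds exactly 420 values like the encoded training data (the zeros are placeholders, not the real test data):

```python
import numpy as np

N_FEATURES = 420  # width of the encoded X_train reported in the question

# Placeholder vector standing in for the real testdata (assumed length 420).
testdata = np.zeros(N_FEATURES, dtype=float)

as_column = testdata.reshape(-1, 1)  # 420 rows, 1 feature each
as_row = testdata.reshape(1, -1)     # 1 row, 420 features

print(as_column.shape)
print(as_row.shape)
```

Only the second form has the same number of columns as the data the model was fitted on.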
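Update: I can retrieve the model's feature names by running model.get_booster().feature_names. Is there a way to take these names and assign them to the test data point so that the two are consistent?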
['f0', 'f1', 'f2', 'f3', 'f4', 'f5',......'f417','f418','f419']
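As far as I know, when a plain numpy array is passed, xgboost labels its columns with default names of this same `f0`, `f1`, ... form, so in this situation the names should line up once the column count does. A minimal sketch of a check that could run before `model.predict`; `check_feature_count` is a hypothetical helper, and the name list is rebuilt by hand here rather than read from a real booster:

```python
import numpy as np

def check_feature_count(row, feature_names):
    """Return row as a (1, n) float array, or raise if n differs from the expected feature count."""
    arr = np.asarray(row, dtype=float)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # one sample, n features
    if arr.ndim != 2:
        raise ValueError("expected a 1D or 2D input, got %d dimensions" % arr.ndim)
    if arr.shape[1] != len(feature_names):
        raise ValueError(
            "input has %d features, model expects %d" % (arr.shape[1], len(feature_names))
        )
    return arr

# Stand-in for model.get_booster().feature_names (assumed: 'f0' .. 'f419').
feature_names = ["f%d" % i for i in range(420)]

checked = check_feature_count(np.zeros(420), feature_names)
print(checked.shape)
```

A vector of the wrong length fails here with a short message instead of the long feature_names listing.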
2 Answers
yhived7q1#
I think the solution is to provide the test data in the same data type as the training data.
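Thanks for your comment. As additional information: after encoding, X_train is of type scipy.sparse.csr.csr_matrix and y_train is a pandas Series. If memory is not a constraint, both can be converted to numpy arrays using the following command: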
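One way such a conversion could look, as a sketch only: it assumes scipy's `toarray()` for the sparse matrix and pandas' `to_numpy()` for the Series, and it uses small made-up data rather than the data from the question.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up stand-ins for the encoded X_train and for y_train.
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                       [0.0, 3.0, 0.0]]))
y_series = pd.Series([10.0, 20.0])

X_dense = X_sparse.toarray()   # dense 2D numpy.ndarray
y_array = y_series.to_numpy()  # 1D numpy.ndarray

print(type(X_dense), X_dense.shape)
print(type(y_array), y_array.shape)
```

Densifying stores every zero explicitly, which is why the memory caveat matters for wide one-hot encoded data.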
s6fujrry2#
Try this:
where x1, x2, x3 are your features and xgboost_model is your model.