pandas: Using a categorical variable predictor in the XGBoost algorithm

Posted by cngwdvgl on 2023-04-28 in Other


I am trying to use a categorical predictor in the xgboost algorithm, but I keep getting an error. Below is the relevant part of my code.

import pandas as pd
import xgboost as xgb
from pandas.api.types import CategoricalDtype
from sklearn.model_selection import train_test_split

# .copy() avoids pandas' SettingWithCopyWarning on the column assignments below
df = data[["country_name", "Timestamp", "Flow Duration", "Flow IAT Min", "Src Port", "Tot Fwd Pkts", "Init Bwd Win Byts", "Label"]].copy()
df["country_name"] = df["country_name"].astype(CategoricalDtype(ordered=True))

X = df[["country_name", "Flow Duration", "Flow IAT Min", "Src Port", "Tot Fwd Pkts", "Init Bwd Win Byts"]]
# Map the string labels to 0/1 for binary classification
df["Label"] = df["Label"].replace(['benign', 'ddos'], [0, 1])
y = df["Label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model2 = xgb.XGBClassifier(tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False)

model2.fit(X_train, y_train)

I also tried using .astype("category"), which did not work either. When I run the last piece of code, I always get this error:

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.country_name

Any help would be greatly appreciated, thanks!!

pandas

Source: https://stackoverflow.com/questions/73178299/using-a-categorical-variable-predictor-in-xgboost-algorithm

2 Answers

7kqas0il #1

You can construct your DMatrix explicitly; that is where categorical support needs to be enabled.
For example:

import xgboost as xgb
from sklearn.model_selection import train_test_split

train_x, valid_x, train_y, valid_y = train_test_split(x_subfeatures, y_encoded, train_size=.75)

dtrain = xgb.DMatrix(
    train_x,
    label=train_y,
    # enable categorical data
    enable_categorical=True
)

dvalid = xgb.DMatrix(
    valid_x,
    label=valid_y,
    enable_categorical=True
)

2023-04-28

ukdjmx9f #2

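Ideally, you would check (and attach to the question) the .dtypes of all relevant predictor variables.
In this specific case, country_name is probably of object type, i.e. you need to encode this variable first.
For encoding, you can choose among these options: https://contrib.scikit-learn.org/category_encoders/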
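
A minimal sketch of that check, using only pandas, might look like the following. The small frame is a hypothetical stand-in for the asker's `X`. Converting to the "category" dtype and taking `.cat.codes` are shown as simple pandas-native options; they are alternatives to the category_encoders library linked above, not a demonstration of its API.

```python
import pandas as pd

# Hypothetical stand-in for the asker's predictors.
df = pd.DataFrame({
    "country_name": pd.Series(["US", "DE", "US", "CN"], dtype="object"),
    "Src Port": [443, 80, 22, 8080],
})

# Step 1: inspect the dtypes of every predictor.
print(df.dtypes)

# Step 2: find the object-typed columns, which XGBoost would reject.
object_cols = [c for c in df.columns if pd.api.types.is_object_dtype(df[c])]

# Step 3a: convert them to the pandas "category" dtype ...
for col in object_cols:
    df[col] = df[col].astype("category")

# Step 3b: ... or, alternatively, derive plain integer codes.
# Categories are sorted (CN, DE, US), so the codes follow that order.
encoded = df["country_name"].cat.codes
```

Integer codes impose an arbitrary ordering on the countries, so whether they are appropriate depends on the model and the data.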

2023-04-28