pandas: error when appending data to an existing DataFrame to retrain a model

Posted by h22fl7wq on 2022-11-20 in Other

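I added more data to my X_train and y_train data so that I could retrain the model on a larger dataset, using pd.concat() to do so. However, when I train the model on the concatenated dataset, I get the following error: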

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:1692: 
FutureWarning: Feature names only support names that are all strings. Got feature 
names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  FutureWarning,
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-166-a11464987b97> in <module>
----> 1 model1_pool_preds = model1(LinearSVC(class_weight='balanced', 
random_state=42), OneVsRestClassifier, X_train_init_new, y_train_init_new, 
X_test_init, y_test_init, X_pool)

6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __array__(self, 
dtype)
   1991 
   1992     def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993         return np.asarray(self._values, dtype=dtype)
   1994 
   1995     def __array_wrap__(

 ValueError: could not convert string to float:

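I assume this is because the data I appended to the existing DataFrame contains some strings rather than floats. How can I convert the entire dataset to float? My code is below: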

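import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Append the newly labeled pool rows to the existing training data:
# the last 7 columns go to y, the first 27446 columns go to X.
y_train_init_new = pd.concat([y_train_init, X_pool_labeled.iloc[:, -7:]])
X_train_init_new = pd.concat([X_train_init, X_pool_labeled.iloc[:, 0:27446]])

def model1(model, classifier, X, y, X_test, y_test, X_pool):
    # Wrap the base estimator (here in OneVsRestClassifier) and fit it.
    clf = classifier(model)
    clf.fit(X, y)
    clf_predictions = clf.predict(X_test)
    print(classification_report(y_test, clf_predictions, zero_division=0))

    clf_roc_auc = roc_auc_score(y_test, clf_predictions, multi_class='ovr')
    print('AUC: ', clf_roc_auc)
    # Predict labels for the unlabeled pool.
    return clf.predict(X_pool)

# Same call as in the traceback above, using the concatenated training data.
model1_pool_preds = model1(LinearSVC(class_weight='balanced', random_state=42),
                           OneVsRestClassifier, X_train_init_new, y_train_init_new,
                           X_test_init, y_test_init, X_pool)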

How can I convert all the data in the concatenated dataset to float?

Answer #1, by 9gm1akwq

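Given a DataFrame that consists entirely of strings, all of which can be converted to numbers without error, you can simply call df.astype(float) on the whole thing: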

>>> df = pd.DataFrame([str(i) for i in range(0, 1000)], columns=['x'])
>>> df
       x
0      0
1      1
2      2
3      3
4      4
..   ...
995  995
996  996
997  997
998  998
999  999

[1000 rows x 1 columns]

>>> df.astype(float)
         x
0      0.0
1      1.0
2      2.0
3      3.0
4      4.0
..     ...
995  995.0
996  996.0
997  997.0
998  998.0
999  999.0

[1000 rows x 1 columns]
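When astype(float) fails, it may help to find out which columns hold the offending values first. The following is only a minimal sketch of one way to do that: it coerces each column with pd.to_numeric(errors="coerce") and counts the values that turn into NaN as a result. The helper name non_numeric_counts and the sample frame are made up for illustration and are not part of the question's data.

```python
import pandas as pd

def non_numeric_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the values that cannot be parsed as numbers."""
    # Unparseable values become NaN instead of raising an error.
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # A value is "bad" if it was present before coercion but is NaN afterwards.
    bad = coerced.isna() & df.notna()
    return bad.sum()

# Hypothetical sample data: column "b" contains an empty string and a word.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": ["0.5", "", "abc"],
    "c": [1.0, 2.0, 3.0],
})
counts = non_numeric_counts(df)
print(counts[counts > 0])
```

Columns with a non-zero count are the ones that would make astype(float), and therefore clf.fit, raise.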

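If the DataFrame also contains non-numeric columns, this is harder. Since such columns cannot be used for training anyway, just drop them and then call astype(float) on the remaining columns.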
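A possible way to do the dropping is sketched below, under the assumption that a column counts as non-numeric as soon as any of its values fails to parse. The helper to_float_frame and the sample frame are hypothetical. The sketch also converts the column names to strings, because the FutureWarning in the question reports feature names of mixed int and str types.

```python
import pandas as pd

def to_float_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns whose values all parse as numbers, cast to float."""
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # A column is usable if coercion introduced no new missing values.
    usable = ~(coerced.isna() & df.notna()).any()
    out = coerced.loc[:, usable].astype(float)
    # scikit-learn expects feature names that are all strings.
    out.columns = out.columns.astype(str)
    return out

# Hypothetical sample data with mixed int/str column names.
df = pd.DataFrame({0: ["1", "2"], 1: ["3.5", "4"], "label": ["cat", "dog"]})
clean = to_float_frame(df)
print(clean.dtypes)
```

Note that this silently discards every column containing even one unparseable value, so it is worth checking which columns were dropped before retraining.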
