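I was following this tutorial to build a machine learning model. It works fine on the iris dataset, but when I try to use my own CSV (for a project), it gives me an error. The same thing happened when I tried a different, unrelated approach. (The remaining details are at the bottom.) Here is my code: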
# Python version
import sys
from sklearn.metrics import make_scorer
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
# Load dataset
url = "energy.csv"
# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:8]
y = array[:, 8]
print(y)
My CSV:
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070
When I print y on the last line, this is what I get:
[ 19.9948 0. 0. 0. 0. 0. 0.
0. 0. 260.2 1326.9 nan nan nan
nan nan 723.18 2070. ]
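I don't think this should happen (the 'nan' thing). I don't have much experience in this field, so any pointers on what is going on would be appreciated. Thanks in advance.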
1 Answer
kpbwa7wx
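Not all rows in your CSV contain data in all 10 columns. When data is missing, pandas represents it as NaN (short for "Not a Number").
See rows 11-15 (counting from 0). You can find more information about NaN in pandas in this FAQ.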
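The uneven row lengths can be checked without pandas. The following is a minimal sketch using only the standard library; it assumes the CSV is exactly the text shown in the question, and `field_counts` is a hypothetical helper name introduced here for illustration.

```python
import csv
import io

# CSV copied from the question so the sketch is self-contained.
CSV_TEXT = """\
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070
"""


def field_counts(text):
    """Return the number of comma-separated fields in each non-empty row."""
    return [len(row) for row in csv.reader(io.StringIO(text)) if row]


counts = field_counts(CSV_TEXT)
widest = max(counts)

# 0-based positions of the rows that have fewer fields than the widest row.
short_rows = [i for i, n in enumerate(counts) if n < widest]

print("fields per row:", counts)
print("rows with fewer than", widest, "fields:", short_rows)
```

Judging only from the data in the question, the 10-field rows seem to contain a value such as `28,564` that is split at what looks like a thousands-separator comma, while the 9-field rows write the same kind of value as `25800`. That is an inference from the file, not something stated above; if it holds, exporting the CSV without thousands separators (or with quoted fields) would likely give every row the same number of fields.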
**Edit:** The whole DataFrame may be hard to read, so below I show only the rows from 10 onward and the columns from 5 onward.
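A slice like the one described can be reproduced roughly as follows. This is a sketch under assumptions: the CSV text is copied from the question, and the column names `c0` to `c9` are hypothetical placeholders (one per field in the widest row) rather than the names used in the question. With only 9 names, pandas appears to treat the first field as the index, which would explain why the printed `y` starts with 19.9948.

```python
import io

import pandas as pd

# CSV copied from the question so the sketch is self-contained.
CSV_TEXT = """\
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070
"""

# Hypothetical placeholder names, one per field in the widest row (10).
cols = [f"c{i}" for i in range(10)]
df = pd.read_csv(io.StringIO(CSV_TEXT), names=cols)

# Rows 10 onward and columns 5 onward: small enough to read at a glance.
tail = df.iloc[10:, 5:]
print(tail)

# 0-based labels of the rows that contain at least one missing value.
nan_rows = df[df.isna().any(axis=1)].index.tolist()
print("rows containing NaN:", nan_rows)
```

Rows that have only 9 fields are padded with NaN in the last column, so `isna()` flags exactly those rows.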