numpy: CSV formatting is producing 'NaN' values, what should I do?

ep6jt1vc  posted on 2022-11-10  in  Other

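I am trying to build a machine learning model by following this tutorial. It works fine on the iris dataset, but when I use my own CSV (for a project), it gives me an error. The same thing happens when I try a different, unrelated approach. (The remaining details are at the bottom.) Here is my code: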


# Python version

import sys

from sklearn.metrics import make_scorer

print('Python: {}'.format(sys.version))

# scipy

import scipy

print('scipy: {}'.format(scipy.__version__))

# numpy

import numpy

print('numpy: {}'.format(numpy.__version__))

# matplotlib

import matplotlib

print('matplotlib: {}'.format(matplotlib.__version__))

# pandas

import pandas

print('pandas: {}'.format(pandas.__version__))

# scikit-learn

import sklearn

print('sklearn: {}'.format(sklearn.__version__))

# compare algorithms

from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Load dataset

url = "energy.csv"

# url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"

names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)

# Split-out validation dataset

array = dataset.values
X = array[:, 0:8]
y = array[:, 8]

print(y)

My CSV:

18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070

When I print y on the last line, this is the output I get:

[  19.9948    0.        0.        0.        0.        0.        0.
    0.        0.      260.2    1326.9          nan       nan       nan
       nan       nan  723.18   2070.    ]

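I don't think this is supposed to happen (the 'nan' values). I don't have much experience in this field, so any pointers on what is going on would be appreciated. Thanks in advance.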

kpbwa7wx  #1

Not all rows in the CSV contain data in all 10 columns. When a value is missing, pandas represents it as NaN (short for "Not a Number").

In [42]: df = pd.read_csv("/tmp/energy.csv", header=None)
In [43]: df
Out[43]: 
     0      1        2        3        4        5           6         7          8          9
0   18     28   564.00     0.00  6284.08  1713.84     19.9948  19994.80    19.9948    19.9948
1   17     28   411.00     0.00  6250.42   852.33      0.0000  20740.03   568.2200     0.0000
2   16     27   515.00     0.00  6053.30   550.30      0.0000  20361.10   550.3000     0.0000
3   15     24   586.00   491.72  5408.92   245.86      0.0000  17947.78   491.7200     0.0000
4   14     26   653.00   533.06  6130.19     0.00      0.0000  18923.63  1066.1200     0.0000
5   13     26   836.00   805.08  6172.28     0.00      0.0000  18785.20  1073.4400     0.0000
6   12     26    73.00  1303.65  5736.06     0.00      0.0000  17990.37  1042.9200     0.0000
7   11     27    55.00  1352.75  6222.65     0.00      0.0000  18397.40  1082.2000     0.0000
8   10     26   236.00  1311.80  6034.28     0.00      0.0000  17578.12  1311.8000     0.0000
9    9     26    20.00  1821.40  3903.00     0.00      0.0000  18994.60  1040.8000   260.2000
10   8     26   538.00     0.00  4246.08   265.38  13799.7600   6369.12     0.0000  1326.9000
11   7  25800  3354.00  5160.00     0.00     0.00  14964.0000   1290.00  1032.0000        NaN
12   6  26682  3468.66  5603.22     0.00     0.00  14941.9200   1600.92  1067.2800        NaN
13   5  24997  3499.58  5499.34     0.00     0.00  13248.4100   1499.82  1249.8500        NaN
14   4  25100  3765.00  4769.00     0.00     0.00  13052.0000   1506.00  2008.0000        NaN
15   3  24651  4190.67  4930.20     0.00     0.00  12325.5000   1232.55  1972.0800        NaN
16   2     12    53.00     0.00  1084.77     0.00   3133.7800   6508.62     0.0000   723.1800
17   1     11   500.00  2070.00  2415.00     0.00      0.0000   4255.00   690.0000  2070.0000

See rows 11-15. You can find more information about NaN in pandas in this FAQ.
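One way to confirm which rows are short is to count the comma-separated fields on each line before pandas gets involved. The following is a minimal sketch using only the standard library; `sample` holds three rows copied from the question's CSV, and `field_counts` is a hypothetical helper name, not part of any library.

```python
import csv
import io
from collections import Counter

# Three rows copied from the question's CSV (the ones starting with 8, 7 and 2).
sample = """8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
"""

def field_counts(text):
    """Return (line_number, number_of_fields) for every CSV row in text."""
    reader = csv.reader(io.StringIO(text))
    return [(i, len(row)) for i, row in enumerate(reader, start=1)]

counts = field_counts(sample)
print(counts)                         # per-line field counts
print(Counter(n for _, n in counts))  # how many lines have each width
```

Lines whose count differs from the majority are the ones that end up padded with NaN. For the real file, the same helper could be applied to the text read from `energy.csv`.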

**Edit:** The whole DataFrame may be hard to read. Below I show only rows 10 onward and columns 5 onward.

In [57]: df.iloc[10:, 5:]
Out[57]: 
         5         6        7        8        9
10  265.38  13799.76  6369.12     0.00  1326.90
11    0.00  14964.00  1290.00  1032.00      NaN
12    0.00  14941.92  1600.92  1067.28      NaN
13    0.00  13248.41  1499.82  1249.85      NaN
14    0.00  13052.00  1506.00  2008.00      NaN
15    0.00  12325.50  1232.55  1972.08      NaN
16    0.00   3133.78  6508.62     0.00   723.18
17    0.00      0.00  4255.00   690.00  2070.00
In [58]: df.iloc[11:16, 9]
Out[58]: 
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
Name: 9, dtype: float64

In [59]: df.iloc[11:16, 9].isna()
Out[59]: 
11    True
12    True
13    True
14    True
15    True
Name: 9, dtype: bool
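As for why some rows have 10 fields and others 9, one plausible reading (an assumption, not something confirmed in the question) is that the TOTAL value in the longer rows was written with a thousands separator, so `26,538` was split into `26` and `538`. Under that assumption, a minimal sketch of a repair is to rejoin the second and third fields whenever a row has one field too many. `merge_split_total` is a hypothetical helper, and `sample` again holds three rows copied from the question.

```python
import csv
import io

import pandas as pd

names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND',
         'NATURAL_GAS', 'COAL', 'OIL']

# Three rows copied from the question's CSV.
sample = """8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
"""

def merge_split_total(row, expected=len(names)):
    """Rejoin TOTAL when a row has exactly one extra field.

    ASSUMPTION: the extra field comes from a thousands comma inside TOTAL,
    e.g. "26,538" meaning 26538. Rows of any other width are returned as-is.
    """
    if len(row) == expected + 1:
        return [row[0], row[1] + row[2]] + row[3:]
    return row

rows = [merge_split_total(r) for r in csv.reader(io.StringIO(sample))]
fixed = pd.DataFrame(rows, columns=names).astype(float)

print(fixed)
print(fixed.isna().sum().sum())  # total number of NaN cells after the repair
```

If the assumption does not hold for your data, the cleaner fix is to re-export the CSV so that every row has the same number of fields, or to quote values that contain commas.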
