Not all rows are read when importing a CSV into a pandas DataFrame

f1tvaqid  posted on 2022-12-15  in  Other

I'm trying the Kaggle challenge here, and unfortunately I'm stuck at a very basic step. I'm trying to read the dataset into a pandas DataFrame by running the following command:

test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")

The problem is that, as you can check for yourself, this file has over 300,000 records, but I'm only reading 7,945 of them.

print (test.shape)
(7945, 21)

I have double-checked the file and found nothing special about line 7945. Any hints?

puruo6ea  #1

I think a better approach is to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False.

import pandas as pd
import csv

test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)

print (test.shape)
#(381422, 22)
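On recent pandas versions the call above needs a small change: `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0, and `on_bad_lines="skip"` is its replacement. The sketch below uses a small made-up CSV (the `SAMPLE` string is purely hypothetical, not taken from the Kaggle file) to show what the two options do to a quoted field that spans several lines.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: one record whose quoted body spans three physical lines.
SAMPLE = (
    "Id,Subject,Body\n"
    '1,Hello,"first line\n'
    "second line\n"
    'third, line, with, extra, commas"\n'
    "2,Bye,short\n"
)

# Default quoting: the quoted multi-line body stays inside one record.
default_df = pd.read_csv(io.StringIO(SAMPLE))

# QUOTE_NONE: quotes are ordinary characters, so every physical line is
# parsed as its own row. The line with too many fields is dropped by
# on_bad_lines="skip"; the line with too few fields is padded with NaN.
raw_df = pd.read_csv(
    io.StringIO(SAMPLE),
    quoting=csv.QUOTE_NONE,
    on_bad_lines="skip",
)

print(default_df.shape)
print(raw_df.shape)
```

With this sample the default call yields two rows and the QUOTE_NONE call yields three, which suggests why the row count of the real file changes so much between the two ways of reading it.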

However, some data (the problematic rows) will be skipped.
If you want to skip the email body data, you can use:

import pandas as pd
import csv

test = pd.read_csv(
    "output/Emails.csv",
    quoting=csv.QUOTE_NONE,
    sep=',',
    error_bad_lines=False,
    header=None,
    names=[
        "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
        "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
        "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
        "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
        "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
        "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
        "ExtractedBodyText", "RawText"])

print (test.shape)

#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']
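To make the two cleanup steps easier to follow, here is a minimal sketch on a tiny made-up frame (the rows are hypothetical stand-ins for the parsed file). Because the file is read with `header=None` and explicit `names`, the file's own header line comes back as an ordinary data row, which is why the second filter is needed.

```python
import pandas as pd

# Hypothetical rows standing in for the parsed file: the header line read
# as data, a real email, a body fragment with no sender, another real email.
test = pd.DataFrame({
    "Id": ["Id", "1", "fragment", "2"],
    "MetadataFrom": ["MetadataFrom", "H", None, "Abedin, Huma"],
})

# Drop rows where MetadataFrom is missing (body fragments).
test = test.dropna(subset=["MetadataFrom"])
# Drop the header line that was read as a data row.
test = test[test.MetadataFrom != "MetadataFrom"]

print(test)
```

Only the rows with Id "1" and "2" remain after both filters.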
