pandas: Creating a data frame from a text file

Posted by 8cdiaqws on 2022-12-09 in Other

I have a dataset of more than 1000 txt files that contain information about books:

The Project Gutenberg EBook of Apocolocyntosis, by Lucius Seneca

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

Title: Apocolocyntosis

Author: Lucius Seneca

Release Date: November 10, 2003 [EBook #10001]
[Date last updated: April 9, 2005]

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK APOCOLOCYNTOSIS ***

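I am trying to use pandas to read these files and build a data frame from them, with title, author, release date, and language as the columns and their values, but so far I keep running into errors.

Reading from a single file: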

df = pd.read_csv('dataset/10001.txt')

Error:

ParserError                               Traceback (most recent call last)
Input In [30], in <cell line: 1>()
----> 1 df = pd.read_csv('dataset/10001.txt')

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\parsers\readers.py:680, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    665 kwds_defaults = _refine_defaults_read(
    666     dialect,
    667     delimiter,
   (...)
    676     defaults={"delimiter": ","},
    677 )
    678 kwds.update(kwds_defaults)
--> 680 return _read(filepath_or_buffer, kwds)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\parsers\readers.py:581, in _read(filepath_or_buffer, kwds)
    578     return parser
    580 with parser:
--> 581     return parser.read(nrows)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\parsers\readers.py:1254, in TextFileReader.read(self, nrows)
   1252 nrows = validate_integer("nrows", nrows)
   1253 try:
-> 1254     index, columns, col_dict = self._engine.read(nrows)
   1255 except Exception:
   1256     self.close()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\parsers\c_parser_wrapper.py:225, in CParserWrapper.read(self, nrows)
    223 try:
    224     if self.low_memory:
--> 225         chunks = self._reader.read_low_memory(nrows)
    226         # destructive to chunks
    227         data = _concatenate_chunks(chunks)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\_libs\parsers.pyx:805, in pandas._libs.parsers.TextReader.read_low_memory()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\_libs\parsers.pyx:861, in pandas._libs.parsers.TextReader._read_rows()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\_libs\parsers.pyx:847, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\_libs\parsers.pyx:1960, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 60, saw 3

Source: https://stackoverflow.com/questions/74725071/creating-data-frame-from-text-file

1 answer

The following code shows how to extract the data from one file.
Provided all the files share the same format, this should be fairly efficient.

re.compile: builds the regex used to find each item of interest.
release_date needs some extra manipulation because of the extra text on that line.
You can add a for-loop to work through the 1000+ books.
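
The release-date step noted above can be looked at in isolation. This is a minimal sketch on a single hard-coded line copied from the sample header; the final strptime call is an optional extra and assumes every file writes the date as "Month day, year".

```python
import re
from datetime import datetime

# One line copied from the sample header in the question.
line = "Release Date: November 10, 2003 [EBook #10001]\n"

release_date = re.compile(r"Release Date: (.*)\s")

# The capture runs to the end of the line, so it still carries the bracketed note.
captured = release_date.search(line).group(1)

# Keep only the part before " [", which is the date itself.
date_text = captured.split(" [")[0]

# Optional and an assumption about the date format: turn the text into a date object.
parsed = datetime.strptime(date_text, "%B %d, %Y").date()

print(captured)
print(date_text)
print(parsed)
```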

Code:

import re
import pandas as pd

# Read the whole file into one string
with open('dataset/10001.txt', 'r') as text_file:
    text = text_file.read()

# Compiled patterns can be reused for each book
title = re.compile(r'Title: (.*)\n')
author = re.compile(r'Author: (.*)\n')
release_date = re.compile(r'Release Date: (.*)\s')

# group(1) is the text captured after the field label
book_title = title.search(text).group(1)
book_author = author.search(text).group(1)
# Drop the trailing " [EBook #...]" note from the release-date line
book_release = release_date.search(text).group(1).split(' [')[0]

# One-row data frame for this book
df = pd.DataFrame({"Title": [book_title], "Author": [book_author], "Release_Date": [book_release]})
print(df)

Output:

A one-row data frame with Title "Apocolocyntosis", Author "Lucius Seneca", and Release_Date "November 10, 2003".

The input file (labeled data.txt in the answer) has the same contents as the Project Gutenberg header shown in the question.
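
The for-loop suggested in the answer could look something like the sketch below. It is not the answerer's code: the helper names extract_metadata and build_frame are hypothetical, the line-anchored patterns are a variation on the answer's regexes, the Language field is added because the question asks for it, and reading as UTF-8 with errors="replace" is an assumption about the files.

```python
import re
from pathlib import Path

import pandas as pd

# One pattern per header field, anchored to the start of a line (re.MULTILINE).
FIELD_PATTERNS = {
    "Title": re.compile(r"^Title:[ \t]*(.+)$", re.MULTILINE),
    "Author": re.compile(r"^Author:[ \t]*(.+)$", re.MULTILINE),
    "Release_Date": re.compile(r"^Release Date:[ \t]*(.+)$", re.MULTILINE),
    "Language": re.compile(r"^Language:[ \t]*(.+)$", re.MULTILINE),
}


def extract_metadata(text):
    """Return a dict of header fields; a field that is absent maps to None."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1).strip() if match else None
    if row["Release_Date"]:
        # Drop the trailing " [EBook #...]" note, as the answer does.
        row["Release_Date"] = row["Release_Date"].split(" [")[0]
    return row


def build_frame(folder):
    """Read every .txt file in `folder` and return one row per file."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: undecodable bytes are replaced rather than aborting the loop.
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"File": path.name, **extract_metadata(text)})
    return pd.DataFrame(rows)
```

Calling build_frame("dataset"), with the folder name taken from the question, would give one row per book. Files that lack a field get None in that column instead of raising an error, which makes it easier to spot headers that deviate from the expected format.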
