Python pandas read_csv not separating columns appropriately

Posted by mnemlml8 on 2022-12-06 in Python


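In one of my courses, we need to load and append six CSV files using Python. The files do not contain headers. I have spent several hours trying multiple approaches from different online guides, either combining the CSV files directly or reading them individually and then appending them. For what should be a simple task, I have run into many problems. When I tried an approach that joined the files right out of the gate, I got numerous error messages in response.
To confirm, I need to perform the following initial steps: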

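Load the CSV files from a local directory into a DataFrame
Add the provided headers that are missing from the CSVs
Append the six CSV files together into one consolidated dataset
(*not necessarily in this order)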

The data files are available here for reproduction: https://drive.google.com/drive/folders/1ZKBFbsUBNUhsWtVtsMqOtXKx4SL-pFnt?usp=sharing
These are the files we are using: [image]

I tried the following script to append all the CSVs together from the start, but received a large number of errors:

import pandas as pd
import glob
import os

# setting the path for joining multiple files
files = os.path.join("D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

print(files);

# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
print(df)

The script runs up to the files = glob.glob(files) part, since I can print the result.

Traceback (most recent call last):
  File "D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/IN498_M2_2.py", line 14, in <module>
    df = pd.concat(map(pd.read_csv, files), ignore_index=True)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 347, in concat
    op = _Concatenator(
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 401, in __init__
    objs = list(objs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

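I thought these errors might be related to the data layout or structure of the CSV files, so I tried to start by simply loading a single CSV file into a DataFrame and adding the headers to it. However, when I do this, the data appears to be recognized as a single column, as shown below, as if the columns were not separated by a delimiter, even though the file is comma-separated. This should be natively readable by pandas. So I suspected the problem might be the missing headers, or missing values in the dataset, but I do not know what is actually causing it. I have tried several ways to fix this, with no success. I tried various read_csv arguments, including names=headerslist, encoding, header=None, keep_default_na=False, sep=',', skiprows=[0], and a few others.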

import pandas as pd
import glob
import os
import csv

headerslist = ['Date','Package_Name','Country','Store_Listing_Visitors','Installers','Visitor-to-Installer_conversion_rate','Installers_retained_for_1_day','Installer-to-1_day_retention_rate','Installers_retained_for_7_days','Installer-to-7_days_retention_rate','Installers_retained_for_15_days','Installer-to-15_days_retention_rate','Installers_retained_for_30_days','Installer-to-30_days_retention_rate']

df = pd.read_csv('D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/retained_installers_com.foo.bar_201904_country.csv', keep_default_na=False, sep=',',  skiprows=[0], delimiter=None,  header=None, encoding='cp1252')

df2 = headerslist.append(df)

print(df)

I tried Adrian's solution, but it returned a strange error:

import csv
import glob
import os

files = os.path.join("D:/User Data/Dropbox/Kristophers Files/School (Purdue Global)/2022-10-19 Semester/IN498 - Capstone/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

header = None
new_file = []
for f in (files):
    with open(f, newline='') as csv_file:
        reader = csv.reader(csv_file)
        if not header:
            new_file.append(next(reader))
            header = True
        else:
            next(reader)
        for row in reader:
            new_file.append(row)

with open('CombinedCSV.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(new_file)

Error/traceback:

C:\Users\KDPen\anaconda3\python.exe "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py" 
Traceback (most recent call last):
  File "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py", line 20, in <module>
    next(reader)
_csv.Error: line contains NUL

Process finished with exit code 1


Source: https://stackoverflow.com/questions/74230519/python-panda-read-csv-not-separating-columns-appropriately

1 Answer


yrdbyhpb1#

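This happens because some of the files are encoded in UTF-16: the byte 0xff reported at position 0 in your traceback is the first byte of a UTF-16 little-endian byte order mark (BOM). You can detect each file's encoding from its BOM and pass that encoding when calling pandas.read_csv().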

import codecs
import pandas as pd

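def find_encoding(path):
    # Read the first bytes of the file, where a byte order mark (BOM) would be.
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    for bom, encoding in (
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ):
        if head.startswith(bom):
            return encoding
    # No BOM found; None lets the caller fall back to its default encoding.
    return None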
...
def read_csv(path):
    return pd.read_csv(path, encoding=find_encoding(path), header=0)
df = pd.concat(map(read_csv, files), ignore_index=True)
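The symptoms in the question can be reproduced on a small scale. The following is a minimal sketch, assuming a made-up three-column sample rather than the asker's actual files. It writes the sample as UTF-16, shows that the raw bytes start with a BOM and cannot be decoded as UTF-8, shows that decoding them as cp1252 leaves NUL characters in the text (which is consistent with the "line contains NUL" error from the csv module), and finally reads the file correctly with encoding="utf-16".

```python
import codecs
import os
import tempfile

import pandas as pd

# Hypothetical sample data; it only stands in for the real files.
sample = "Date,Country,Installers\n2019-04-01,US,10\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_utf16.csv")
    # Python's 'utf-16' codec writes a byte order mark (BOM) first.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    with open(path, "rb") as f:
        raw = f.read()

    # The file starts with a UTF-16 BOM (0xff 0xfe on little-endian machines).
    starts_with_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

    # Neither BOM byte is a valid first byte in UTF-8, so decoding fails.
    try:
        raw.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # cp1252 decodes without error, but every ASCII character is
    # accompanied by a NUL character, so the text is unusable as CSV.
    has_nul = "\x00" in raw.decode("cp1252")

    # With the right encoding, the columns are separated as expected.
    df = pd.read_csv(path, encoding="utf-16")

print(starts_with_bom, utf8_ok, has_nul)
print(df)
```

This is why forcing encoding='cp1252' silences the UnicodeDecodeError but does not produce usable columns.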

Your files also contain an extra double quote at the start and end of each line. To handle them, use the following instead of the read_csv() above.

import io

def read_csv(path):
    lines = []
    with open(path, 'rt', encoding=find_encoding(path)) as f:
        for line in f:
            # Drop the trailing newline, then the first and last character (the extra quotes).
            lines.append(line.rstrip()[1:-1])
    return pd.read_csv(io.StringIO('\n'.join(lines)), header=0)
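If some files really have no header row, as the question describes, the same idea can be combined with header=None and names=. The sketch below is one possible way to put the pieces together, not a tested solution for the asker's data: COLUMNS is a shortened, hypothetical stand-in for headerslist, detect_encoding and read_headerless_csv are illustrative helper names, and the sample files are invented. It also assumes that every line is wrapped in exactly one extra pair of double quotes; files whose first and last fields are legitimately quoted would need different handling.

```python
import codecs
import io
import os
import tempfile

import pandas as pd

# Hypothetical column names, standing in for the asker's longer headerslist.
COLUMNS = ["Date", "Country", "Installers"]


def detect_encoding(path):
    """Return a codec name based on the file's BOM, or None if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)
    # UTF-32 is tested before UTF-16 because their little-endian BOMs overlap.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if head.startswith(bom):
            return encoding
    return None


def read_headerless_csv(path, names):
    """Strip one wrapping pair of quotes per line, then parse with given names."""
    lines = []
    with open(path, "rt", encoding=detect_encoding(path)) as f:
        for line in f:
            line = line.rstrip()
            # Assumption: a line that starts and ends with '"' is wrapped as a whole.
            if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
                line = line[1:-1]
            lines.append(line)
    return pd.read_csv(io.StringIO("\n".join(lines)), header=None, names=names)


with tempfile.TemporaryDirectory() as tmp:
    # Invented sample files: one UTF-16 with a BOM, one plain UTF-8.
    samples = {
        "a.csv": ("utf-16", '"2019-04-01,US,10"\n"2019-04-02,CA,7"\n'),
        "b.csv": ("utf-8", '"2019-05-01,US,12"\n'),
    }
    paths = []
    for name, (enc, text) in samples.items():
        p = os.path.join(tmp, name)
        with open(p, "w", encoding=enc, newline="") as f:
            f.write(text)
        paths.append(p)

    frames = [read_headerless_csv(p, COLUMNS) for p in paths]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

If the files do contain a header row, keeping header=0 as in the answer and dropping names= is the simpler choice.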

2022-12-06
