Python pandas read_csv not separating columns properly

mnemlml8  posted on 2022-12-06  in  Python

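For one of my courses, we need to load and append 6 CSV files using Python. The files do not contain headers. I have spent several hours trying multiple approaches from different online guides, either combining the CSV files directly or reading them individually and then appending them. For what should be a simple task, I have run into many problems. When I tried an approach that joins the files right out of the gate, I got numerous error messages in response.
To confirm, I need to perform the following initial steps: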

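  • Load the CSV files from a local directory into a DataFrame
  • Add the provided headers that are missing from the CSVs
  • Append the 6 CSV files together into one consolidated dataset

(Not necessarily in this order.)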

The data files are located here, for reproduction: https://drive.google.com/drive/folders/1ZKBFbsUBNUhsWtVtsMqOtXKx4SL-pFnt?usp=sharing
Below are the files we are using (image not preserved).

I tried the following script to append all the CSVs together from the start, but received a long list of errors:

import pandas as pd
import glob
import os

# setting the path for joining multiple files
files = os.path.join("D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

print(files);

# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
print(df)

The script runs up to the files = glob.glob(files) line, since I can print the result. The concat step then fails:

Traceback (most recent call last):
  File "D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/IN498_M2_2.py", line 14, in <module>
    df = pd.concat(map(pd.read_csv, files), ignore_index=True)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 347, in concat
    op = _Concatenator(
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 401, in __init__
    objs = list(objs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

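I thought the errors might be related to the data layout or structure of the CSV files, so I tried starting with simply loading a single CSV file into a pandas DataFrame and adding the headers to it. However, when I do that, the data appears to be recognized as a single column, as if the columns were not separated by a delimiter, even though the file is comma-separated. This should be natively readable in pandas. So I suspect the problem may be the missing headers, or missing values in the dataset, or something similar, but I cannot tell what is causing it. I have tried multiple ways to fix this, to no avail. I tried various read_csv parameters, including names=headerslist, encoding, header=None, keep_default_na=False, sep=',', skiprows=[0], and a few others.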

import pandas as pd
import glob
import os
import csv

headerslist = ['Date','Package_Name','Country','Store_Listing_Visitors','Installers','Visitor-to-Installer_conversion_rate','Installers_retained_for_1_day','Installer-to-1_day_retention_rate','Installers_retained_for_7_days','Installer-to-7_days_retention_rate','Installers_retained_for_15_days','Installer-to-15_days_retention_rate','Installers_retained_for_30_days','Installer-to-30_days_retention_rate']

df = pd.read_csv('D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/retained_installers_com.foo.bar_201904_country.csv', keep_default_na=False, sep=',',  skiprows=[0], delimiter=None,  header=None, encoding='cp1252')

df2 = headerslist.append(df)

print(df)

I tried Adrian's solution, but it returned a strange error:

import csv
import glob
import os

files = os.path.join("D:/User Data/Dropbox/Kristophers Files/School (Purdue Global)/2022-10-19 Semester/IN498 - Capstone/StudentFiles/StudentDataFiles/Data Files/", "*.csv")

# list of merged files returned
files = glob.glob(files)

header = None
new_file = []
for f in (files):
    with open(f, newline='') as csv_file:
        reader = csv.reader(csv_file)
        if not header:
            new_file.append(next(reader))
            header = True
        else:
            next(reader)
        for row in reader:
            new_file.append(row)

with open('CombinedCSV.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(new_file)

Error/traceback:

C:\Users\KDPen\anaconda3\python.exe "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py" 
Traceback (most recent call last):
  File "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py", line 20, in <module>
    next(reader)
_csv.Error: line contains NUL

Process finished with exit code 1

yrdbyhpb1#

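This happens because some of the files are encoded in UTF-16. You can detect each file's encoding from its byte-order mark (BOM) and pass that encoding to pandas.read_csv().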

import codecs
import pandas as pd

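def find_encoding(path):
    # Read just enough bytes to cover the longest BOM (4 bytes for UTF-32).
    with open(path, 'rb') as f:
        head = f.read(4)
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    for bom, encoding in (
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    ):
        if head.startswith(bom):
            return encoding
    # No BOM found: the caller falls back to its default encoding.
    return None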
...
def read_csv(path):
    return pd.read_csv(path, encoding=find_encoding(path), header=0)
df = pd.concat(map(read_csv, files), ignore_index=True)
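
Both errors in the question are consistent with UTF-16 input: byte 0xff at position 0 is the start of a UTF-16 little-endian BOM, and ASCII text stored as UTF-16 contains a NUL byte in every character, which is what the csv module complains about. The following is a small sketch that illustrates this on a made-up sample row (the row content is an assumption, not taken from the actual files):

```python
import codecs

# Made-up sample row, used only for illustration.
sample = "2019-04-01,com.foo.bar,US\n"

# Encoding with 'utf-16' prepends a BOM in the machine's native byte order.
raw = sample.encode("utf-16")

# The first two bytes are a UTF-16 BOM (FF FE or FE FF).
has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Each ASCII character occupies two bytes, one of which is NUL.
contains_nul = b"\x00" in raw


def decodes_as_utf8(data):
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Neither 0xff nor 0xfe is a valid UTF-8 start byte, so this is False.
utf8_ok = decodes_as_utf8(raw)

# Decoding with the right codec consumes the BOM and restores the text.
roundtrip = raw.decode("utf-16")

print(has_bom, contains_nul, utf8_ok, roundtrip == sample)
```

Reading such a file with cp1252 does not raise, but the NUL bytes stay embedded in the text, which may be why the data looked like a single unsplit column.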

Your files also contain an extra double quote at the start and end of each line. To handle that, use the following in place of the read_csv() above:

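import io

def read_csv(path):
    lines = []
    with open(path, 'rt', encoding=find_encoding(path)) as f:
        for line in f:
            # Drop the trailing newline, then the wrapping pair of double quotes.
            lines.append(line.rstrip()[1:-1])
    # Parse the cleaned text from memory; the first line is used as the header.
    return pd.read_csv(io.StringIO('\n'.join(lines)), header=0)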
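
If the files really have no header row, as the question states, one option is to pass the column names through the names parameter of pandas.read_csv() with header=None. (Calling headerslist.append(df), as in the question, appends the DataFrame to the list and returns None.) Below is a minimal sketch that puts the pieces together. The function names detect_encoding, read_headerless_csv and load_all are hypothetical, COLUMNS is a shortened stand-in for the 14-name headerslist, and the sketch assumes every wrapped line starts and ends with exactly one extra double quote:

```python
import codecs
import io

import pandas as pd

# Hypothetical, shortened column list; the question's headerslist has 14 names.
COLUMNS = ["Date", "Package_Name", "Country"]


def detect_encoding(path, default="utf-8"):
    """Pick an encoding from the file's byte-order mark, else use `default`."""
    with open(path, "rb") as f:
        head = f.read(4)
    # UTF-32 is tested before UTF-16: their little-endian BOMs share a prefix.
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if head.startswith(bom):
            return encoding
    return default


def read_headerless_csv(path, columns):
    """Read one file with no header row, unwrapping quoted lines first."""
    cleaned = []
    with open(path, "rt", encoding=detect_encoding(path)) as f:
        for line in f:
            line = line.rstrip()
            # Assumption: a line wrapped in quotes is one over-quoted record.
            if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
                line = line[1:-1]
            if line:  # skip blank lines
                cleaned.append(line)
    # header=None: no header row in the data; names= supplies the labels.
    return pd.read_csv(io.StringIO("\n".join(cleaned)), header=None, names=columns)


def load_all(paths, columns):
    """Read every file and stack the rows into one DataFrame."""
    frames = [read_headerless_csv(p, columns) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

With the list from the question, this would be called as load_all(files, headerslist). If the files do turn out to have a header row, the header=0 version above is the better fit.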
