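For one of my courses, we need to load and append 6 CSV files using Python. The files do not contain headers. I have spent hours trying multiple approaches from various online guides, either combining the CSV files directly or reading them individually and then appending them. For what should be a simple task, I have run into many problems. When I tried an approach that joins the files right out of the gate, I got numerous error messages in response.
To confirm, I need to perform the following initial steps:
- Load the CSV files from a local directory into a DataFrame
- Add the provided headers, which are missing from the CSVs
- Append the 6 CSV files together into one consolidated dataset
(Not necessarily in this order.)
The data files are located here, for reproduction: https://drive.google.com/drive/folders/1ZKBFbsUBNUhsWtVtsMqOtXKx4SL-pFnt?usp=sharing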
These are the files we are using: [image]
I tried using the following script to append all of the CSVs together from the start, but received a large number of errors.
import pandas as pd
import glob
import os
# setting the path for joining multiple files
files = os.path.join("D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/", "*.csv")
# list of merged files returned
files = glob.glob(files)
print(files)
# joining files with concat and read_csv
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
print(df)
The script runs as far as the files = glob.glob(files) part, since I can print the result.
Traceback (most recent call last):
  File "D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/IN498_M2_2.py", line 14, in <module>
    df = pd.concat(map(pd.read_csv, files), ignore_index=True)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 347, in concat
    op = _Concatenator(
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 401, in __init__
    objs = list(objs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\KDPen\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas\_libs\parsers.pyx", line 544, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas\_libs\parsers.pyx", line 633, in pandas._libs.parsers.TextReader._get_header
  File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1952, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
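A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), which UTF-8 cannot decode. The sketch below is one way to check that by looking at the raw leading bytes of a file; `leading_bytes` and `starts_with_utf16_bom` are hypothetical helper names, not pandas functions, and nothing is assumed about the files beyond what the traceback shows.

```python
import codecs


def leading_bytes(path, count=4):
    """Return the first few raw bytes of a file without decoding them."""
    with open(path, "rb") as fh:
        return fh.read(count)


def starts_with_utf16_bom(path):
    """True when the file begins with a UTF-16 byte order mark (FF FE or FE FF)."""
    return leading_bytes(path, 2) in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Looping over the `files` list from the script above and printing `leading_bytes(f).hex()` for each entry would show which files, if any, start with `fffe`.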
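I thought the errors might be related to the layout or structure of the data in the CSV files, so I tried starting with simply loading a single CSV file into a pandas DataFrame and adding the headers to it. However, when I do this, the data seems to be recognized as a single column, as if the columns were not split on a delimiter, even though the file is comma separated. This should be natively readable in pandas. So I figured the problem might be the missing headers, or missing values in the dataset, or something like that, but I can't tell what is causing it. I have tried multiple ways to fix this, all to no avail. I tried various parameters of the read_csv function, including names=headerslist, encoding, header=None, keep_default_na=False, sep=',', skiprows=[0], and a few others.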
import pandas as pd
import glob
import os
import csv
headerslist = ['Date','Package_Name','Country','Store_Listing_Visitors','Installers','Visitor-to-Installer_conversion_rate','Installers_retained_for_1_day','Installer-to-1_day_retention_rate','Installers_retained_for_7_days','Installer-to-7_days_retention_rate','Installers_retained_for_15_days','Installer-to-15_days_retention_rate','Installers_retained_for_30_days','Installer-to-30_days_retention_rate']
df = pd.read_csv('D:/User Data/Dropbox/2022-10-19 Semester/StudentFiles/StudentDataFiles/Data Files/retained_installers_com.foo.bar_201904_country.csv', keep_default_na=False, sep=',', skiprows=[0], delimiter=None, header=None, encoding='cp1252')
df2 = headerslist.append(df)
print(df)
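One thing that would produce this single-column symptom is each line being wrapped in an extra pair of double quotes, which is what the answer further down reports for these files. A minimal illustration using the standard csv module; the sample line is made up for the example and is not taken from the real data.

```python
import csv
import io

# Illustrative record: the whole line sits inside one pair of double quotes.
wrapped = '"2019-04-01,com.foo.bar,US,120"\n'

# A CSV parser treats the quoted text as a single field, so only one column appears.
as_is = next(csv.reader(io.StringIO(wrapped)))

# Stripping the outer quotes first lets the commas act as delimiters again.
unwrapped = wrapped.strip().strip('"')
fixed = next(csv.reader(io.StringIO(unwrapped)))

print(len(as_is), len(fixed))  # prints: 1 4
```

Stripping quotes this way assumes no field legitimately begins or ends with a double quote; data with quoted fields would need a more careful approach.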
I tried Adrian's solution, but it returned a strange error.
import csv
import glob
import os
files = os.path.join("D:/User Data/Dropbox/Kristophers Files/School (Purdue Global)/2022-10-19 Semester/IN498 - Capstone/StudentFiles/StudentDataFiles/Data Files/", "*.csv")
# list of merged files returned
files = glob.glob(files)
header = None
new_file = []
for f in (files):
    with open(f, newline='') as csv_file:
        reader = csv.reader(csv_file)
        if not header:
            new_file.append(next(reader))
            header = True
        else:
            next(reader)
        for row in reader:
            new_file.append(row)
with open('CombinedCSV.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerows(new_file)
Error/traceback:
C:\Users\KDPen\anaconda3\python.exe "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py"
Traceback (most recent call last):
  File "D:\User Data\Dropbox\2022-10-19 Semester\IN498_M2_3.py", line 20, in <module>
    next(reader)
_csv.Error: line contains NUL
Process finished with exit code 1
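The NUL complaint fits the same explanation as the earlier UnicodeDecodeError: when UTF-16 bytes are decoded with a single-byte codec (cp1252 is a common default on Windows), every ASCII character is followed by a NUL character. A small sketch with made-up content, not the real data:

```python
import csv
import io

# Illustrative content encoded as UTF-16 (with a byte order mark).
raw = "a,b\r\n1,2\r\n".encode("utf-16")

# Decoding with a single-byte codec leaves a NUL after each ASCII character.
wrong = raw.decode("cp1252")
print("\x00" in wrong)

# Decoding with the matching codec gives clean text that csv.reader can parse.
right = raw.decode("utf-16")
rows = list(csv.reader(io.StringIO(right, newline="")))
print(rows)
```

If that is what is happening here, passing `encoding='utf-16'` to `open()` for the affected files would be the equivalent change in the script above.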
1 Answer
yrdbyhpb #1
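This happens because some of the files are encoded in UTF-16. You can detect each file's encoding and pass it to pandas.read_csv() through the encoding parameter. Your files also contain extra double quotes at the start and end of each line, so they need handling beyond the plain read_csv() call above.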
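A rough sketch of the two fixes described above, not the answerer's own code. It uses only the standard library; `guess_encoding`, `read_rows`, and `combine` are hypothetical helper names, the cp1252 fallback is an assumption borrowed from the question's own attempt, and the quote stripping assumes no field contains embedded newlines or legitimately starts or ends with a double quote.

```python
import codecs
import csv


def guess_encoding(path):
    """Pick UTF-16 when the file starts with a UTF-16 byte order mark, else cp1252 (assumed fallback)."""
    with open(path, "rb") as fh:
        head = fh.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return "cp1252"


def read_rows(path):
    """Return the file's records as lists of strings, with any outer double quotes removed."""
    rows = []
    with open(path, encoding=guess_encoding(path), newline="") as fh:
        for line in fh:
            # Drop the line ending, then the extra quotes wrapping the whole record.
            line = line.strip().strip('"')
            if line:
                rows.extend(csv.reader([line]))
    return rows


def combine(paths, skip_first_row=False):
    """Concatenate the rows of every file, optionally dropping each file's first row."""
    combined = []
    for path in paths:
        rows = read_rows(path)
        combined.extend(rows[1:] if skip_first_row else rows)
    return combined
```

The combined rows could then be handed to pandas, for example `pd.DataFrame(combine(files), columns=headerslist)` with the `files` and `headerslist` names from the question. Every value is still a string at that point, so numeric columns would need converting afterwards.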