Trying to merge CSV files, but getting an encoding error

Asked by g52tjvyc on 2023-05-04

I have about 240 CSV files, some of them very large. They all contain similar information; basically, I'm just trying to merge them together, but I get an encoding error when reading the files' contents.
Here is the error message I get:

Traceback (most recent call last):
  File "C:\Users\ethan\Desktop\filemerger.py", line 23, in <module>
    for row in reader:
_csv.Error: line contains NUL

Process finished with exit code 1

Here is my code; any help would be greatly appreciated:

import os
import csv
import chardet

directory_path = r"A:\FilesMerge"

header_dict = {}

data_rows = []

for filename in os.listdir(directory_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(directory_path, filename)
        with open(file_path, 'r', errors="ignore") as csvfile:
            reader = csv.reader(csvfile)
            headers = next(reader)
            for header in headers:
                if header not in header_dict.keys():
                    header_dict[header] = len(header_dict)
                else:
                    header_dict[header] = min(header_dict[header], len(header_dict)-1)
            data_rows.extend([[],[],[],[filename]])
            for row in reader:
                if len(row) == len(headers):
                    data_rows.append(row)
                else:
                    new_row = [''] * len(headers)
                    new_row[:len(row)] = row
                    data_rows.append(new_row)

sorted_headers = sorted(header_dict.keys(), key=lambda x: header_dict[x])

with open(os.path.join(directory_path, "merged_headers_and_data.csv"), 'rb') as f:
    result = chardet.detect(f.read())

with open(os.path.join(directory_path, "merged_headers_and_data.csv"), 'w', newline='', encoding=result['encoding'], errors='ignore') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(sorted_headers)
    writer.writerows(data_rows)

Answer by qgzx9mmu:

This error was recently removed from the csv module in Python 3.11.0+. You can see the long discussion of why in Issue 27580, and the specific error message being removed in the patch.
If you can, upgrade to Python 3.11.0+ and move on: you can leave the files as they are and let the reader deal with them on its own.
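A minimal sketch of what that looks like on 3.11+, using one of the mocked-up files from later in this answer. Note the reader no longer raises on NUL bytes, but they do survive into the parsed field values, so you may still want to strip them:

import csv
import sys

# The "line contains NUL" error only exists before 3.11; guard on the version.
assert sys.version_info >= (3, 11), "needs Python 3.11+ for NUL-tolerant csv"

with open("CSV_files_working/input2.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        # The reader keeps NUL bytes inside field values; strip them here.
        print([field.replace("\x00", "") for field in row])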
If you can't upgrade your Python, I believe your only way forward is to clean up the stray NULL bytes.

Finding and cleaning NULL bytes in the CSVs

If I were in your position, I think I'd first try to characterize the nature of the problem: how many files have how many lines containing NULL bytes, and are the files even all UTF-8?
I mocked up a few CSVs in a CSV_files directory:

input1.csv
input2.csv
input3.csv
input4.csv

But before doing anything, I'd copy the whole directory to a "working" directory and do everything else in there, keeping the originals safe to fall back on if anything goes wrong.
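That copy step can be scripted too. A minimal sketch, assuming the directory names used in the rest of this answer:

import shutil

# Copy the originals into a working directory; every cleanup step below
# happens in the copy, so the source files stay untouched.
# copytree() requires that the destination not exist yet.
shutil.copytree("CSV_files", "CSV_files_working")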
The following script tries to read every CSV file in that directory as plain-text UTF-8 and reports some statistics on NULL bytes, along with any errors. It uses the csv module, but only to write the report; it doesn't try to decode the CSVs:

import csv
import glob
import sys

NULL_BYTE = "\x00"

writer = csv.writer(sys.stdout)
writer.writerow(["File", "Lines w/NULL", "NULL count", "Error"])

filenames = glob.glob(r"CSV_files_working/*.csv")

for fname in sorted(filenames):
    line_ct = 0
    nulls_ct = 0
    error = ""

    try:
        with open(fname, encoding="utf-8") as f_in:
            for i, line in enumerate(f_in, start=1):
                if NULL_BYTE in line:
                    line_ct += 1
                    nulls_ct += line.count(NULL_BYTE)

    except Exception as e:
        error = str(e)

    writer.writerow([fname, line_ct, nulls_ct, error])

I'd put the output CSV report into a spreadsheet to see what I was dealing with:

| File | Lines w/NULL | NULL count | Error |
|------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0 | 0 | |
| CSV_files_working/input2.csv | 3 | 6 | |
| CSV_files_working/input3.csv | 3 | 3 | |
| CSV_files_working/input4.csv | 0 | 0 | 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte |
Then I'd open any non-UTF-8 files in a text editor or spreadsheet and save them as UTF-8. When I opened input4.csv in VSCode, it (correctly) guessed the encoding was UTF-16, and I could save the file back as UTF-8 from VSCode. If I had a lot of files and could guess their current encodings, I'd run a script to convert the non-UTF-8 files to UTF-8:

enc_filenames = [
    ("utf-16", "CSV_files_working/input4.csv"),
]

for input_enc, fname in enc_filenames:
    with open(fname, encoding=input_enc) as f_in:
        data = f_in.read()

    with open(fname, "w", encoding="utf-8") as f_out:
        f_out.write(data)

Running the report again (python3 find-NULs.py > report.csv):

| File | Lines w/NULL | NULL count | Error |
|------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0 | 0 | |
| CSV_files_working/input2.csv | 3 | 6 | |
| CSV_files_working/input3.csv | 3 | 3 | |
| CSV_files_working/input4.csv | 1 | 1 | |
Then I'd probably want to see what the NULLs actually look like... should they be converted to some other character, or can they just be removed?

NULL_BYTE = "\x00"

filenames = [
    "CSV_files_working/input2.csv",
    "CSV_files_working/input3.csv",
    "CSV_files_working/input4.csv",
]

for fname in filenames:
    with open(fname, encoding="utf-8") as f_in:
        print(f"{fname}:")
        for i, line in enumerate(f_in, start=1):
            if NULL_BYTE in line:
                print(f"  line {i:>04}: {repr(line)}")

The output:

CSV_files_working/input2.csv:
  line 0002: 'f2r1c1\x00,f2r1c2\x00\n'
  line 0003: 'f2r2c1\x00,f2r2c2\x00\n'
  line 0004: 'f2r3c1\x00,f2r3c2\x00\n'
CSV_files_working/input3.csv:
  line 0002: 'f3r1c1,f3r1c2\x00\n'
  line 0003: 'f3r2c1,f3r2c2\x00\n'
  line 0004: 'f3r3c1,f3r3c2\x00\n'
CSV_files_working/input4.csv:
  line 0004: 'f4r3c1,f4r3c2\x00\n'

All the NULLs in my data sit at the ends of fields (not between data inside a field), so I can safely delete them:

NULL_BYTE = "\x00"

filenames = [
    "CSV_files_working/input2.csv",
    "CSV_files_working/input3.csv",
    "CSV_files_working/input4.csv",
]

for fname in filenames:
    with open(fname, encoding="utf-8") as f_in:
        data = [line.replace(NULL_BYTE, "") for line in f_in]

    with open(fname, "w", encoding="utf-8") as f_out:
        f_out.writelines(data)

    print(f"cleaned {fname}")

The output:

cleaned CSV_files_working/input2.csv
cleaned CSV_files_working/input3.csv
cleaned CSV_files_working/input4.csv

Running the report one more time shows that all the files are clean:

| File | Lines w/NULL | NULL count | Error |
|------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0 | 0 | |
| CSV_files_working/input2.csv | 0 | 0 | |
| CSV_files_working/input3.csv | 0 | 0 | |
| CSV_files_working/input4.csv | 0 | 0 | |
Your data may not allow simply removing the NULLs; you may need to convert each one to a space (line.replace(NULL_BYTE, " ")) and then deal with the extra whitespace after the CSV file has been decoded, as in the sketch below.
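A minimal sketch of that variant, assuming the same mocked-up files as above: the cleanup pass turns each NUL into a space, and a second pass strips the leftover whitespace from every field once the CSV has been decoded:

import csv

NULL_BYTE = "\x00"

fname = "CSV_files_working/input2.csv"

# Pass 1: replace NULs with spaces instead of deleting them outright.
with open(fname, encoding="utf-8") as f_in:
    data = [line.replace(NULL_BYTE, " ") for line in f_in]

with open(fname, "w", encoding="utf-8") as f_out:
    f_out.writelines(data)

# Pass 2: decode the cleaned CSV, then strip the extra whitespace per field.
with open(fname, encoding="utf-8", newline="") as f_in:
    rows = [[field.strip() for field in row] for row in csv.reader(f_in)]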

Merging the CSVs

Finally, I'd like to recommend a different approach to the bigger task of merging different CSVs with different headers. Rather than managing the header differences yourself, I think you can do everything you want with the functionality DictReader, and especially DictWriter, already have.
My mocked-up CSV files have the following headers:

input1.csv: Col__1,Col__2
input2.csv: Col__3,Col__2
input3.csv: Col__4,Col__7
input4.csv: Col__1,Col9

I want to merge all four files, and I expect the final header to be:

Col9,Col__1,Col__2,Col__3,Col__4,Col__7

I can get that final output by giving DictWriter all the field names and telling it to insert an empty string ("") for every field a row is missing, with restval="":

import csv
import glob

filenames = glob.glob(r"CSV_files_working/*.csv")

fieldnames: set[str] = set()
all_rows: list[dict[str, str]] = []

for fname in sorted(filenames):
    with open(fname, encoding="utf-8", newline="") as f_in:
        reader = csv.DictReader(f_in)

        if reader.fieldnames is None:
            print(f"{fname} doesn't have fieldnames!")
            exit(1)

        fieldnames.update(reader.fieldnames)

        all_rows.extend(reader)

# Sort first by len of the fieldname, then sort lexically
final_fieldnames = sorted(fieldnames, key=lambda x: (len(x), x))

with open("output.csv", "w", encoding="utf-8", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=final_fieldnames, restval="")
    writer.writeheader()
    writer.writerows(all_rows)

I get:

| Col9   | Col__1 | Col__2 | Col__3 | Col__4 | Col__7 |
|--------|--------|--------|--------|--------|--------|
|        | f1r1c1 | f1r1c2 |        |        |        |
|        | f1r2c1 | f1r2c2 |        |        |        |
|        | f1r3c1 | f1r3c2 |        |        |        |
|        |        | f2r1c2 | f2r1c1 |        |        |
|        |        | f2r2c2 | f2r2c1 |        |        |
|        |        | f2r3c2 | f2r3c1 |        |        |
|        |        |        |        | f3r1c1 | f3r1c2 |
|        |        |        |        | f3r2c1 | f3r2c2 |
|        |        |        |        | f3r3c1 | f3r3c2 |
| f4r1c2 | f4r1c1 |        |        |        |        |
| f4r2c2 | f4r2c1 |        |        |        |        |
| f4r3c2 | f4r3c1 |        |        |        |        |
