Pandas and tab-delimited files with double quotes inside fields


I have data exported from a program in a tab-delimited format, "test.exp":

"HEADER"    "ID"    "Part Reference"    "Value" "Part_Description"
"PARTOCC:17306" "17306" "M1"    "48SL-5S-50-C-LF"   "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat"
"PARTOCC:17310" "17310" "M2"    "48SL-5S-50-C-LF"   "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat"
"PARTOCC:65494" "65494" "J4E"   "311P822-MC-095-BS-D"   "GSFC CPCI J3 RA MALE 95 Position 0.123" tails"

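Note that the program offers no options for the output format, and we have many such files to ingest into a parts database. Note also the 5.00" embedded in the tab-delimited file.
I read the file into a DataFrame and then write it back out: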

import pandas as pd

exp_fn = r"test.exp"
exp_df = pd.read_csv(exp_fn, sep='\t', dtype=str, encoding='mbcs')
exp_df.to_csv(r"check.exp", sep='\t')


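For this test case, we expect check.exp to match test.exp. Using an IDE, you can inspect the DataFrame after the read to see what happened to the 5.00".
I have been unable to find the right syntax, for either reading or writing, that does not corrupt the embedded quotes.
In short, I put this into a field: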

"Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat"


but I get back: "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00, chem film, lock washer and flat"""
I want the output to match the input tab-delimited file, with input and output in the same format.

Answer 1

If you need that sample row to end up looking like this:

['PARTOCC:17306', '17306', 'M1', '48SL-5S-50-C-LF', 'Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat']

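and you know for certain that:

1. every field always has an opening and a closing (enclosing) double quote,
2. no field contains a tab as a value (which would need to be quoted), and
3. no field contains a newline as a value (which would also need to be quoted),

then you can try ignoring quotes during parsing and removing the enclosing quotes manually with some simple string slicing.
The simplest approach looks like: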

import csv

rows: list[list[str]] = []
with open("input.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)

    header = [x[1:-1] for x in next(reader)]  # consume, fix, and save header

    for row in reader:
        rows.append([x[1:-1] for x in row])

print(rows)


I ran it on the multi-line sample from the question and got:

[
    ['PARTOCC:17306', '17306', 'M1', '48SL-5S-50-C-LF', 'Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat'],
    ['PARTOCC:17310', '17310', 'M2', '48SL-5S-50-C-LF', 'Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat'],
    ['PARTOCC:65494', '65494', 'J4E', '311P822-MC-095-BS-D', 'GSFC CPCI J3 RA MALE 95 Position 0.123" tails'],
]
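To see why QUOTE_NONE matters here, the following is a minimal sketch that parses one line twice with the standard csv module: once with default quoting and once with quoting turned off plus manual slicing. The shortened sample line and the variable names are mine, not from the question; the behavior shown is that of Python's non-strict csv parser.

import csv
from io import StringIO

# One line in the export's format: tab-delimited, every field wrapped in
# double quotes, with an unescaped inch mark inside the last field.
line = '"M1"\t"2-56UNC-2B, 5.00", chem film"\n'

# Default quoting: the parser treats the inch mark as the closing quote,
# drops it, and then keeps the real closing quote as ordinary data.
default_row = next(csv.reader(StringIO(line), delimiter="\t"))

# QUOTE_NONE: quotes are plain characters, so only the tabs split the line
# and the enclosing quotes can be sliced off afterwards.
raw_row = next(csv.reader(StringIO(line), delimiter="\t", quoting=csv.QUOTE_NONE))
stripped_row = [field[1:-1] for field in raw_row]

print(default_row)   # ['M1', '2-56UNC-2B, 5.00, chem film"']
print(stripped_row)  # ['M1', '2-56UNC-2B, 5.00", chem film']

The first result has the same shape as the damage described in the question: the inch mark disappears and a stray quote is left at the end of the field.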


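Your requirement to then write the parsed-and-fixed results back out exactly as you received them doesn't make sense to me. If you just need to parse and ingest, stop there. Why try to write a badly encoded TSV file? Perhaps write a final, correctly encoded file instead, to help future programmers/operations: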

with open("output.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(header)
    writer.writerows(rows)

which produces:

HEADER  ID  Part Reference  Value   Part_Description
PARTOCC:17306   17306   M1  48SL-5S-50-C-LF "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00"", chem film, lock washer and flat"
PARTOCC:17310   17310   M2  48SL-5S-50-C-LF "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00"", chem film, lock washer and flat"
PARTOCC:65494   65494   J4E 311P822-MC-095-BS-D "GSFC CPCI J3 RA MALE 95 Position 0.123"" tails"
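As a rough check of why the correctly encoded form is friendlier to downstream code, here is a small sketch that writes two shortened rows with csv.writer and reads them back with a default csv.reader, with no special quoting options. The data is trimmed from the question's sample and the variable names are only for illustration.

import csv
from io import StringIO

header = ["Part Reference", "Part_Description"]
rows = [
    ["M1", '2-56UNC-2B, 5.00", chem film'],
    ["J4E", 'Position 0.123" tails'],
]

# Write with the default QUOTE_MINIMAL dialect: only fields that contain a
# quote get enclosed, and the embedded quote is doubled.
buf = StringIO(newline="")
writer = csv.writer(buf, delimiter="\t")
writer.writerow(header)
writer.writerows(rows)
encoded = buf.getvalue()

# Read it back with a plain reader; no quoting tricks are needed.
read_back = list(csv.reader(StringIO(encoded, newline=""), delimiter="\t"))

print(read_back == [header] + rows)  # True

Any standard CSV/TSV reader should be able to consume a file written this way, which is the point of preferring it over the original encoding when the output format is yours to choose.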

Update: input == output

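Considering your comment about testing input against output: the code below tries to take on the job of parsing the badly encoded TSV. The values codified for parsing are reused to recreate the bad encoding. I believe this program demonstrates that the codified rules for decoding and encoding are correct, and that if the output equals the input, the data was not changed in between: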

import sys

END = "\n"
"""expected line ending"""
DELIM = "\t"
"""expected delimiter"""
QUOT = '"'
"""expected quoting character"""

Row = list[str]

def decode(line: str) -> Row:
    """
    Decodes a badly encoded CSV/TSV line.

    Ensures line ends with END, split line on DELIM, and verify that
    each item (field) from the split begins and ends with QUOT then trim
    those enclosing QUOTs.

    Returns the fields as a new row.
    """
    if not line.endswith(END):
        raise ValueError(f"line doesn't end with {repr(END)}")

    line = line[: -len(END)]
    fields = line.split(DELIM)

    for i, field in enumerate(fields):
        if len(field) < 2 or field[0] != QUOT or field[-1] != QUOT:
            raise ValueError(f"field {i} not enclosed in {repr(QUOT)}: {repr(field)}")

        fields[i] = field[1:-1]

    return fields

def encode(row: Row) -> str:
    """
    Encodes row back to its original, bad encoding.
    """
    line = DELIM.join([f"{QUOT}{field}{QUOT}" for field in row])
    return line + END

def exit_err(msg: str):
    print(msg, file=sys.stderr)
    sys.exit(1)

rows: list[Row] = []
with open("input.txt") as f:
    try:
        header = decode(next(f))

    except ValueError as e:
        exit_err(f"couldn't decode header: {e}")

    for i, line in enumerate(f):
        try:
            rows.append(decode(line))

        except ValueError as e:
            exit_err(f"couldn't decode line {i}: {e}")

with open("output.txt", "w") as f:
    f.write(encode(header))

    for row in rows:
        f.write(encode(row))

I run it like this:

python3 main.py; diff input.txt output.txt


diff prints nothing, which means the output equals the input.
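If diff is not available (the question's encoding='mbcs' suggests Windows), the same check can be sketched in Python with filecmp from the standard library. The helper name same_bytes and the temporary files are only for illustration; the file contents are shortened from the question.

import filecmp
import tempfile
from pathlib import Path

def same_bytes(a, b) -> bool:
    # shallow=False forces a comparison of the file contents, not just
    # of the os.stat() signatures.
    return filecmp.cmp(a, b, shallow=False)

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    good = Path(tmp) / "output.txt"
    bad = Path(tmp) / "check.exp"
    src.write_bytes(b'"M1"\t"5.00", chem film"\n')
    good.write_bytes(b'"M1"\t"5.00", chem film"\n')    # faithful round trip
    bad.write_bytes(b'"M1"\t"5.00, chem film"""\n')    # quotes were rewritten
    identical = same_bytes(src, good)
    differs = not same_bytes(src, bad)

print(identical, differs)  # True True

Since this compares bytes, line endings count too. When the files are opened in text mode without newline="", Python translates line endings on read and on write, so a file with "\r\n" endings may not come back byte-for-byte on every platform; passing newline="" (and an explicit encoding) to both open() calls is one way to avoid that.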

Update: round trip using Andrej's approach

import csv
import re
from io import StringIO
import pandas as pd

with open("input.txt") as f:
    text = f.read()

text = re.sub(r'(\d+\.\d+)"', r'\g<1>\\"', text)
df = pd.read_csv(StringIO(text), quotechar='"', sep="\t", escapechar="\\")

out_f = StringIO()

df.to_csv(
    out_f,
    sep="\t",
    quoting=csv.QUOTE_ALL,
    doublequote=False,
    escapechar="\\",
    index=False,
)

text = out_f.getvalue().replace(r"\"", '"')

with open("output.txt", "w") as f:
    f.write(text)

Answer 2

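Looking at your file, it will be hard to parse without some preprocessing. You can use the re module and add the escape characters manually, for example: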

import re
from io import StringIO

import pandas as pd

text = """\
"HEADER"    "ID"    "Part Reference"    "Value" "Part_Description"
"PARTOCC:17306" "17306" "M1"    "48SL-5S-50-C-LF"   "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat"
"PARTOCC:17310" "17310" "M2"    "48SL-5S-50-C-LF"   "Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat"
"PARTOCC:65494" "65494" "J4E"   "311P822-MC-095-BS-D"   "GSFC CPCI J3 RA MALE 95 Position 0.123" tails"
"""

text = re.sub(r'(\d+\.\d+)"', r'\g<1>\\"', text)

df = pd.read_csv(StringIO(text), quotechar='"', sep=r"\s+", escapechar="\\")
print(df)

This prints:

HEADER     ID Part Reference                Value                                                             Part_Description
0  PARTOCC:17306  17306             M1      48SL-5S-50-C-LF  Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat
1  PARTOCC:17310  17310             M2      48SL-5S-50-C-LF  Series 48SL–5 WEDGE–LOK, 2-56UNC-2B, 5.00", chem film, lock washer and flat
2  PARTOCC:65494  65494            J4E  311P822-MC-095-BS-D                                GSFC CPCI J3 RA MALE 95 Position 0.123" tails
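One caveat worth checking before applying the regex to many files: (\d+\.\d+)" escapes any quote that follows a decimal number, including the closing quote of a field whose whole value is a decimal number, and it misses inch marks after integers. The sketch below compares it with an alternative pattern that escapes only quotes that are not next to a tab or a line boundary. The alternative is my own suggestion rather than part of the approach above, and it assumes a truly tab-delimited file whose fields never contain tabs or newlines.

import re

NUMERIC = re.compile(r'(\d+\.\d+)"')                      # pattern used above
INTERIOR = re.compile(r'(?<=[^\t\r\n])"(?=[^\t\r\n])')   # hypothetical alternative

def escape_numeric(text: str) -> str:
    return NUMERIC.sub(r'\g<1>\\"', text)

def escape_interior(text: str) -> str:
    # A quote counts as interior when it has an ordinary character on both
    # sides, i.e. it neither opens nor closes a field.
    return INTERIOR.sub(r'\\"', text)

sample = '"R1"\t"1.5"\t"5.00", chem film"\n'
numeric_result = escape_numeric(sample)
interior_result = escape_interior(sample)

print(repr(numeric_result))   # '"R1"\t"1.5\\"\t"5.00\\", chem film"\n'
print(repr(interior_result))  # '"R1"\t"1.5"\t"5.00\\", chem film"\n'

In the first result the closing quote of "1.5" has been escaped as well, which would merge that field with the next one when parsed with escapechar="\\". The second pattern leaves enclosing quotes alone, but only under the stated assumptions.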

Answer 3

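Both approaches got me out of a bind. The problem I ran into is that the library calls pd.read_csv() and df.to_csv() alter the embedded double quotes in an otherwise valid tab-delimited file. To work around this, I tried both Zach's and Andrej's approaches.
To write out the DataFrame without library code such as df.to_csv(), which changes how embedded double quotes are written (basically a bad thing): iterate over the DataFrame's rows, flatten each row's values into a list, walk the list to build a string with "\t" as the delimiter, and write that to the file, e.g.: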

row_list = exp_df.loc[i, :].values.flatten().tolist()
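The remaining steps might look roughly like the sketch below. The function name write_quoted_tsv is made up for illustration, and the sketch assumes every value is already a string without the enclosing quotes (for example after stripping them as in Answer 1) and that no value contains a tab or a newline.

from io import StringIO
from typing import Iterable, Sequence, TextIO

def write_quoted_tsv(out: TextIO, header: Sequence[str],
                     rows: Iterable[Sequence[str]]) -> None:
    """Write header and rows tab-delimited, each field wrapped in quotes.

    Embedded quotes are written as-is, with no doubling or escaping,
    which reproduces the export program's format.
    """
    for record in [header, *rows]:
        out.write("\t".join(f'"{value}"' for value in record) + "\n")

buf = StringIO()
write_quoted_tsv(
    buf,
    ["Part Reference", "Part_Description"],
    [["M1", '5.00", chem film']],
)
print(buf.getvalue())

With a DataFrame, the header and rows could presumably come from exp_df.columns and exp_df.itertuples(index=False, name=None); that assumes the frame was read with dtype=str and has no missing values. When writing to a real file, opening it with newline="" keeps the "\n" from being translated on Windows.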

Thanks for the replies.