How do I convert text to CSV when fields contain the comma delimiter and double quotes? [closed]

Posted by ar5n3qh5 on 2023-06-27 in Other

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed 2 days ago.

This is my current string value:

"""
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
"""

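I want the output to look like the table below, and eventually save it as a CSV:
| name | model | os |
| ------ | ------ | ------ |
| A | I PAD (10.0"", 2020, Wi-Fi) | OS_A |
I'm stuck because the string in the model field contains both commas and double quotes. My current idea is to use a regex to clean up any problematic text, but is there another solution?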

g52tjvyc1#

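Assuming the input data is consistently formatted, we can use some creative iterable unpacking to tolerate the commas in the middle column. As long as the outer columns contain no commas, we can then write the CSV with pandas DataFrame.to_csv():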

import pandas as pd

input_string = '''
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
    |B,"I PAD (10.0"", 2020, Wi-Fi)",OS_B
'''

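# Split each row on commas after removing the indentation and the '|' margin.
lines = [line.strip().strip('|').split(',') for line in input_string.strip().split('\n')]

# The first and last fields are name and os; everything in between belongs to model.
(name, *model, os) = [field.strip() for field in lines[0]]
header = (name, ','.join(model), os)

# Rejoin the model pieces and drop the enclosing double quotes.
lines = [(name, ','.join(model).strip('"'), os) for (name, *model, os) in lines[1:]]
pd.DataFrame(lines, columns=header).to_csv('data.csv', index=False)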

Output DataFrame:

name    model   os
0   A   I PAD (10.0"", 2020, Wi-Fi) OS_A
1   B   I PAD (10.0"", 2020, Wi-Fi) OS_B
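
To see what the starred unpacking does with a single row, and where its assumption breaks, here is a minimal sketch using only the standard library. The helper name split_row and the second sample row are made up for illustration and are not part of the original answer.

```python
def split_row(line):
    """Split one margin-prefixed row into (name, model, os).

    Assumes the first and last fields contain no commas; every piece in
    between is glued back together as the model field.
    """
    name, *model, os_name = line.strip().lstrip('|').split(',')
    return name, ','.join(model).strip('"'), os_name


good = split_row('    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A')
print(good)

# Hypothetical row whose last column also contains a comma: the unpacking
# still succeeds, but the fields are cut at the wrong place.
bad = split_row('    |A,"I PAD (10.0"", 2020, Wi-Fi)","OS_A, beta"')
print(bad)
```

The second call does not raise an error; it silently moves part of the last column into the model field, which is why this approach only fits input where the outer columns are known to be comma-free.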

csv.reader with quotechar seems quite robust, and the code reads nicely too:

import csv

import pandas as pd

string = \
"""
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
    |B,"I PAD (10.0"", 2020, Wi-Fi)",OS_B
    |C,"I PAD (10.0"", 2020, Wi-Fi)",OS_C
    |D,"I PAD (10.0"", 2020, Wi-Fi)",OS_D
"""

reader = csv.reader([line.lstrip(' |\t') for line in string.splitlines()], quotechar='"')
header = None
while not header:
    header = next(reader)
pd.DataFrame(reader, columns=header).to_csv('name.csv',index=False)

However, this does alter the "" characters in the output, because each doubled quote is read as a single ":

name    model   os
0   A   I PAD (10.0", 2020, Wi-Fi)  OS_A
1   B   I PAD (10.0", 2020, Wi-Fi)  OS_B
2   C   I PAD (10.0", 2020, Wi-Fi)  OS_C
3   D   I PAD (10.0", 2020, Wi-Fi)  OS_D
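
The single quote in this output is expected: in CSV, a doubled "" inside a quoted field is the escape for one literal ". The sketch below, standard library only, parses one row from memory and writes it back to show the round trip. The variable names are illustrative.

```python
import csv
import io

raw = 'A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A'

# Parsing: the doubled quote is unescaped to a single one.
row = next(csv.reader([raw], quotechar='"'))
print(row)

# Writing: csv.writer quotes the field again and re-doubles the quote.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(row)
print(buffer.getvalue())
```

In other words, the file written by a CSV writer still contains "" on disk; only the parsed in-memory value holds a single quote.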

ttvkxqim2#

https://onlinegdb.com/cslea1uYz

import csv

string = \
"""
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
    |B,"I PAD (10.0"", 2020, Wi-Fi)",OS_B
    |C,"I PAD (10.0"", 2020, Wi-Fi)",OS_C
    |D,"I PAD (10.0"", 2020, Wi-Fi)",OS_D
"""

reader = csv.reader(string.splitlines(), quotechar='"')
with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file, quotechar='"')
    for row in reader:
        if not row: continue
        row = (i.strip(' |') for i in row)
        writer.writerow(row)

neskvpey3#

If I understood your question correctly, this should get you what you want.

import pandas as pd
from io import StringIO
import re

string = '''
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
    |B,"I PAD (10.0"", 2020, Wi-Fi)",OS_B
    |C,"I            PAD (10.0"", 2020, Wi-Fi)", OS_C
'''

string = re.sub("[|]", "", string)
string = re.sub(" ", "", string)

df = pd.read_csv(StringIO(string))
print(df)

Here is the output:

name                   model    os
0    A  IPAD(10.0",2020,Wi-Fi)  OS_A
1    B  IPAD(10.0",2020,Wi-Fi)  OS_B
2    C  IPAD(10.0",2020,Wi-Fi)  OS_C

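This assumes the input is consistently formatted, so if your input strings vary, you may need to add some extra handling.
If you want to keep a single space instead of removing every space, use string = re.sub(" +", " ", string)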
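
If the spaces inside the model field matter, one alternative is to strip only the leading margin of each line and let a CSV parser handle the rest. The following is a sketch under that assumption, using the standard csv module rather than pandas; the function name parse_margin_text is hypothetical.

```python
import csv
import re


def parse_margin_text(text):
    """Parse '|'-margined CSV text into a list of rows."""
    # Remove only the indentation and the leading '|' of every line.
    cleaned = re.sub(r'^[ \t]*\|', '', text, flags=re.MULTILINE)
    lines = [line for line in cleaned.splitlines() if line.strip()]
    # skipinitialspace drops the blank that follows a delimiter ("name, model").
    return list(csv.reader(lines, skipinitialspace=True))


sample = '''
    |name, model, os
    |A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A
    |C,"I            PAD (10.0"", 2020, Wi-Fi)", OS_C
'''
rows = parse_margin_text(sample)
for row in rows:
    print(row)
```

Spaces inside the quoted field are left untouched here, including the long run in row C, so any normalization of those can be applied afterwards to the model column only.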

sg24os4d4#

Below is a solution using PySpark. I ran this example with Spark 3.4 and Python 3.11.
Create an input.csv file with the following content:
name, model, os
A,"I PAD (10.0"", 2020, Wi-Fi)",OS_A

PySpark code:

import pyspark
from pyspark.sql import SparkSession

# Create SparkSession
spark=SparkSession.builder.getOrCreate()

file_df=spark.read.csv("input.csv",header=True,quote='\"',escape='\"')

file_df.show(truncate=False)
#+----+--------------------------+----+
#|name| model                    | os |
#+----+--------------------------+----+
#|A   |I PAD (10.0", 2020, Wi-Fi)|OS_A|
#+----+--------------------------+----+
