pandas 如何使用JSON列拆分csv数据

7z5jn7bk  于 2023-09-29  发布在  其他
关注(0)|答案(2)|浏览(119)

我的数据(.csv)看起来像这样:

col1, col2, col3, col4, col5, col6
XXX, BBB, "[{\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}, {\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}]", DDD, EEE, GGG
XXX, BBB, "[{\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}]", DDD, , GGG

当我使用pd.read_csv()时,结果将是:
result from my attempt
我希望table是:
| col1| col2| col3| col4| col5| col6|
| --|--|--|--|--|--|
| XXX| BBB| [{“col31”:“XXX”,“col32”:“XXX”,"col33:1},{“col31”:“XXX”,“col32”:“XXX”,“col33”:1}]| DDD| EEE| GGG|
| XXX| BBB| [{“col31”:“XXX”,“col32”:“XXX”,col33:1}]| DDD||GGG|
而第3列是字典列表格式。

nwo49xxi

nwo49xxi1#

看起来你在引号/转义方面有问题,这可以由pandas从头开始处理,但你必须将正确的参数传递给read_csv,因为默认参数对你不起作用:

  • sep=','指定字段分隔符(默认)
  • quotechar='"'表示"是引号(默认)
    *escapechar='\\'表示\正在转义"中的一些
    *skipinitialspace=True忽略,分隔符后的空格
df = pd.read_csv('your_file.csv', sep=',', quotechar='"',
                 escapechar='\\', skipinitialspace=True)

输出量:

col1 col2                                                                                col3 col4 col5 col6
0  XXX  BBB  [{"col31":"XXX","col32":"XXX","col33":1}, {"col31":"XXX","col32":"XXX","col33":1}]  DDD  EEE  GGG
1  XXX  BBB                                           [{"col31":"XXX","col32":"XXX","col33":1}]  DDD  NaN  GGG

注意,这将导入“字典列表”作为字符串,如果你需要python对象,你可以进一步用途:

import ast

df['col3'] = df['col3'].map(ast.literal_eval)
h6my8fg2

h6my8fg22#

import pandas as pd
import json

# Raw data
data = """
col1, col2, col3, col4, col5, col6
XXX, BBB, "[{\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}, {\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}]", DDD, EEE, GGG
XXX, BBB, "[{\"col31\":\"XXX\",\"col32\":\"XXX\",\"col33\":1}]", DDD, , GGG
"""

def split_outside_brackets(line):
    """Split a string by commas that are outside of square brackets."""
    parts = []
    start = 0
    bracket_depth = 0
    for i, char in enumerate(line):
        if char == '[':
            bracket_depth += 1
        elif char == ']':
            bracket_depth -= 1
        elif char == ',' and bracket_depth == 0:
            parts.append(line[start:i].strip())
            start = i + 1
    parts.append(line[start:].strip())
    return parts

# Split the data into lines
lines = data.strip().split('\n')

# Process each line
processed_lines = [split_outside_brackets(line) for line in lines]

# Convert to a DataFrame
df_result = pd.DataFrame(processed_lines[1:], columns=processed_lines[0])

# Correct the single quotes around col33 and decode the JSON
df_result['col3'] = df_result['col3'].str[1:-1].apply(json.loads)

df_result

相关问题