pandas parser error: cannot convert a txt file to a DataFrame because the JSON content uses the same character as the delimiter

hc8w905p  posted on 2022-12-02  in  Other

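I am a beginner working with a .txt file in which every line contains a dictionary. I am trying to load it with pd.read_csv to create a DataFrame in pandas, but it throws: Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I have found the root cause: the file is hard to parse because each line contains a dictionary whose key-value pairs are separated by commas, and the comma is also the delimiter here.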
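To make the mismatch concrete, here is a minimal sketch that counts comma-separated pieces in the header and in one data row. The row is a shortened, hypothetical version of the first line of store.txt, and a plain `str.split` is only an approximation of what the pandas tokenizer does, so the counts will not match the numbers in the error message exactly.

```python
header = "id,name,storeid,report"
# Shortened, hypothetical row in the same shape as the rows in store.txt
row = '11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true}'

header_fields = header.split(",")
row_fields = row.split(",")

# The header promises 4 fields, but the commas inside the JSON add more
print(len(header_fields))  # 4
print(len(row_fields))     # 6
```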

Data (store.txt)

id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i     do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}

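How can I read this csv file and create new columns based on the keys? Also, what if other data has more keys, for example a dictionary with more than 11 keys?
Is there an efficient way to create a df from the example above?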

The code I used when trying to read it as a csv

import pandas as pd
df = pd.read_csv('store.txt', header=None)

I tried importing json and using a converter, but that did not work, and I also converted all the commas to a |

import pandas as pd
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")

I also tried using:

import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")

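I was also thinking of adding a quote at the beginning and end of the dictionary to turn it into a string, but I thought that would be too tedious because of having to find the closing brace.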
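If the JSON is always the last column, the closing brace may not need to be located at all: each line can be split on the first three commas only, and whatever remains is the JSON text. The sketch below assumes that layout and uses a small in-memory stand-in for store.txt (the `sample` variable and its shortened, valid JSON are hypothetical); with the real file, the `StringIO` object would be replaced by the opened file.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical stand-in for store.txt with shortened, valid JSON
sample = StringIO(
    'id,name,storeid,report\n'
    '11,JohnSmith,3221-123-555,{"Source":"online","itemsboughtid":[]}\n'
    '35,BillyDan,3221-123-555, {"Source":"letter","itemsboughtid":[1,42]}\n'
)

header = sample.readline().strip().split(",")
rows = []
for line in sample:
    line = line.strip()
    if not line:
        continue
    # Split on the first three commas only; the remainder is the JSON text
    *plain, report = line.split(",", len(header) - 1)
    rows.append(plain + [json.loads(report)])

# The plain columns stay strings here; convert them afterwards if needed
df = pd.DataFrame(rows, columns=header)
print(df)
```

This only works while none of the first three columns can contain a comma themselves.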

Answer #1 (sqxo8psd)

I think adding quotes around the dictionary is the right approach. You can do this with a regex, using a quote character other than " (I used § in my example):

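from io import StringIO
import json
import re

import pandas as pd

# Wrap the JSON object on each line in § so that § can serve as the quote character.
# "." does not match newlines, so the greedy match stays within a single line.
with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

# skipinitialspace handles the space before the JSON on the third data row
df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")

# Parse each report string and expand its top-level keys into columns
# (json.loads raises on the third row until "code":001 is corrected)
df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(json.loads).values.tolist())
], axis=1)

print(df_out)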

Note: the last value in the csv is not valid JSON: "code":001. It should be either "code":"001" or "code":1.
Output:

   id        name       storeid  Source  ...  itemtrack                               info    itemsboughtid                                         stolenitem
0  11   JohnSmith  3221-123-555  online  ...        110  {'haircolor': 'black', 'age': 53}               []  [{'item': 'candy', 'code': 1}, {'item': 'candy...
1  35    BillyDan  3221-123-555  letter  ...        110  {'haircolor': 'black', 'age': 21}  [1, 42, 465, 5]                      [{'item': 'shoe', 'code': 2}]
2  64  NickWalker  3221-123-555  letter  ...        110    {'haircolor': 'red', 'age': 22}           [1, 2]  [{'item': 'sweater', 'code': 11}, {'item': 'ma...
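In the output above, the nested info dictionary is still a single column. If separate columns for its keys are wanted, one option is `pd.json_normalize`, which flattens nested dictionaries into dotted column names and leaves lists untouched. The sketch below uses a hypothetical `reports` list, shortened from the sample data, in place of the parsed report column; a key that is missing from some rows shows up as NaN in those rows, so rows with extra keys do not need special handling.

```python
import pandas as pd

# Hypothetical parsed reports, shortened from the sample data
reports = [
    {"Source": "online", "info": {"haircolor": "black", "age": 53}, "itemsboughtid": []},
    {"Source": "letter", "info": {"haircolor": "red", "age": 22}, "itemsboughtid": [1, 2]},
]

# Nested dicts become dotted column names such as "info.age";
# list values such as itemsboughtid are kept as lists
flat = pd.json_normalize(reports)
print(flat.columns.tolist())
```

With the answer's code, the same call could take the place of the inner `pd.DataFrame(...)`, for example `pd.json_normalize(df["report"].apply(json.loads).tolist())`.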
