Writing a pandas DataFrame to JSON in Unicode

Posted by kiz8lqtg on 2023-04-19 in Other


I am trying to write a pandas DataFrame that contains Unicode characters to JSON, but the built-in .to_json function escapes the characters. How can I fix this?
Example:

import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json')

This gives:

{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

which differs from the desired result:

{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

I tried adding the force_ascii=False argument:

import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json', force_ascii=False)

But this raises the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 11: character maps to <undefined>

I am using WinPython 3.4.4.2 64-bit with pandas 0.18.0.

pandas

Source: https://stackoverflow.com/questions/39612240/writing-pandas-dataframe-to-json-in-unicode

3 Answers


j5fpnvbx #1

Opening a file with the encoding set to utf-8 and then passing that file object to the .to_json function solves the problem:

with open('df.json', 'w', encoding='utf-8') as file:
    df.to_json(file, force_ascii=False)

This gives the correct result:

{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

Note: the force_ascii=False argument is still required.
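
The error in the question most likely comes from the platform's default text encoding rather than from pandas itself. The following is a minimal standard-library sketch of that mechanism, assuming the default was a legacy Windows code page such as cp1252 (the question does not state which one); the JSON string and the temporary path are illustrative only.

```python
import os
import tempfile

# The first part of the JSON text that to_json produces for the example frame.
text = '{"0":{"0":"τ","1":"π"}}'

# Assumption: the default encoding was a legacy code page such as cp1252.
# Such code pages have no mapping for Greek letters, which is what
# "character maps to <undefined>" reports.
try:
    text.encode("cp1252")
    cp1252_ok = True
    error_position = None
except UnicodeEncodeError as exc:
    cp1252_ok = False
    error_position = exc.start  # index of the first unencodable character

# UTF-8 can encode every Unicode code point, so an explicit encoding
# on the file object avoids the error.
path = os.path.join(tempfile.mkdtemp(), "df.json")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

with open(path, encoding="utf-8") as fh:
    round_trip = fh.read()

print(cp1252_ok, error_position, round_trip == text)
```

The failing index is consistent with the "position 11" in the traceback from the question, since that is where τ falls in the compact JSON output.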


piwo6bdm #2

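There is another way to do this. Because JSON consists of keys (strings in double quotes) and values (strings, numbers, nested JSON objects, or arrays), and because it closely resembles a Python dictionary, you can obtain JSON from a pandas DataFrame with a simple conversion and some string manipulation: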

import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])

# convert index values to string (when they're something else - JSON requires strings for keys)
df.index = df.index.map(str)
# convert column names to string (when they're something else - JSON requires strings for keys)
df.columns = df.columns.map(str)

# convert DataFrame to dict, dict to string and simply jsonify quotes from single to double quotes  
js = str(df.to_dict()).replace("'", '"')
print(js) # print or write to file or return as REST...anything you want

Output:

{"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "b"}, "2": {"0": 1, "1": 2}}

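Update: as @Swier pointed out (thank you), strings that contain double quotes in the original DataFrame can cause problems. df.to_json() escapes them (i.e. '"a"' produces "\"a\"" in JSON). With a small update to the string approach, this case can be handled too. Full example: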

import pandas as pd

def run_jsonifier(df):
    # convert index values to string (when they're something else)
    df.index = df.index.map(str)
    # convert column names to string (when they're something else)
    df.columns = df.columns.map(str)

    # convert DataFrame to dict and dict to string
    js = str(df.to_dict())
    # store indices of double quote marks in string for later update
    idx = [i for i, ch in enumerate(js) if ch == '"']
    # jsonify quotes from single to double quotes  
    js = js.replace("'", '"')
    # add \ to original double quotes to make it json-like escape sequence 
    for add, i in enumerate(idx):
        js = js[:i+add] + '\\' + js[i+add:] 
    return js

# define double-quotes-rich dataframe
df = pd.DataFrame([['τ', '"a"', 1], ['π', 'this" breaks >>"<""< ', 2]])

# run our function to convert dataframe to json
print(run_jsonifier(df))
# run original `to_json()` to see difference
print(df.to_json())

Output:

{"0": {"0": "τ", "1": "π"}, "1": {"0": "\"a\"", "1": "this\" breaks >>\"<\"\"< "}, "2": {"0": 1, "1": 2}}
{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"\"a\"","1":"this\" breaks >>\"<\"\"< "},"2":{"0":1,"1":2}}
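
The quote-replacement approach above depends on how Python prints a dict, so values containing an apostrophe, or values such as None, are likely to yield invalid JSON. As a hedged alternative that is not part of the original answer, the sketch below round-trips the output of to_json() through the standard json module and re-serializes it with ensure_ascii=False. The sample frame, including the "it's" value, is made up for illustration.

```python
import json

import pandas as pd

# Illustrative frame with both double quotes and an apostrophe in the values.
df = pd.DataFrame([['τ', '"a"', 1], ['π', "it's", 2]])

# to_json() returns valid JSON text (non-ASCII is \uXXXX-escaped by default),
# so json.loads can parse it into plain dicts, strings, and ints.
parsed = json.loads(df.to_json())

# ensure_ascii=False keeps τ and π as literal characters, while the json
# module still escapes embedded double quotes correctly.
js = json.dumps(parsed, ensure_ascii=False)
print(js)
```

The result is an ordinary str, so it can be written to a file opened with encoding='utf-8' or returned from a web handler as needed.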


deyfvvtc #3

I ran into the same problem, although I was not writing to a file. The solution was to encode the string as 'utf-8': df.to_json(force_ascii=False).encode('utf-8')
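
A short sketch of how that one-liner might be used in practice: when no path is given, to_json returns a str, and .encode('utf-8') turns it into bytes that can go to a binary file or a socket without relying on the platform's default encoding. The temporary file path here is hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])

# Without a path argument, to_json returns the JSON text as a str;
# encoding it explicitly produces UTF-8 bytes.
payload = df.to_json(force_ascii=False).encode('utf-8')

# Binary mode writes the bytes as-is, so no implicit encoding step is involved.
path = os.path.join(tempfile.mkdtemp(), 'df.json')
with open(path, 'wb') as fh:
    fh.write(payload)

# Decoding with the same codec recovers the original text.
decoded = payload.decode('utf-8')
print(decoded)
```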
