解析dataframe中的JSON字符串并将提取的信息插入到另一列中

bihw5rsg 于 2023-05-30 发布在其他

关注(0)|答案(1)|浏览(391)

我正在尝试从数据框的行中的每个单元格中提取信息，并将它们添加为另一列。

import json
import pandas as pd
df_nested = pd.read_json('train.json')
df_sample = df_nested.sample(n=50, random_state=0)
display(df_sample)

for index, row in df_sample.iterrows():
    table_json = row['table']
    paragraphs_json = row['paragraphs']
    questions_json = row['questions']
    table = json.loads(json.dumps(table_json)).get("table")
    #print(table)
    paragraphs = [json.loads(json.dumps(x)).get("text") for x in paragraphs_json]
    #print(paragraphs)
    questions = [json.loads(json.dumps(x)).get("question") for x in questions_json]
    answer = [json.loads(json.dumps(x)).get("answer") for x in questions_json]
    answer_type = [json.loads(json.dumps(x)).get("answer_type") for x in questions_json]
    program = [json.loads(json.dumps(x)).get("derivation") for x in questions_json]
    print(program)

dataframe是
| 表|段落|问题|
| - -----|- -----|- -----|
| {“uid”：“bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73”，“table”：“[[，，，十二月三十一日，，]，[，使用寿命，2019，2018]，[计算机设备和软件，3 - 5年，$57，474，$52，055]，[家具和固定装置，7年，6，096，4，367]，[租赁改良，2 - 6年，22，800，9，987]，[正在进行的装修，n/a，8，1，984]，[按需建造财产，25年，-，51，058]，[财产和设备总额，，86，378，119，451]，[减：累计折旧和摊销，，（49，852），（42，197）]，[财产和设备合计，净额，，$36，526，$77，254]]"}|[{“uid”：“07e28145-95d5-4f9f-b313-ac8c3b4a869f”，“text”：“应收账款”，“订单”：“1”}，{“uid”：“b41652f7-0e68-4cf6-9723-fec443b1e604”，“text”：“以下是应收账款汇总表（千）：“，“订单”：“2”}|[{“rel_paragraph”：“[2]"，“answer_from”：“table-text”，“question”：“该表提供了公司应收账款的哪些年份的信息？“，“scale”：““，“answer_type”：“multi-span”，“req_comparison”：“false”，“order”：“% 1”，“uid”：“53041a93-1d06-48fd-a478-6f690b8da302”，“answer”：“[2019，2018]"，“派生”：“"}，{“rel_paragraphs”：“[2]"，“answer_from”：“table-text”，“question”：“2018年应收账款是多少？“，“scale”：“千”，“answer_type”：“span”，“req_comparison”：“false”，“order”：“% 2”，“uid”：“a196a61c-43b0-43f5-bb4b-b059a1103c54”，“answer”：“[225，167]"，“派生”：“"}，{“rel_paragraphs”：“[2]"，“answer_from”：“table-text”，“question”：“2019年的产品退货准备金是多少？“，“scale”：“千”，“answer_type”：“span”，“req_comparison”：“false”，“order”：“3”，“uid”：“c8656e5e-2bb7-4f03-ae73-0d04492155c0”，“answer”：“[（25，897）]"，“衍生”：“"}，{“rel_paragraphs”：“[2]"，“answer_from”：“table-text”，“question”：“应收账款净额超过20万是多少年的？“，“scale”：““，“answer_type”：“count”，“req_comparison”：“false”，“order”：“4”，“uid”：“fdf08d3d-d570-4c21-9b3e-a3c86e164665”，“answer”：“1”，“派生”：“2018”}，{“rel_paragraphs”：“[2]"，“answer_from”：“table-text”，“question”：“2018年和2019年之间的可疑账户准备金有什么变化？“，“scale”：“千”，“answer_type”：“arithmetic”，“req_comparison”：“false”，“order”：“5”，“uid”：“6ecb2062-daca-4e1e-900e-2b99b2fce929”，“answer”：“424”，“派生”：“-1，054-（-1，478）"}，{“rel_paragraphs”：“[]"，“answer_from”：“表”、“问题”：“2018年和2019年之间产品退货补贴的百分比变化是多少？“，“scale”：“百分比”，“answer_type”：“arithmetic”，“req_comparison”：“false”，“order”：“6”，“uid”：“f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1”，“answer”：“129.87”，“推导”：“（-25,897+11,266）/-11,266”}]|
上面的代码不是一个有效的代码。但是，如何将df_sample.iterrows()的输出相加，即table, questions, answers, answer_type等。作为原始df_sample Dataframe 中的另一列

JSON

来源：https://stackoverflow.com/questions/76331409/parse-json-string-within-dataframe-and-insert-extracted-information-into-another

1条答案

按热度按时间

tcomlyy61#

使用您提供的dataframe：

import pandas as pd

df = pd.DataFrame(
    {
        "table": [
            {
                "uid": "bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73",
                "table": "[[, , December 31,,], [, Useful Life, 2019, 2018], [Computer equipment and software, 3 â\x80\x93 5 years, $57,474, $52,055], [Furniture and fixtures, 7 years, 6,096, 4,367], [Leasehold improvements, 2 â\x80\x93 6 years, 22,800, 9,987], [Renovation in progress, n/a, 8, 1,984], [Build-to-suit property, 25 years, â\x80\x94, 51,058], [Total property and equipment, gross, , 86,378, 119,451], [Less: accumulated depreciation and amortization, , (49,852), (42,197)], [Total property and equipment, net, , $36,526, $77,254]]",
            }
        ],
        "paragraphs": [
            [
                {
                    "uid": "07e28145-95d5-4f9f-b313-ac8c3b4a869f",
                    "text": "Accounts Receivable",
                    "order": "1",
                },
                {
                    "uid": "b41652f7-0e68-4cf6-9723-fec443b1e604",
                    "text": "The following is a summary of Accounts receivable (in thousands):",
                    "order": "2",
                },
            ]
        ],
        "questions": [
            [
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "Which years does the table provide information for the company's Accounts receivable?",
                    "scale": "",
                    "answer_type": "multi-span",
                    "req_comparison": "false",
                    "order": "1",
                    "uid": "53041a93-1d06-48fd-a478-6f690b8da302",
                    "answer": "[2019, 2018]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the amount of accounts receivable in 2018?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "2",
                    "uid": "a196a61c-43b0-43f5-bb4b-b059a1103c54",
                    "answer": "[225,167]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the allowance for product returns in 2019?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "3",
                    "uid": "c8656e5e-2bb7-4f03-ae73-0d04492155c0",
                    "answer": "[(25,897)]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "How many years did the net accounts receivable exceed $200,000 thousand?",
                    "scale": "",
                    "answer_type": "count",
                    "req_comparison": "false",
                    "order": "4",
                    "uid": "fdf08d3d-d570-4c21-9b3e-a3c86e164665",
                    "answer": "1",
                    "derivation": "2018",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the change in the Allowance for doubtful accounts between 2018 and 2019?",
                    "scale": "thousand",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "5",
                    "uid": "6ecb2062-daca-4e1e-900e-2b99b2fce929",
                    "answer": "424",
                    "derivation": "-1,054-(-1,478)",
                },
                {
                    "rel_paragraphs": "[]",
                    "answer_from": "table",
                    "question": "What was the percentage change in the Allowance for product returns between 2018 and 2019?",
                    "scale": "percent",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "6",
                    "uid": "f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1",
                    "answer": "129.87",
                    "derivation": "(-25,897+11,266)/-11,266",
                },
            ]
        ],
    }
)

下面是一种使用Python内置函数isinstance和“walrus”以及Pandas explode、json_normalize和concat的方法：

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1).drop(columns="uid")

然后：

print(df)
# Output

                       table                      text order rel_paragraphs   
0   [[, , December 31,,],...       Accounts Receivable     1            [2]  \
1   [[, , December 31,,],...       Accounts Receivable     1            [2]   
2   [[, , December 31,,],...       Accounts Receivable     1            [2]   
3   [[, , December 31,,],...       Accounts Receivable     1            [2]   
4   [[, , December 31,,],...       Accounts Receivable     1            [2]   
5   [[, , December 31,,],...       Accounts Receivable     1             []   
6   [[, , December 31,,],...  The following is a su...     2            [2]   
7   [[, , December 31,,],...  The following is a su...     2            [2]   
8   [[, , December 31,,],...  The following is a su...     2            [2]   
9   [[, , December 31,,],...  The following is a su...     2            [2]   
10  [[, , December 31,,],...  The following is a su...     2            [2]   
11  [[, , December 31,,],...  The following is a su...     2             []   

   answer_from                  question     scale answer_type req_comparison   
0   table-text  Which years does the ...            multi-span          false  \
1   table-text  What was the amount o...  thousand        span          false   
2   table-text  What was the allowanc...  thousand        span          false   
3   table-text  How many years did th...                 count          false   
4   table-text  What was the change i...  thousand  arithmetic          false   
5        table  What was the percenta...   percent  arithmetic          false   
6   table-text  Which years does the ...            multi-span          false   
7   table-text  What was the amount o...  thousand        span          false   
8   table-text  What was the allowanc...  thousand        span          false   
9   table-text  How many years did th...                 count          false   
10  table-text  What was the change i...  thousand  arithmetic          false   
11       table  What was the percenta...   percent  arithmetic          false   

   order        answer                derivation  
0      1  [2019, 2018]
1      2     [225,167]
2      3    [(25,897)]
3      4             1                      2018  
4      5           424           -1,054-(-1,478)  
5      6        129.87  (-25,897+11,266)/-11,266  
6      1  [2019, 2018]
7      2     [225,167]
8      3    [(25,897)]
9      4             1                      2018  
10     5           424           -1,054-(-1,478)  
11     6        129.87  (-25,897+11,266)/-11,266

赞(0）回复(0）举报 2023-05-30

我来回答

解析dataframe中的JSON字符串并将提取的信息插入到另一列中

1条答案

相关问题

热门标签

最新问答