解析dataframe中的JSON字符串并将提取的信息插入到另一列中

bihw5rsg  于 2023-05-30  发布在  其他
关注(0)|答案(1)|浏览(391)

我正在尝试从数据框的行中的每个单元格中提取信息,并将它们添加为另一列。

import json
import pandas as pd
df_nested = pd.read_json('train.json')
df_sample = df_nested.sample(n=50, random_state=0)
display(df_sample)

for index, row in df_sample.iterrows():
    table_json = row['table']
    paragraphs_json = row['paragraphs']
    questions_json = row['questions']
    table = json.loads(json.dumps(table_json)).get("table")
    #print(table)
    paragraphs = [json.loads(json.dumps(x)).get("text") for x in paragraphs_json]
    #print(paragraphs)
    questions = [json.loads(json.dumps(x)).get("question") for x in questions_json]
    answer = [json.loads(json.dumps(x)).get("answer") for x in questions_json]
    answer_type = [json.loads(json.dumps(x)).get("answer_type") for x in questions_json]
    program = [json.loads(json.dumps(x)).get("derivation") for x in questions_json]
    print(program)

dataframe是
| 表|段落|问题|
| - -----|- -----|- -----|
| {“uid”:“bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73”,“table”:“[[,,,十二月三十一日,,],[,使用寿命,2019,2018],[计算机设备和软件,3 - 5年,$57,474,$52,055],[家具和固定装置,7年,6,096,4,367],[租赁改良,2 - 6年,22,800,9,987],[正在进行的装修,n/a,8,1,984],[按需建造财产,25年,-,51,058],[财产和设备总额,,86,378,119,451],[减:累计折旧和摊销,,(49,852),(42,197)],[财产和设备合计,净额,,$36,526,$77,254]]"}|[{“uid”:“07e28145-95d5-4f9f-b313-ac8c3b4a869f”,“text”:“应收账款”,“订单”:“1”},{“uid”:“b41652f7-0e68-4cf6-9723-fec443b1e604”,“text”:“以下是应收账款汇总表(千):“,“订单”:“2”}|[{“rel_paragraph”:“[2]",“answer_from”:“table-text”,“question”:“该表提供了公司应收账款的哪些年份的信息?“,“scale”:““,“answer_type”:“multi-span”,“req_comparison”:“false”,“order”:“% 1”,“uid”:“53041a93-1d06-48fd-a478-6f690b8da302”,“answer”:“[2019,2018]",“派生”:“"},{“rel_paragraphs”:“[2]",“answer_from”:“table-text”,“question”:“2018年应收账款是多少?“,“scale”:“千”,“answer_type”:“span”,“req_comparison”:“false”,“order”:“% 2”,“uid”:“a196a61c-43b0-43f5-bb4b-b059a1103c54”,“answer”:“[225,167]",“派生”:“"},{“rel_paragraphs”:“[2]",“answer_from”:“table-text”,“question”:“2019年的产品退货准备金是多少?“,“scale”:“千”,“answer_type”:“span”,“req_comparison”:“false”,“order”:“3”,“uid”:“c8656e5e-2bb7-4f03-ae73-0d04492155c0”,“answer”:“[(25,897)]",“衍生”:“"},{“rel_paragraphs”:“[2]",“answer_from”:“table-text”,“question”:“应收账款净额超过20万是多少年的?“,“scale”:““,“answer_type”:“count”,“req_comparison”:“false”,“order”:“4”,“uid”:“fdf08d3d-d570-4c21-9b3e-a3c86e164665”,“answer”:“1”,“派生”:“2018”},{“rel_paragraphs”:“[2]",“answer_from”:“table-text”,“question”:“2018年和2019年之间的可疑账户准备金有什么变化?“,“scale”:“千”,“answer_type”:“arithmetic”,“req_comparison”:“false”,“order”:“5”,“uid”:“6ecb2062-daca-4e1e-900e-2b99b2fce929”,“answer”:“424”,“派生”:“-1,054-(-1,478)"},{“rel_paragraphs”:“[]",“answer_from”:“表”、“问题”:“2018年和2019年之间产品退货补贴的百分比变化是多少?“,“scale”:“百分比”,“answer_type”:“arithmetic”,“req_comparison”:“false”,“order”:“6”,“uid”:“f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1”,“answer”:“129.87”,“推导”:“(-25,897+11,266)/-11,266”}]|
上面的代码不是一个有效的代码。但是,如何将df_sample.iterrows()的输出相加,即table, questions, answers, answer_type等。作为原始df_sample Dataframe 中的另一列

tcomlyy6

tcomlyy61#

使用您提供的dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "table": [
            {
                "uid": "bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73",
                "table": "[[, , December 31,,], [, Useful Life, 2019, 2018], [Computer equipment and software, 3 â\x80\x93 5 years, $57,474, $52,055], [Furniture and fixtures, 7 years, 6,096, 4,367], [Leasehold improvements, 2 â\x80\x93 6 years, 22,800, 9,987], [Renovation in progress, n/a, 8, 1,984], [Build-to-suit property, 25 years, â\x80\x94, 51,058], [Total property and equipment, gross, , 86,378, 119,451], [Less: accumulated depreciation and amortization, , (49,852), (42,197)], [Total property and equipment, net, , $36,526, $77,254]]",
            }
        ],
        "paragraphs": [
            [
                {
                    "uid": "07e28145-95d5-4f9f-b313-ac8c3b4a869f",
                    "text": "Accounts Receivable",
                    "order": "1",
                },
                {
                    "uid": "b41652f7-0e68-4cf6-9723-fec443b1e604",
                    "text": "The following is a summary of Accounts receivable (in thousands):",
                    "order": "2",
                },
            ]
        ],
        "questions": [
            [
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "Which years does the table provide information for the company's Accounts receivable?",
                    "scale": "",
                    "answer_type": "multi-span",
                    "req_comparison": "false",
                    "order": "1",
                    "uid": "53041a93-1d06-48fd-a478-6f690b8da302",
                    "answer": "[2019, 2018]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the amount of accounts receivable in 2018?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "2",
                    "uid": "a196a61c-43b0-43f5-bb4b-b059a1103c54",
                    "answer": "[225,167]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the allowance for product returns in 2019?",
                    "scale": "thousand",
                    "answer_type": "span",
                    "req_comparison": "false",
                    "order": "3",
                    "uid": "c8656e5e-2bb7-4f03-ae73-0d04492155c0",
                    "answer": "[(25,897)]",
                    "derivation": "",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "How many years did the net accounts receivable exceed $200,000 thousand?",
                    "scale": "",
                    "answer_type": "count",
                    "req_comparison": "false",
                    "order": "4",
                    "uid": "fdf08d3d-d570-4c21-9b3e-a3c86e164665",
                    "answer": "1",
                    "derivation": "2018",
                },
                {
                    "rel_paragraphs": "[2]",
                    "answer_from": "table-text",
                    "question": "What was the change in the Allowance for doubtful accounts between 2018 and 2019?",
                    "scale": "thousand",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "5",
                    "uid": "6ecb2062-daca-4e1e-900e-2b99b2fce929",
                    "answer": "424",
                    "derivation": "-1,054-(-1,478)",
                },
                {
                    "rel_paragraphs": "[]",
                    "answer_from": "table",
                    "question": "What was the percentage change in the Allowance for product returns between 2018 and 2019?",
                    "scale": "percent",
                    "answer_type": "arithmetic",
                    "req_comparison": "false",
                    "order": "6",
                    "uid": "f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1",
                    "answer": "129.87",
                    "derivation": "(-25,897+11,266)/-11,266",
                },
            ]
        ],
    }
)

下面是一种使用Python内置函数isinstance和“walrus”以及Pandas explodejson_normalizeconcat的方法:

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1).drop(columns="uid")

然后:

print(df)
# Output

                       table                      text order rel_paragraphs   
0   [[, , December 31,,],...       Accounts Receivable     1            [2]  \
1   [[, , December 31,,],...       Accounts Receivable     1            [2]   
2   [[, , December 31,,],...       Accounts Receivable     1            [2]   
3   [[, , December 31,,],...       Accounts Receivable     1            [2]   
4   [[, , December 31,,],...       Accounts Receivable     1            [2]   
5   [[, , December 31,,],...       Accounts Receivable     1             []   
6   [[, , December 31,,],...  The following is a su...     2            [2]   
7   [[, , December 31,,],...  The following is a su...     2            [2]   
8   [[, , December 31,,],...  The following is a su...     2            [2]   
9   [[, , December 31,,],...  The following is a su...     2            [2]   
10  [[, , December 31,,],...  The following is a su...     2            [2]   
11  [[, , December 31,,],...  The following is a su...     2             []   

   answer_from                  question     scale answer_type req_comparison   
0   table-text  Which years does the ...            multi-span          false  \
1   table-text  What was the amount o...  thousand        span          false   
2   table-text  What was the allowanc...  thousand        span          false   
3   table-text  How many years did th...                 count          false   
4   table-text  What was the change i...  thousand  arithmetic          false   
5        table  What was the percenta...   percent  arithmetic          false   
6   table-text  Which years does the ...            multi-span          false   
7   table-text  What was the amount o...  thousand        span          false   
8   table-text  What was the allowanc...  thousand        span          false   
9   table-text  How many years did th...                 count          false   
10  table-text  What was the change i...  thousand  arithmetic          false   
11       table  What was the percenta...   percent  arithmetic          false   

   order        answer                derivation  
0      1  [2019, 2018]
1      2     [225,167]
2      3    [(25,897)]
3      4             1                      2018  
4      5           424           -1,054-(-1,478)  
5      6        129.87  (-25,897+11,266)/-11,266  
6      1  [2019, 2018]
7      2     [225,167]
8      3    [(25,897)]
9      4             1                      2018  
10     5           424           -1,054-(-1,478)  
11     6        129.87  (-25,897+11,266)/-11,266

相关问题