I'm trying to extract information from each cell in the rows of a dataframe and add it as additional columns.
import json
import pandas as pd
df_nested = pd.read_json('train.json')
df_sample = df_nested.sample(n=50, random_state=0)
display(df_sample)
for index, row in df_sample.iterrows():
    # Raw nested objects stored in this row
    table_json = row['table']
    paragraphs_json = row['paragraphs']
    questions_json = row['questions']
    table = json.loads(json.dumps(table_json)).get("table")
    #print(table)
    paragraphs = [json.loads(json.dumps(x)).get("text") for x in paragraphs_json]
    #print(paragraphs)
    # One list per field, with one entry per question in this row
    questions = [json.loads(json.dumps(x)).get("question") for x in questions_json]
    answer = [json.loads(json.dumps(x)).get("answer") for x in questions_json]
    answer_type = [json.loads(json.dumps(x)).get("answer_type") for x in questions_json]
    program = [json.loads(json.dumps(x)).get("derivation") for x in questions_json]
    print(program)
The dataframe is:
| table | paragraphs | questions |
| ------ | ------ | ------ |
| {"uid": "bf2c6a2f-0b76-4bba-8d3c-2ee02d1b7d73", "table": "[[,,,December 31,,],[,Useful life,2019,2018],[Computer equipment and software,3 - 5 years,$57,474,$52,055],[Furniture and fixtures,7 years,6,096,4,367],[Leasehold improvements,2 - 6 years,22,800,9,987],[Renovation in progress,n/a,8,1,984],[Build-to-suit property,25 years,-,51,058],[Total property and equipment,,86,378,119,451],[Less: accumulated depreciation and amortization,,(49,852),(42,197)],[Total property and equipment, net,,$36,526,$77,254]]"} | [{"uid": "07e28145-95d5-4f9f-b313-ac8c3b4a869f", "text": "Accounts Receivable", "order": "1"}, {"uid": "b41652f7-0e68-4cf6-9723-fec443b1e604", "text": "The following is a summary of Accounts receivable (in thousands):", "order": "2"}] | [{"rel_paragraphs": "[2]", "answer_from": "table-text", "question": "For which years does the table provide information on the company's accounts receivable?", "scale": "", "answer_type": "multi-span", "req_comparison": "false", "order": "1", "uid": "53041a93-1d06-48fd-a478-6f690b8da302", "answer": "[2019,2018]", "derivation": ""}, {"rel_paragraphs": "[2]", "answer_from": "table-text", "question": "What was the accounts receivable in 2018?", "scale": "thousand", "answer_type": "span", "req_comparison": "false", "order": "2", "uid": "a196a61c-43b0-43f5-bb4b-b059a1103c54", "answer": "[225,167]", "derivation": ""}, {"rel_paragraphs": "[2]", "answer_from": "table-text", "question": "What was the allowance for product returns in 2019?", "scale": "thousand", "answer_type": "span", "req_comparison": "false", "order": "3", "uid": "c8656e5e-2bb7-4f03-ae73-0d04492155c0", "answer": "[(25,897)]", "derivation": ""}, {"rel_paragraphs": "[2]", "answer_from": "table-text", "question": "In how many years did net accounts receivable exceed 200,000?", "scale": "", "answer_type": "count", "req_comparison": "false", "order": "4", "uid": "fdf08d3d-d570-4c21-9b3e-a3c86e164665", "answer": "1", "derivation": "2018"}, {"rel_paragraphs": "[2]", "answer_from": "table-text", "question": "What was the change in the allowance for doubtful accounts between 2018 and 2019?", "scale": "thousand", "answer_type": "arithmetic", "req_comparison": "false", "order": "5", "uid": "6ecb2062-daca-4e1e-900e-2b99b2fce929", "answer": "424", "derivation": "-1,054-(-1,478)"}, {"rel_paragraphs": "[]", "answer_from": "table", "question": "What was the percentage change in the allowance for product returns between 2018 and 2019?", "scale": "percent", "answer_type": "arithmetic", "req_comparison": "false", "order": "6", "uid": "f2c1edad-622d-4959-8cd5-a7f2bd2d7bb1", "answer": "129.87", "derivation": "(-25,897+11,266)/-11,266"}] |
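The code above does not do what I want yet. How can I add the values produced inside the df_sample.iterrows() loop, i.e. table, questions, answer, answer_type and so on, as additional columns of the original df_sample dataframe?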
1 Answer
tcomlyy6 #1
Using the dataframe you provided:
Here is one way to do it, using Python's built-in isinstance function and the "walrus" operator (:=), together with Pandas explode, json_normalize and concat:
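A minimal sketch of how those pieces could fit together, not necessarily the exact code the answer had in mind. It assumes Python 3.8+ (for the walrus operator) and uses a hypothetical, heavily abridged one-row stand-in for the dataframe. The helper name `flatten` and the dotted column prefixes are illustrative choices.

```python
import pandas as pd

# Hypothetical one-row stand-in for df_sample (values abridged for illustration).
df = pd.DataFrame({
    "table": [{"uid": "t1", "table": [["", "2019", "2018"], ["Total", "86,378", "119,451"]]}],
    "paragraphs": [[
        {"uid": "p1", "text": "Accounts Receivable", "order": 1},
        {"uid": "p2", "text": "Summary (in thousands):", "order": 2},
    ]],
    "questions": [[
        {"uid": "q1", "question": "Which years?", "answer": ["2019", "2018"],
         "answer_type": "multi-span", "derivation": ""},
        {"uid": "q2", "question": "How many years?", "answer": 1,
         "answer_type": "count", "derivation": "2018"},
    ]],
})


def flatten(frame, col):
    """Give each list item in `col` its own row, then expand dict cells into columns."""
    out = frame
    # Only explode real lists: explode() would also split a dict cell into its keys.
    if any(isinstance(value, list) for value in out[col]):
        out = out.explode(col)
    out = out.reset_index(drop=True)
    # Walrus: bind the cell values once, then test that every cell is a dict.
    if (cells := out[col].tolist()) and all(isinstance(c, dict) for c in cells):
        expanded = pd.json_normalize(cells).add_prefix(f"{col}.")
        out = pd.concat([out.drop(columns=col), expanded], axis=1)
    return out


flat = df
for col in ("table", "paragraphs", "questions"):
    flat = flatten(flat, col)

print(flat.columns.tolist())
```

Flattening both `paragraphs` and `questions` this way yields one row per (paragraph, question) pair, so the row count multiplies. If that is not wanted, flatten only the column of interest.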
Then:
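If the goal is to keep one row per sample and attach the extracted values to the original frame as list-valued columns, as the question asks, a sketch along the following lines may be enough. The frame is again a hypothetical abridged stand-in. The helper `pluck` and the new column name `paragraph_texts` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical one-row stand-in for df_sample (values abridged for illustration).
df_sample = pd.DataFrame({
    "paragraphs": [[
        {"uid": "p1", "text": "Accounts Receivable", "order": 1},
        {"uid": "p2", "text": "Summary (in thousands):", "order": 2},
    ]],
    "questions": [[
        {"uid": "q1", "question": "Which years?", "answer": ["2019", "2018"],
         "answer_type": "multi-span", "derivation": ""},
        {"uid": "q2", "question": "How many years?", "answer": 1,
         "answer_type": "count", "derivation": "2018"},
    ]],
})


def pluck(items, key):
    """Return the value stored under `key` in every dict of `items`."""
    return [item.get(key) for item in items]


# Each new column holds one list per row, aligned with that row's paragraphs or questions.
df_sample["paragraph_texts"] = [pluck(items, "text") for items in df_sample["paragraphs"]]
for key in ("question", "answer", "answer_type", "derivation"):
    df_sample[key] = [pluck(items, key) for items in df_sample["questions"]]

print(df_sample[["question", "answer_type", "derivation"]])
```

Building each column in one pass avoids mutating the frame inside `iterrows()`, which is slow and easy to get wrong.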