为了上下文,我编写了以下Python代码,在我处理JSON数据之后,以规范化格式将JSON数据导出到CSV文件中(此时,我只导出了两个记录ID用于测试目的):
import pandas as pd
import os
from openpyxl.workbook import Workbook
import csv
from pathlib import Path
from pandas.io.json import json_normalize
import json
from datetime import datetime
from datetime import date
from datetime import timedelta
import psycopg2
from psycopg2 import OperationalError
# Import our files
import pg # Various functions to interface with the Postgres servers
from db_creds import * # The DB server and user creds
#try:
# Connect to an existing database
connection = pg.create_connection(sourceDB_setting[3], sourceDB_setting[5], sourceDB_setting[6], sourceDB_setting[1], sourceDB_setting[2])
#Create a cursor to perform database operations
cursor = connection.cursor()
cursor.execute("SELECT application_data, id FROM ssap_applications LIMIT 2;")
results = cursor.fetchall()
#if os.path.exists('App_data3.csv'):
#os.remove('App_data3.csv')
for row in results:
jrec, app_id = row
# Process each row here
#print(jrec)
jrec = json.loads(jrec)
normal_json = pd.json_normalize(jrec)
normal_json.columns = normal_json.columns.str.split('.').str[-1]
print(normal_json.columns)
#normal_json.columns = normal_json.columns.str.startswith("singleStreamlinedApplication.authorizedRepresentative.")
#normal_json.columns = normal_json.columns.str.lstrip("singleStreamlinedApplication")
#normal_json.columns = normal_json.columns.str.lstrip(".")
print(normal_json)
# save to csv
normal_json.to_csv('App_data3.csv', header=True, index=False, mode='a', encoding='utf-8')
cursor.close()
导出两个记录ID后,我看到第一个ID有75列,第二个ID有71列。这是一个问题,因为列数的变化会导致电子表格表格中的数据不对齐。下面是一个简化的例子:
第一个ID:
ApplicationType applicationStatus hasUserElectronicallySigned applyingForhouseHold applyingForFinancialAssistanceIndicator
OE SIGNED houseHoldContactOnly FALSE
第二个ID:
ApplicationType applicationStatus applyingForhouseHold applyingForFinancialAssistanceIndicator
OE SIGNED houseHoldContactOnly TRUE
我想要的是:
第一个ID:
ApplicationType applicationStatus applyingForhouseHold applyingForFinancialAssistanceIndicator hasUserElectronicallySigned
OE SIGNED houseHoldContactOnly FALSE
第二个ID:
ApplicationType applicationStatus applyingForhouseHold applyingForFinancialAssistanceIndicator hasUserElectronicallySigned
OE SIGNED houseHoldContactOnly TRUE
所以基本上,我希望每个ID的列数对齐。如果第一个ID有75列,第二个ID也应该有75列。并且列的排列应该是一致的,如果列1、2、3、4、5是ApplicationType、applicationStatus、applyingForhouseHold、applyingForFinancialAssistanceIndicator、hasUserElectronicallySigned,那么第一个和第二个ID应该分别以列的这个顺序显示正确的数据。如何修改Python脚本以实现所有ID的预期结果?
仅供参考,这是简化版本中每个记录ID的原始JSON数据(我的意思是ssapApplicationId,在这种情况下,您可以出于测试目的制作ID 1和ID 2等ID,为了简单起见,我只保留了几列,而不是75,71列)
第一个ID:
{'singleStreamlinedApplication': {'ApplicationType': 'OE', 'applicationStatus': 'SIGNED', 'authorizedRepresentative': {'hasUserElectronicallySigned': None}, 'applyingForhouseHold': 'houseHoldContactOnly', 'applyingForFinancialAssistanceIndicator': False}}
第二个ID:
{'singleStreamlinedApplication': {'ApplicationType': 'OE', 'applicationStatus': 'SIGNED', 'applyingForhouseHold': 'houseHoldContactOnly', 'applyingForFinancialAssistanceIndicator': True}}
我想使用pandas.DataFrame.align
,但不确定它是否对我有帮助。
如果需要我提供更多信息,请告诉我,以便澄清。
1条答案
按热度按时间aurhwmvo1#
解决了。这是我所做的,功劳归于https://www.reddit.com/user/danielroseman/:
基本上,我只是将所有数据保存到一个dataframe中,然后一次将它们全部导出。