如何使用Python格式化数据/列,以修复每个记录ID之间的不对齐问题,以便数据在表中显示一致?

llmtgqce  于 2023-05-19  发布在  Python
关注(0)|答案(1)|浏览(154)

为了上下文,我编写了以下Python代码,在我处理JSON数据之后,以规范化格式将JSON数据导出到CSV文件中(此时,我只导出了两个记录ID用于测试目的):

import pandas as pd
import os
from openpyxl.workbook import Workbook
import csv
from pathlib import Path
from pandas.io.json import json_normalize
import json
from datetime import datetime
from datetime import date
from datetime import timedelta
import psycopg2
from psycopg2 import OperationalError
# Import our files
import pg  # Various functions to interface with the Postgres servers
from db_creds import *  # The DB server and user creds

#try:
    # Connect to an existing database
connection = pg.create_connection(sourceDB_setting[3], sourceDB_setting[5], sourceDB_setting[6], sourceDB_setting[1], sourceDB_setting[2])
#Create a cursor to perform database operations
cursor = connection.cursor()

cursor.execute("SELECT application_data, id FROM ssap_applications LIMIT 2;")
results = cursor.fetchall()

#if os.path.exists('App_data3.csv'):
    #os.remove('App_data3.csv')

for row in results:
    jrec, app_id = row
    # Process each row here
    #print(jrec)
    jrec = json.loads(jrec)
    normal_json = pd.json_normalize(jrec)
    normal_json.columns = normal_json.columns.str.split('.').str[-1]
    print(normal_json.columns)
    #normal_json.columns = normal_json.columns.str.startswith("singleStreamlinedApplication.authorizedRepresentative.")
    #normal_json.columns = normal_json.columns.str.lstrip("singleStreamlinedApplication")
    #normal_json.columns = normal_json.columns.str.lstrip(".")
    print(normal_json)
    # save to csv
    normal_json.to_csv('App_data3.csv', header=True, index=False, mode='a', encoding='utf-8')

cursor.close()

导出两个记录ID后,我看到第一个ID有75列,第二个ID有71列。这是一个问题,因为列数的变化会导致电子表格表格中的数据不对齐。下面是一个简化的例子:
第一个ID:

ApplicationType  applicationStatus  hasUserElectronicallySigned  applyingForhouseHold  applyingForFinancialAssistanceIndicator
OE                  SIGNED                                       houseHoldContactOnly  FALSE

第二个ID:

ApplicationType  applicationStatus   applyingForhouseHold   applyingForFinancialAssistanceIndicator       
OE               SIGNED              houseHoldContactOnly   TRUE

我想要的是:

第一个ID:

ApplicationType  applicationStatus   applyingForhouseHold applyingForFinancialAssistanceIndicator   hasUserElectronicallySigned
OE                  SIGNED           houseHoldContactOnly  FALSE

第二个ID:

ApplicationType  applicationStatus   applyingForhouseHold   applyingForFinancialAssistanceIndicator   hasUserElectronicallySigned       
OE               SIGNED              houseHoldContactOnly   TRUE

所以基本上,我希望每个ID的列数对齐。如果第一个ID有75列,第二个ID也应该有75列。并且列的排列应该是一致的,如果列1、2、3、4、5是ApplicationType、applicationStatus、applyingForhouseHold、applyingForFinancialAssistanceIndicator、hasUserElectronicallySigned,那么第一个和第二个ID应该分别以列的这个顺序显示正确的数据。如何修改Python脚本以实现所有ID的预期结果?
仅供参考,这是简化版本中每个记录ID的原始JSON数据(我的意思是ssapApplicationId,在这种情况下,您可以出于测试目的制作ID 1和ID 2等ID,为了简单起见,我只保留了几列,而不是75,71列)
第一个ID:

{'singleStreamlinedApplication': {'ApplicationType': 'OE', 'applicationStatus': 'SIGNED', 'authorizedRepresentative': {'hasUserElectronicallySigned': None}, 'applyingForhouseHold': 'houseHoldContactOnly', 'applyingForFinancialAssistanceIndicator': False}}

第二个ID:

{'singleStreamlinedApplication': {'ApplicationType': 'OE', 'applicationStatus': 'SIGNED', 'applyingForhouseHold': 'houseHoldContactOnly', 'applyingForFinancialAssistanceIndicator': True}}

我想使用pandas.DataFrame.align,但不确定它是否对我有帮助。

如果需要我提供更多信息,请告诉我,以便澄清。

aurhwmvo

aurhwmvo1#

解决了。这是我所做的,功劳归于https://www.reddit.com/user/danielroseman/

normalized = []
for row in results:
jrec, app_id = row
jrec = json.loads(jrec)
jrec = pd.json_normalize(jrec)
jrec.columns = jrec.columns.str.split('.').str[-1]
normalized.append(jrec)

combined = pd.concat(normalized)
combined.to_csv('App_data1.csv', header=True, index=False, encoding='utf-8')

基本上,我只是将所有数据保存到一个dataframe中,然后一次将它们全部导出。

相关问题