更pythonic(和更快)的方式来提取的dicts从字段列表,解析,并添加为行的DF

vlf7wbxs  于 2023-05-30  发布在  Python
关注(0)|答案(1)|浏览(178)

我有一个DF,它有一堆包含JSON信息列表的字段(调用)。

***注意:***对于其中的一些pd.json_normalize()工作,但对于其他人,我必须手动解析数据(不知道为什么json_normalize不能与这些工作。也许是因为它们是JSON的列表,而不仅仅是JSON。
代码

# This works on the first field, which is just JSON, not a list of JSONs
expanded_df = pd.concat([method_df, 
                pd.json_normalize(method_df.reportableinformation_json)], 
                axis=1).drop('reportableinformation_json', axis = 1)

# But the other fields, I do this...
def parseSections(row, prefix):
    columns = row.index
    
    _df = pd.DataFrame(row).transpose()

    for section in row[prefix]:
        _df = pd.concat([_df, pd.json_normalize(section)])

    for col in columns:
        _df[col].fillna(method='pad', inplace=True)

    _df = _df.rename(columns={  'ID': f'{prefix}.ID', 
                                'Name': f'{prefix},.Name',
                                'Tables': f'{prefix}.Tables',
                                'Sections': f'{prefix}.Sections',
                                'Parameters': f'{prefix}.Parameters',
                       })    

    return _df

# This calls parseSections with just one row. 
# Ultimately, I'd like to call this with either _df.apply() or using `for idx, row in _df.iterrow()`
parseSections(_df.iloc[0], prefix = "MethodDescription.Sections")

原件

| | id|结果集ID| resultsetrevision_id|修订|姓名|方法类型|标识符|创建|创建者|最后修改|最后修改者|audittrail_json|批准状态|风味|仪器技术|文件名|路径|最后保存的|身份证|韦尔| MethodDescription.ID | MethodDescription.Name | MethodDescription.Tables | MethodDescription.Sections | MethodDescription.Parameters |
| - -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|

之后

| | id|结果集ID| resultsetrevision_id|修订|姓名|方法类型|标识符|创建|创建者|最后修改|最后修改者|audittrail_json|批准状态|风味|仪器技术|文件名|路径|最后保存的|身份证|韦尔| MethodDescription.ID | MethodDescription.Name | MethodDescription.Tables | MethodDescription.Sections | MethodDescription.Parameters | MethodDescription.Sections.ID |MethodDescription.Sections,.Name| MethodDescription.Sections.Tables | MethodDescription.Sections.Sections | MethodDescription.Sections.Parameters |
| - -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|- -----|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|南|南|南|南|南|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|方法_信息|方法信息|[]|[]|['ID':'ID_Method_O|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|默认链|方法参数|[]|['ID':'SignalAlign| []|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|方法_校准Cu|校准曲线|['ID':'校准|[]|[]|
| 0| 44547fcf-a26e-49a1-a| 62abc4ca-bf13-4f27-b| 30cc3932-ef93-4aec-b| 2022-0722-1502-14947|名称氢定量|pmx方法|0000000 - 0000-0000-0| 2019-03-27 22:01:27.|系统|2022-10-12 04:17:32.|用户名|[]|通用|GC/LC定量|无|名称氢定量|/project/path/|2022-10-14 05:25:42.|粤ICP备16036666号B| 0|无|方法|[]|['ID':'方法_信息|[]|方法_参考Chro|参比色谱图|['ID':'参考Ch| []|[]|

31moq8wy

31moq8wy1#

使用以下玩具数据框:

import pandas as pd

df = pd.DataFrame(
    {
        "A": [
            1,
            2,
            3,
        ],
        "B": ["a", "b", "c"],
        "C": [{"U": 60, "V": 94}, {"U": 72, "V": 18}, {"U": 78, "V": 86}],
        "D": [
            [{"W": 88, "X": 21}, {"Y": 12, "Z": 63}],
            [{"W": 11, "X": 46}, {"Y": 49, "Z": 22}],
            [{"W": 15, "X": 9}, {"Y": 89, "Z": 17}],
        ],
    }
)

print(df)
# Output

   A  B                   C                                         D
0  1  a  {'U': 60, 'V': 94}  [{'W': 88, 'X': 21}, {'Y': 12, 'Z': 63}]
1  2  b  {'U': 72, 'V': 18}  [{'W': 11, 'X': 46}, {'Y': 49, 'Z': 22}]
2  3  c  {'U': 78, 'V': 86}   [{'W': 15, 'X': 9}, {'Y': 89, 'Z': 17}]

下面是另一种方法:

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1)

然后:

A  B   U   V     W     X     Y     Z
0  1  a  60  94  88.0  21.0   NaN   NaN
1  1  a  60  94   NaN   NaN  12.0  63.0
2  2  b  72  18  11.0  46.0   NaN   NaN
3  2  b  72  18   NaN   NaN  49.0  22.0
4  3  c  78  86  15.0   9.0   NaN   NaN
5  3  c  78  86   NaN   NaN  89.0  17.0

相关问题