Python: generating hierarchical data from a pandas DataFrame to a list

Posted by nkkqxpd9 on 2023-05-05 in Python

I have data in this form:

data = [
    [2019, "July", 8, '1.2.0', 7.0, None, None, None],
    [2019, "July", 10, '1.2.0', 52.0, "Breaking", 6.0, 'Path Removed w/o Deprecation'],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, 'Request Parameter Removed'],
    [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None],
    [2019, 'August', 25, '2.0.0', 200.0, 'Non-breaking', None, None],
]

Each row of the list is ordered by this hierarchy: Year, Month, Day, info_version, API_changes, type1, count, content
I want to generate this hierarchical tree structure from the data:

{
  "name": "2020", # this is year
  "children": [
    {
      "name": "July", # this is month
      "children": [
        {
          "name": "10",   #this is day
          "children": [
            {
              "name": "1.2.0",   # this is info_version
              "value": 52,        # this is value of API_changes(always a number)
              "children": [
                {
                  "name": "Breaking",   # this is type1 column( it is string, it is either Nan or Breaking)
                  "value": 6,                   # this is value of count
                  "children": [
                    {
                      "name": "Path Removed w/o Deprecation",      #this is content column
                      "value": 6        # this is value of count
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

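All other months continue in the same format. I do not want my data modified in any way; this is the shape it needs for my use case (charting). I do not know how to achieve this, and any suggestions would be appreciated.
This follows the data format of the sunburst chart in pyecharts.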

qcuzuvrc, answer #1

First, build a nested dict keyed by every level of the hierarchy, then construct the name/children structure recursively:

from collections import defaultdict

def to_keys(values):
    if isinstance(values, tuple):
        return {"name": values[0], "value": values[1]}
    return {"name": values}    

def to_children(values):
    if isinstance(values, list):
        return [to_children(item) for item in values]
    if isinstance(values, tuple):
        return to_keys(values)
    if isinstance(values, dict):
        return [{**to_keys(key), "children": to_children(value)}
                for key, value in values.items()]
    raise Exception("invalid type")

gen = lambda: defaultdict(gen)
result = defaultdict(gen)

data = [
    [2019, "July", 10, '1.2.0', 52.0, 'Breaking', 6, None],
    [2019, "July", 10, '1.2.0', 52.0, "Breaking", 6.0, 'Path Removed w/o Deprecation'],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, 'Request Parameter Removed'],
    [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None],
    [2019, 'August', 25, '2.0.0', 200.0, 'Non-breaking', None, None],
]

for year, month, day, info_version, api_changes, type1, count, content in data:
    result[year][month][day][(info_version, api_changes)].setdefault((type1, count), []).append((content, count))

final_result = to_children(result)
print(final_result)
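
The rows with `None` in `type1`, `count` or `content` produce nodes such as `{"name": None, "value": None}` in the result above, which the tree in the question does not show. If those should be left out, one option is a small post-processing pass. This is only a sketch, not part of the original answer; `is_missing` and `prune_missing` are hypothetical helper names, and the sample tree is a hand-written fragment shaped like the output of `to_children`.

```python
import math

def is_missing(value):
    """True for None and float NaN, the two ways a gap can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def prune_missing(nodes):
    """Return a copy of a node list without missing names, values or empty children."""
    cleaned = []
    for node in nodes:
        if is_missing(node.get("name")):
            continue  # drop the node together with its subtree
        # Keep every attribute except "children" whose value is present.
        kept = {k: v for k, v in node.items()
                if k != "children" and not is_missing(v)}
        children = prune_missing(node.get("children", []))
        if children:
            kept["children"] = children
        cleaned.append(kept)
    return cleaned

# Hand-written fragment for the row [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None]
sample = [
    {"name": 2019, "children": [
        {"name": "August", "children": [
            {"name": 20, "children": [
                {"name": "2.0.0", "value": 100.0, "children": [
                    {"name": "Breaking", "value": None, "children": [
                        {"name": None, "value": None},
                    ]},
                ]},
            ]},
        ]},
    ]},
]

pruned = prune_missing(sample)
print(pruned)
```

The input list is not modified; `prune_missing` builds new dicts, so the unpruned tree stays available if the chart needs it.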

k97glaaz, answer #2

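Assume the header is known and ordered hierarchically, and that the columns are grouped into levels as described below (see the datetime documentation for the usage of `strptime`):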

from datetime import datetime
hierarchical_description = [
    ([("name", "Year")], lambda d: int(d["name"])),
    ([("name", "Month")], lambda d: datetime.strptime(d["name"], "%B").month),
    ([("name", "Day")], None),
    ([("name", "info_version"), ("value", "API_changes")], None),
    (
        [
            ("name", "type1"),
            ("value", "count"),
        ],
        None,
    ),
    ([("name", "content"), ("value", "count")], None),
]

and the DataFrame is loaded as follows:

import pandas as pd

data = [
    [2019, "July", 8, "1.2.0", 7.0, None, None, None],
    [2019, "July", 10, "1.2.0", 52.0, "Breaking", 6.0, "Path Removed w/o Deprecation"],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, "Request Parameter Removed"],
    [2019, "August", 20, "2.0.0", 100.0, "Breaking", None, None],
    [2019, "August", 25, "2.0.0", 200.0, "Non-breaking", None, None],
]

hierarchical_order = [
    "Year",
    "Month",
    "Day",
    "info_version",
    "API_changes",
    "type1",
    "count",
    "content",
]

df = pd.DataFrame(
    data,
    columns=hierarchical_order,
)

A recursive function can then walk the DataFrame level by level:

def logical_and_df(df, conditions):
    if len(conditions) == 0:
        return df
    colname, value = conditions[0]
    return logical_and_df(df[df[colname] == value], conditions[1:])

def get_hierarchical_data(df, description):
    if len(description) == 0:
        return []

    children = []
    parent_description, sorting_function_key = description[0]
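    for colvalues, subdf in df.groupby([colname for _, colname in parent_description]):
        # pandas before 2.0 yields a scalar key for a one-column list; normalize to a tuple.
        if not isinstance(colvalues, tuple):
            colvalues = (colvalues,)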
        attributes = {
            key: value for (key, _), value in zip(parent_description, colvalues)
        }
        grand_children = get_hierarchical_data(
            logical_and_df(
                subdf,
                [
                    (colname, value)
                    for (_, colname), value in zip(parent_description, colvalues)
                ],
            ),
            description[1:],
        )
        if len(grand_children) > 0:
            attributes["children"] = grand_children

        children.append(attributes)

    if sorting_function_key is None:
        return children
    return sorted(children, key=sorting_function_key)

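The function logical_and_df takes a DataFrame and a list of conditions. Each condition is a pair whose left member is a column name and whose right member is the value required in that column.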
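
For comparison, the same condition-pair filtering can be written without recursion by combining one boolean mask per condition. This is a sketch only; `filter_by_conditions` is a hypothetical name, not part of the answer, and the three-column frame is a cut-down version of the question's data.

```python
import pandas as pd

def filter_by_conditions(df, conditions):
    """Keep the rows where every (column, value) pair in conditions matches."""
    mask = pd.Series(True, index=df.index)  # start with all rows selected
    for colname, value in conditions:
        mask &= df[colname] == value        # AND each condition into the mask
    return df[mask]

small_df = pd.DataFrame(
    [
        [2019, "July", 8],
        [2019, "July", 10],
        [2019, "August", 20],
    ],
    columns=["Year", "Month", "Day"],
)

july_10 = filter_by_conditions(small_df, [("Month", "July"), ("Day", 10)])
print(july_10)
```

An empty condition list leaves the mask all `True`, so every row is returned, which matches the base case of `logical_and_df`.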
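The recursive function get_hierarchical_data takes the hierarchical description as input. The description is a list of tuples. Each tuple consists of a list naming the name/value columns for that level and an optional sort-key function used to sort the list of children. The function returns the children whose name/value attributes come from the first element of the description. If the description is empty, it returns an empty list of children. Otherwise, it uses the pandas groupby method to find the unique name/value pairs (see this post). For each pair it builds a name/value dictionary and attaches the result of the recursive call that finds that node's children.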
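
To illustrate the groupby step described above, here is a small sketch on a few rows taken from the question, reduced to three columns for brevity. Grouping by two columns yields one (key tuple, sub-frame) pair per unique combination. It also shows that groupby drops rows whose key is missing by default, which is why rows with `None` in `type1` get no children at that level; the `dropna=False` argument (available since pandas 1.1) keeps them.

```python
import pandas as pd

df_demo = pd.DataFrame(
    [
        ["1.2.0", 52.0, "Breaking"],
        ["1.2.0", 52.0, "Breaking"],
        ["0.1.0", 210.0, "Breaking"],
        ["2.0.0", 100.0, None],
    ],
    columns=["info_version", "API_changes", "type1"],
)

# One key tuple per unique (info_version, API_changes) pair, in sorted order.
pairs = [key for key, _ in df_demo.groupby(["info_version", "API_changes"])]
sizes = [len(sub) for _, sub in df_demo.groupby(["info_version", "API_changes"])]

# Missing keys are dropped unless dropna=False is passed.
groups_default = df_demo.groupby("type1").ngroups
groups_with_missing = df_demo.groupby("type1", dropna=False).ngroups

print(pairs, sizes, groups_default, groups_with_missing)
```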
The following lines print the resulting structure:

import json
print(json.dumps(get_hierarchical_data(df, hierarchical_description), indent=5))
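
Depending on the pandas version, the group keys may come back as NumPy scalars such as `numpy.int64`, which `json.dumps` cannot serialize on its own. Whether this happens in a given environment is an assumption, not something the answer states. If it does, the `default` parameter of `json.dumps` can convert them; `to_builtin` below is a hypothetical helper and the tree is a hand-written stand-in for the real output.

```python
import json
import numpy as np

def to_builtin(obj):
    """Convert NumPy scalars to plain Python values for json.dumps."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Hand-written stand-in for a tree containing NumPy scalars.
tree = [
    {
        "name": np.int64(2019),
        "children": [{"name": "July", "value": np.float64(52.0)}],
    }
]

text = json.dumps(tree, default=to_builtin, indent=2)
print(text)
```

`json.dumps` only calls `default` for objects it cannot handle itself, so plain ints, floats and strings pass through unchanged.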

First version

My first version did not handle the grouped columns. I have edited this post into the new version above, which should solve your problem.
