pandas: using pd.melt with MultiIndex columns

cyej8jka, posted on 2022-12-02 in Other

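I am having trouble writing readable pandas code, which makes me think I am missing a feature or usage pattern (possibly of pd.melt).
I want to merge two datasets that are very similar. Each has: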

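  • time, the time at which the state changed
  • name and instance, which together form a compound identifier that uniquely identifies the thing a record belongs to
  • finally, a value column, named after the state that changed for that thing at that time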

So an example record from each of the datasets I want to merge looks like this:

  1. dict(time=0, name="a", instance=0, state=1)
  2. dict(time=5, name="a", instance=0, location="london")

I want to merge these two record sets into a single one in which each (name, instance) has its last known state and location at every time:

[
    dict(time=0, name="a", instance=0, state=1, location=np.nan),
    dict(time=5, name="a", instance=0, state=1, location="london"),
]

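To achieve this, I currently use a combination of pd.DataFrame.pivot_table, pd.DataFrame.ffill, pd.DataFrame.melt, and pd.DataFrame.reset_index. It seems to work as expected, but it feels very cumbersome and unreadable, especially once I start using pd.DataFrame.melt.
I feel I am missing some usage of pd.DataFrame.melt, but I am not sure how to apply the documentation to my dataset, which has pd.MultiIndex columns, or whether there is another pandas utility I should be using instead.
If it turns out that melt is not what I should be using, I will update the question title with something more appropriate.
Here is my code: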

import pandas as pd

states = [
    dict(time=0, name="a", instance=0, state=0),
    dict(time=0, name="a", instance=1, state=0),
    dict(time=0, name="a", instance=2, state=0),
    dict(time=0, name="b", instance=1, state=0),
    dict(time=0, name="b", instance=2, state=0),
    dict(time=1, name="a", instance=1, state=1),
    dict(time=2, name="a", instance=2, state=1),
    dict(time=2, name="b", instance=1, state=1),
]

locations = [
    dict(time=0, name="a", instance=0, location="tokyo"),
    dict(time=0, name="a", instance=1, location="tokyo"),
    dict(time=0, name="a", instance=2, location="tokyo"),
    dict(time=0, name="b", instance=1, location="tokyo"),
    dict(time=0, name="b", instance=2, location="tokyo"),
    dict(time=1, name="a", instance=0, location="london"),
    dict(time=1, name="a", instance=2, location="london"),
    dict(time=1, name="b", instance=1, location="london"),
    dict(time=1, name="b", instance=2, location="london"),
    dict(time=1, name="a", instance=1, location="paris"),
    dict(time=2, name="a", instance=2, location="paris"),
    dict(time=2, name="b", instance=1, location="paris"),
]

states = pd.DataFrame.from_dict(states)
locations = pd.DataFrame.from_dict(locations)

combined = pd.concat([states, locations], axis="index")
combined = combined.pivot_table(
    index="time",
    columns=["name", "instance"],
    values=["state", "location"],
    aggfunc="last",
)
combined = combined.ffill()

ugly_melt = combined.melt(ignore_index=False)
ugly_melt = ugly_melt.rename(columns={None: "state_status"})
ugly_melt = (
    ugly_melt.reset_index()
    .pivot(
        index=["time", "name", "instance"],
        columns=["state_status"],
        values="value",
    )
    .reset_index()
)
print(ugly_melt)

lxkprmvk1#

Note that after the ffill, combined looks like this:

         location                                state
name            a                      b             a              b     
instance        0      1       2       1       2     0    1    2    1    2
time                                                                      
0           tokyo  tokyo   tokyo   tokyo   tokyo   0.0  0.0  0.0  0.0  0.0
1          london  paris  london  london  london   0.0  1.0  0.0  0.0  0.0
2          london  paris   paris   paris  london   0.0  1.0  1.0  1.0  0.0

There are two top-level column groups, location and state. You can melt each of them separately and then merge the results.

melt1 = pd.melt(combined["location"], value_name="location", ignore_index=False).reset_index()
melt2 = pd.melt(combined["state"], value_name="state", ignore_index=False).reset_index()
better_melt = melt1.merge(melt2, on=["time", "name", "instance"])
better_melt

    time name  instance location  state
0      0    a         0    tokyo    0.0
1      1    a         0   london    0.0
2      2    a         0   london    0.0
3      0    a         1    tokyo    0.0
4      1    a         1    paris    1.0
5      2    a         1    paris    1.0
6      0    a         2    tokyo    0.0
7      1    a         2   london    0.0
8      2    a         2    paris    1.0
9      0    b         1    tokyo    0.0
10     1    b         1   london    0.0
11     2    b         1    paris    1.0
12     0    b         2    tokyo    0.0
13     1    b         2   london    0.0
14     2    b         2   london    0.0
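
As an alternative to two melt calls plus a merge, the same wide-to-long step can be sketched with DataFrame.stack, which moves column levels into the row index. This is only a sketch under some assumptions: the data is the question's, rewritten as tuples for brevity; long_form is an illustrative name; and whether stack drops all-NaN rows depends on the pandas version, so check the row count on your own data.

```python
import pandas as pd

keys = ["time", "name", "instance"]
states = pd.DataFrame(
    [(0, "a", 0, 0), (0, "a", 1, 0), (0, "a", 2, 0), (0, "b", 1, 0),
     (0, "b", 2, 0), (1, "a", 1, 1), (2, "a", 2, 1), (2, "b", 1, 1)],
    columns=keys + ["state"],
)
locations = pd.DataFrame(
    [(0, "a", 0, "tokyo"), (0, "a", 1, "tokyo"), (0, "a", 2, "tokyo"),
     (0, "b", 1, "tokyo"), (0, "b", 2, "tokyo"), (1, "a", 0, "london"),
     (1, "a", 2, "london"), (1, "b", 1, "london"), (1, "b", 2, "london"),
     (1, "a", 1, "paris"), (2, "a", 2, "paris"), (2, "b", 1, "paris")],
    columns=keys + ["location"],
)

# Same wide table as in the question: one column per (field, name, instance).
combined = (
    pd.concat([states, locations], axis="index")
    .pivot_table(index="time", columns=["name", "instance"],
                 values=["state", "location"], aggfunc="last")
    .ffill()
)

# Move the "name" and "instance" column levels into the row index,
# leaving "location" and "state" as ordinary columns.
long_form = combined.stack(["name", "instance"]).reset_index()
print(long_form)
```

Because the levels are addressed by name, this avoids renaming the unnamed top column level that the melt-based version has to deal with.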

roejwanj2#

The dataset:

import pandas as pd

states = [
    dict(time=0, name="a", instance=0, state=0),
    dict(time=0, name="a", instance=1, state=0),
    dict(time=0, name="a", instance=2, state=0),
    dict(time=0, name="b", instance=1, state=0),
    dict(time=0, name="b", instance=2, state=0),
    dict(time=1, name="a", instance=1, state=1),
    dict(time=2, name="a", instance=2, state=1),
    dict(time=2, name="b", instance=1, state=1),
]

locations = [
    dict(time=0, name="a", instance=0, location="tokyo"),
    dict(time=0, name="a", instance=1, location="tokyo"),
    dict(time=0, name="a", instance=2, location="tokyo"),
    dict(time=0, name="b", instance=1, location="tokyo"),
    dict(time=0, name="b", instance=2, location="tokyo"),
    dict(time=1, name="a", instance=0, location="london"),
    dict(time=1, name="a", instance=2, location="london"),
    dict(time=1, name="b", instance=1, location="london"),
    dict(time=1, name="b", instance=2, location="london"),
    dict(time=1, name="a", instance=1, location="paris"),
    dict(time=2, name="a", instance=2, location="paris"),
    dict(time=2, name="b", instance=1, location="paris"),
]

It looks like time, name, and instance are the levels that index the data, so it makes sense to move them into the index with set_index:

states = (
    pd.DataFrame.from_dict(states)
    .set_index(["time", "name", "instance"])
)
locations = (
    pd.DataFrame.from_dict(locations)
    .set_index(["time", "name", "instance"])
)

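Once the MultiIndex is in place, you can concatenate states and locations along the columns. This leaves some NaN values in the state column. Sort the index so that the rows are ordered by time first, then group the rows that share the same name and instance, and finally forward fill within each group.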

combined = (
    pd.concat([states, locations], axis=1)
    .sort_index()
    .groupby(["name", "instance"])
    .ffill()
)

The result is slightly different from yours. I get:

                   location state
time name instance      
0    a    0        tokyo    0.0
          1        tokyo    0.0
          2        tokyo    0.0
     b    1        tokyo    0.0
          2        tokyo    0.0
1    a    0        london   0.0
          1        paris    1.0
          2        london   0.0
     b    1        london   0.0
          2        london   0.0
2    a    2        paris    1.0
     b    1        paris    1.0

whereas you get:

state_status time name instance location state
0            0    a    0        tokyo    0.0
1            0    a    1        tokyo    0.0
2            0    a    2        tokyo    0.0
3            0    b    1        tokyo    0.0
4            0    b    2        tokyo    0.0
5            1    a    0        london   0.0
6            1    a    1        paris    1.0
7            1    a    2        london   0.0
8            1    b    1        london   0.0
9            1    b    2        london   0.0
10           2    a    0        london   0.0
11           2    a    1        paris    1.0
12           2    a    2        paris    1.0
13           2    b    1        paris    1.0
14           2    b    2        london   0.0

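First, I get a DataFrame with a MultiIndex. If you don't like that, you can call reset_index(). Second, consider for example the row (time, name, instance) = (2, a, 0). None of the input data has this combination of values, which is why it does not appear in my result. It appears in yours because of how pivot_table works. Whether that behavior is desirable is up to you.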
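
If you do want a row for every time and every observed (name, instance) pair, as pivot_table produces, one option is to reindex before the forward fill. The following is a minimal sketch, assuming the question's data (rewritten as tuples for brevity); wide, times, pairs, full_index, and filled are illustrative names, not part of the code above.

```python
import pandas as pd

keys = ["time", "name", "instance"]
states = pd.DataFrame(
    [(0, "a", 0, 0), (0, "a", 1, 0), (0, "a", 2, 0), (0, "b", 1, 0),
     (0, "b", 2, 0), (1, "a", 1, 1), (2, "a", 2, 1), (2, "b", 1, 1)],
    columns=keys + ["state"],
).set_index(keys)
locations = pd.DataFrame(
    [(0, "a", 0, "tokyo"), (0, "a", 1, "tokyo"), (0, "a", 2, "tokyo"),
     (0, "b", 1, "tokyo"), (0, "b", 2, "tokyo"), (1, "a", 0, "london"),
     (1, "a", 2, "london"), (1, "b", 1, "london"), (1, "b", 2, "london"),
     (1, "a", 1, "paris"), (2, "a", 2, "paris"), (2, "b", 1, "paris")],
    columns=keys + ["location"],
).set_index(keys)

wide = pd.concat([states, locations], axis=1).sort_index()

# Cross every observed time with every observed (name, instance) pair.
times = wide.index.get_level_values("time").unique()
pairs = wide.index.droplevel("time").unique()
full_index = pd.MultiIndex.from_tuples(
    [(t, name, inst) for t in times for name, inst in pairs],
    names=keys,
)

# Rows missing from the input become NaN, then get filled per group.
filled = (
    wide.reindex(full_index)
    .groupby(["name", "instance"])
    .ffill()
)
print(filled)
```

The reindex only adds rows for pairs that actually occur in the data, so a pair such as (b, 0), which never appears, is not introduced.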


watbbzwu3#

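You can simply join the two datasets and filter the result, keeping the time constraint in mind (a location row may only be joined with a previous or current state, never a future one).
This removes the need for a forward fill with ffill(), and a join plus a filter is easier to understand than the pd.melt approach.
Initialization code: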

states = [
    dict(time=0, name="a", instance=0, state=0),
    dict(time=0, name="a", instance=1, state=0),
    dict(time=0, name="a", instance=2, state=0),
    dict(time=0, name="b", instance=1, state=0),
    dict(time=0, name="b", instance=2, state=0),
    dict(time=1, name="a", instance=1, state=1),
    dict(time=2, name="a", instance=2, state=1),
    dict(time=2, name="b", instance=1, state=1),
]

locations = [
    dict(time=0, name="a", instance=0, location="tokyo"),
    dict(time=0, name="a", instance=1, location="tokyo"),
    dict(time=0, name="a", instance=2, location="tokyo"),
    dict(time=0, name="b", instance=1, location="tokyo"),
    dict(time=0, name="b", instance=2, location="tokyo"),
    dict(time=1, name="a", instance=0, location="london"),
    dict(time=1, name="a", instance=2, location="london"),
    dict(time=1, name="b", instance=1, location="london"),
    dict(time=1, name="b", instance=2, location="london"),
    dict(time=1, name="a", instance=1, location="paris"),
    dict(time=2, name="a", instance=2, location="paris"),
    dict(time=2, name="b", instance=1, location="paris"),
]

Implementation:

import pandas as pd

"""
Steps:
1. Convert to dataframe (Rename state time as state_time, keep location time as time)
2. Merge both dataframe together
3. Filter state time <= location time (since location uses current/previous state)
4. Filter for latest state time (since location must remember the latest state and not all previous states)
"""

# Step 1
states = pd.DataFrame(states).rename(columns={"time": "state_time"})
locations = pd.DataFrame(locations)

# Step 2
merged_df = pd.merge(locations, states, on=["name", "instance"])

# Step 3
merged_df = merged_df[merged_df["state_time"] <= merged_df["time"]]

# Step 4
merged_df = merged_df\
    .sort_values(["time", "name", "instance", "state_time"])\
    .drop_duplicates(["time", "name", "instance"], keep="last")\
    .reset_index(drop=True)\
    .drop(columns=["state_time"])

This produces the following merged_df:

    time name  instance location  state
0      0    a         0    tokyo      0
1      0    a         1    tokyo      0
2      0    a         2    tokyo      0
3      0    b         1    tokyo      0
4      0    b         2    tokyo      0
5      1    a         0   london      0
6      1    a         1    paris      1
7      1    a         2   london      0
8      1    b         1   london      0
9      1    b         2   london      0
10     2    a         2    paris      1
11     2    b         1    paris      1

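The number of rows in the result is determined by the locations data. If you want every (name, instance) pair to have a row at every time, you can do an outer join beforehand.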
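
The "join each location row with the latest state at or before its time" rule is what pandas calls an as-of join, so the merge, filter, and drop_duplicates steps can also be sketched with pd.merge_asof. This is a sketch under assumptions: it uses the question's data (rewritten as tuples for brevity), relies on merge_asof's default backward direction with exact matches allowed, and asof is an illustrative name.

```python
import pandas as pd

keys = ["time", "name", "instance"]
states = pd.DataFrame(
    [(0, "a", 0, 0), (0, "a", 1, 0), (0, "a", 2, 0), (0, "b", 1, 0),
     (0, "b", 2, 0), (1, "a", 1, 1), (2, "a", 2, 1), (2, "b", 1, 1)],
    columns=keys + ["state"],
)
locations = pd.DataFrame(
    [(0, "a", 0, "tokyo"), (0, "a", 1, "tokyo"), (0, "a", 2, "tokyo"),
     (0, "b", 1, "tokyo"), (0, "b", 2, "tokyo"), (1, "a", 0, "london"),
     (1, "a", 2, "london"), (1, "b", 1, "london"), (1, "b", 2, "london"),
     (1, "a", 1, "paris"), (2, "a", 2, "paris"), (2, "b", 1, "paris")],
    columns=keys + ["location"],
)

# merge_asof requires both frames to be sorted by the "on" key.
# For each location row it picks the last state row with the same
# (name, instance) whose time is <= the location row's time.
asof = pd.merge_asof(
    locations.sort_values("time"),
    states.sort_values("time"),
    on="time",
    by=["name", "instance"],
)
asof = asof.sort_values(keys).reset_index(drop=True)
print(asof)
```

As with the merge-and-filter version, the result has one row per locations row, so times at which a pair has no location record are not filled in.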
