执行逐行解聚合时字段名到值的棘手转换(使用Pandas)

vlurs2pr  于 2023-04-10  发布在  其他
关注(0)|答案(3)|浏览(127)

我有一个数据集,我想将特定的字段名称转换为值,同时将值分解为自己的唯一行,并执行长透视。

数据

Start       Date        End         Area    Final       Type    Middle Stat Low Stat    High Stat  Middle Stat1 Low Stat1    High Stat1
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      226         20          10         0             0            0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      130         50          0          0             0            0




data = {
    "Start": ['8/1/2013', '8/1/2013'],
    "Date": ['9/1/2013', '9/1/2013'],
    "End": ['10/1/2013', '10/1/2013'],
    "Area": ['NY', 'CA'],
    "Final": ['3/1/2023', '3/1/2023'],
    "Type": ['CC', 'AA'],
    "Middle Stat": [226, 130],
    "Low Stat": [20, 50],
    "High Stat": [10, 0],
    "Middle Stat1": [0, 0],
    "Low Stat1": [0, 0],
    "High Stat1": [0, 0]
}

期望

Start       Date        End         Area    Final       Type    Stat    Range   Stat1
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      20      Low     0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      50      Low     0
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      226     Middle  0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      130     Middle  0
8/1/2013    9/1/2013    10/1/2013   NY      3/1/2023    CC      10      High    0
8/1/2013    9/1/2013    10/1/2013   CA      3/1/2023    AA      0       High    0

我相信我必须注入某种从宽到长的方法,(SO成员协助)但是不确定如何将其合并,同时在目标(感兴趣的列)列名中具有相同的后缀。

pd.wide_to_long(df, 
                stubnames=['Low','Middle','High'],
                i=['Start','Date','End','Area','Final'],
                j='',
                sep=' ',
                suffix='(stat)'
).unstack(level=-1, fill_value=0).stack(level=0).reset_index()

任何建议都很感激。

#原始数据集

import pandas as pd

# create DataFrame
data = {'Start': ['9/1/2013', '10/1/2013', '11/1/2013', '12/1/2013'],
        'Date': ['10/1/2016', '11/1/2016', '12/1/2016', '1/1/2017'],
        'End': ['11/1/2016', '12/1/2016', '1/1/2017', '2/1/2017'],
        'Area': ['NY', 'NY', 'NY', 'NY'],
        'Final': ['3/1/2023', '3/1/2023', '3/1/2023', '3/1/2023'],
        'Type': ['CC', 'CC', 'CC', 'CC'],
        'Low Stat': ['', '', '', ''],
        'Low Stat1': ['', '', '', ''],
        'Middle Stat': ['0', '0', '0', '0'],
        'Middle Stat1': ['0', '0', '0', '0'],
        'Re': ['','','',''],
        'Set': ['0', '0', '0', '0'],
        'Set2': ['0', '0', '0', '0'],
        'Set3': ['0', '0', '0', '0'],
        'High Stat': ['', '', '', ''],
        'High Stat1': ['', '', '', '']}

df = pd.DataFrame(data)
11dmarpk

11dmarpk1#

df.melt(id_vars=df.columns[:6], value_name='Values')
Start      Date        End Area     Final Type     variable  Values
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC  Middle Stat    226
1  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA  Middle Stat    130
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC    Low Stat      20
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA    Low Stat      50
4  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC   High Stat      10
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA   High Stat       0
rt4zxlrg

rt4zxlrg2#

一个选项是使用pivot_longer from pyjanitor-在这种情况下,我们使用特殊的占位符.value来标识列中我们希望保留为标题的部分,而其余部分则被整理到新列中:

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index = slice('Start', 'Type'), 
    names_to = ("Range", ".value"), 
    names_sep = " ")
)
      Start      Date        End Area     Final Type   Range  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC  Middle   226      0
1  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA  Middle   130      0
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC     Low    20      0
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA     Low    50      0
4  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023   CC    High    10      0
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023   AA    High     0      0
kx5bkwkv

kx5bkwkv3#

您可以尝试先重命名列:

import re

df = df.rename(columns=lambda x: re.sub(r'(Low|Middle|High) Stat', r'Stat\1', x))

x = pd.wide_to_long(df,
                stubnames='Stat',
                i=['Start','Date','End','Area','Final'],
                j='Range', suffix=r'(?:Low|Middle|High)').reset_index()
print(x)

图纸:

Start      Date        End Area     Final   Range Type  Stat
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023  Middle   CC   226
1  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023     Low   CC    20
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023    High   CC    10
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023  Middle   AA   130
4  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023     Low   AA    50
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023    High   AA     0

编辑:要有更多的存根名称:

import re

df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat$", r"Stat\1", x))
df = df.rename(columns=lambda x: re.sub(r"(Low|Middle|High) Stat1$", r"1Stat\1", x))

x = (
    pd.wide_to_long(
        df,
        stubnames=["Stat", "1Stat"],
        i=["Start", "Date", "End", "Area", "Final"],
        j="Range",
        suffix=r"(?:Low|Middle|High)",
    )
    .reset_index()
    .rename(columns={"1Stat": "Stat1"})
)

图纸:

Start      Date        End Area     Final   Range Type  Stat  Stat1
0  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023  Middle   CC   226      0
1  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023     Low   CC    20      0
2  8/1/2013  9/1/2013  10/1/2013   NY  3/1/2023    High   CC    10      0
3  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023  Middle   AA   130      0
4  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023     Low   AA    50      0
5  8/1/2013  9/1/2013  10/1/2013   CA  3/1/2023    High   AA     0      0

相关问题