Exploding a pandas DataFrame column into multiple rows

Posted by wqnecbli on 2023-05-12 in Other

If I have a DataFrame such as:

pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]]
              })

which gives this structure:

           days  name
0  [1, 3, 5, 7]  John

How can I expand it to the following?

   days  name
0     1  John
1     3  John
2     5  John
3     7  John

#1 vatpfxk5

You can use df.itertuples to iterate over each row, and a list comprehension to reshape the data into the desired form:

import pandas as pd

df = pd.DataFrame( {"name" : ["John", "Eric"], 
               "days" : [[1, 3, 5, 7], [2,4]]})
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
print(result)

which yields

   0     1
0  1  John
1  3  John
2  5  John
3  7  John
4  2  Eric
5  4  Eric
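
The result above gets the default column labels 0 and 1 because no names are passed. A small sketch of the same list comprehension with explicit labels, assuming the column names "days" and "name" from the question are the ones wanted:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# Same list comprehension as above, with explicit column labels
result = pd.DataFrame(
    [(d, tup.name) for tup in df.itertuples() for d in tup.days],
    columns=["days", "name"],
)
print(result)
```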

Divakar's solution, using_repeat, is the fastest:

In [48]: %timeit using_repeat(df)
1000 loops, best of 3: 834 µs per loop

In [5]: %timeit using_itertuples(df)
100 loops, best of 3: 3.43 ms per loop

In [7]: %timeit using_apply(df)
1 loop, best of 3: 379 ms per loop

In [8]: %timeit using_append(df)
1 loop, best of 3: 3.59 s per loop

Here is the setup used for the benchmarks above:

import numpy as np
import pandas as pd

N = 10**3
df = pd.DataFrame( {"name" : np.random.choice(list('ABCD'), size=N), 
                    "days" : [np.random.randint(10, size=np.random.randint(5))
                              for i in range(N)]})

def using_itertuples(df):
    return  pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    lens = [len(item) for item in df['days']]
    return pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
                          "days" : np.concatenate(df['days'].values)})

def using_apply(df):
    return (df.apply(lambda x: pd.Series(x.days), axis=1)
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame('day')
            .join(df['name']))

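def using_append(df):
    # Note: DataFrame.append was removed in pandas 2.0, so this
    # function only runs on older pandas versions.
    df2 = pd.DataFrame(columns = df.columns)
    for i,r in df.iterrows():
        for e in r.days:
            new_r = r.copy()
            new_r.days = e
            df2 = df2.append(new_r)
    return df2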

#2 ebdffaop

New in pandas 0.25: you can use the explode() function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html

import pandas as pd
df = pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]]})

print(df.explode('days'))

prints

   name days
0  John    1
0  John    3
0  John    5
0  John    7
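
Note that the output above keeps the original row label (0) on every exploded row, and explode leaves the column with object dtype. A minimal sketch of getting the 0..3 index asked for in the question, assuming pandas 1.1 or later for the ignore_index parameter (on 0.25 to 1.0, chaining .reset_index(drop=True) should do the same job):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John"], "days": [[1, 3, 5, 7]]})

# ignore_index=True relabels the exploded rows 0..n-1 (pandas >= 1.1)
result = df.explode("days", ignore_index=True)

# explode leaves the column as object dtype; cast it if integers are needed
result["days"] = result["days"].astype(int)

print(result)
```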

#3 nnsrf1az

Here's something with NumPy -

lens = [len(item) for item in df['days']]
df_out = pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
               "days" : np.hstack(df['days'])
              })

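As pointed out in @unutbu's solution, np.concatenate(df['days'].values) would be faster than np.hstack(df['days']).
It uses a list comprehension to extract the length of each 'days' element, which should add very little runtime.
Sample run -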

>>> df
           days  name
0  [1, 3, 5, 7]  John
1        [2, 4]  Eric
>>> lens = [len(item) for item in df['days']]
>>> pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
...                "days" : np.hstack(df['days'])
...               })
   days  name
0     1  John
1     3  John
2     5  John
3     7  John
4     2  Eric
5     4  Eric
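
For reference, a self-contained sketch of the same approach, using np.concatenate as suggested above and comparing the result against df.explode (pandas 0.25 or later). The two-row frame mirrors the sample run; the variable name exploded is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# The length of each list decides how many times its name is repeated
lens = [len(item) for item in df["days"]]
df_out = pd.DataFrame({"name": np.repeat(df["name"].to_numpy(), lens),
                       "days": np.concatenate(df["days"].to_numpy())})

# Cross-check against the built-in explode
exploded = df.explode("days").reset_index(drop=True)

print(df_out)
print(exploded["days"].tolist() == df_out["days"].tolist())
```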

#4 ds97pgxw

A 'native' pandas solution: we unstack the column into a Series, then join it back on the index:

import pandas as pd
# x is the original DataFrame with a list-valued 'days' column
# expand each list into columns, then unstack into one long Series, x2
x2 = x.days.apply(lambda v: pd.Series(v)).unstack()
# drop the days column and join the x2 Series back on the row index
x.drop('days', axis=1).join(pd.DataFrame(x2.reset_index(level=0, drop=True)))

#5 gijlo24d

Another solution:

In [139]: (df.apply(lambda x: pd.Series(x.days), axis=1)
   .....:    .stack()
   .....:    .reset_index(level=1, drop=True)
   .....:    .to_frame('day')
   .....:    .join(df['name'])
   .....: )
Out[139]:
   day  name
0    1  John
0    3  John
0    5  John
0    7  John

#6 mxg2im7a

Something like this:

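# DataFrame.append was removed in pandas 2.0, so collect the rows
# in a list and build the DataFrame once at the end.
rows = []
for i, r in df.iterrows():
    for e in r.days:
        new_r = r.copy()
        new_r.days = e
        rows.append(new_r)
df2 = pd.DataFrame(rows, columns=df.columns)
df2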

#7 qvk1mo1f

Thanks to Divakar's solution, I wrote this as a wrapper function to flatten a column, handling np.nan and DataFrames with multiple columns:

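import numpy as np
import pandas as pd

def flatten_column(df, column_name):
    # A row holding np.nan instead of a list is kept as a single row
    repeat_lens = [len(item) if item is not np.nan else 1 for item in df[column_name]]
    df_columns = list(df.columns)
    df_columns.remove(column_name)
    # Repeat all other columns once per element of the list column
    expanded_df = pd.DataFrame(np.repeat(df.drop(column_name, axis=1).values, repeat_lens, axis=0), columns=df_columns)
    flat_column_values = np.hstack(df[column_name].values)
    expanded_df[column_name] = flat_column_values
    # np.hstack turns np.nan into the string 'nan' when the lists hold strings
    expanded_df[column_name] = expanded_df[column_name].replace('nan', np.nan)
    return expanded_df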
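
On pandas 0.25 or later, DataFrame.explode appears to cover the same cases as the wrapper above: scalar entries such as np.nan are returned unchanged as a single row, and the other columns are repeated alongside. A minimal sketch, where the "age" column and the extra names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row has np.nan instead of a list
df = pd.DataFrame({"name": ["John", "Eric", "Anna"],
                   "age": [30, 25, 41],
                   "days": [[1, 3], np.nan, [5]]})

# The np.nan row survives as one row; all other columns are repeated
result = df.explode("days").reset_index(drop=True)
print(result)
```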

#8 xkftehaa

If you ended up here searching for a solution for multiple columns:

import pandas as pd
df = pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]],
               "values": [[10,20,30,40]]
              })
print(df)

   name          days            values
0  John  [1, 3, 5, 7]  [10, 20, 30, 40]

The command (exploding several columns at once requires pandas 1.3 or later):

print(df.explode(['days', 'values']))

will give the output:

   name days values
0  John    1     10
0  John    3     20
0  John    5     30
0  John    7     40

Note that, within each row, the lists in the columns being exploded must have the same length.
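
A short sketch of what the length requirement means in practice: when the lists in one row differ in length, explode raises a ValueError instead of padding. The three-element "values" list and the variable name mismatch_error are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John"],
                   "days": [[1, 3, 5, 7]],
                   "values": [[10, 20, 30]]})  # one element short on purpose

# Exploding both columns fails because 4 != 3 elements in the same row
try:
    df.explode(["days", "values"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(mismatch_error)
```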
