import pandas as pd
df = pd.DataFrame( {"name" : ["John", "Eric"],
"days" : [[1, 3, 5, 7], [2,4]]})
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
print(result)
yields
0 1
0 1 John
1 3 John
2 5 John
3 7 John
4 2 Eric
5 4 Eric
In [48]: %timeit using_repeat(df)
1000 loops, best of 3: 834 µs per loop
In [5]: %timeit using_itertuples(df)
100 loops, best of 3: 3.43 ms per loop
In [7]: %timeit using_apply(df)
1 loop, best of 3: 379 ms per loop
In [8]: %timeit using_append(df)
1 loop, best of 3: 3.59 s per loop
Here is the setup used for the benchmark above:
import numpy as np
import pandas as pd

N = 10**3
df = pd.DataFrame({"name": np.random.choice(list('ABCD'), size=N),
                   "days": [np.random.randint(10, size=np.random.randint(5))
                            for i in range(N)]})

def using_itertuples(df):
    # Build one (day, name) tuple per element of each days list.
    return pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    # Repeat each name once per element of its days list, then flatten the lists.
    lens = [len(item) for item in df['days']]
    return pd.DataFrame({"name": np.repeat(df['name'].values, lens),
                         "days": np.concatenate(df['days'].values)})

def using_apply(df):
    # Expand each list into columns, stack them into rows, then join the names back.
    return (df.apply(lambda x: pd.Series(x.days), axis=1)
              .stack()
              .reset_index(level=1, drop=True)
              .to_frame('day')
              .join(df['name']))

def using_append(df):
    # DataFrame.append was removed in pandas 2.0, so this runs only on older versions.
    df2 = pd.DataFrame(columns=df.columns)
    for i, r in df.iterrows():
        for e in r.days:
            new_r = r.copy()
            new_r.days = e
            df2 = df2.append(new_r)
    return df2
>>> df
days name
0 [1, 3, 5, 7] John
1 [2, 4] Eric
>>> lens = [len(item) for item in df['days']]
>>> pd.DataFrame( {"name" : np.repeat(df['name'].values,lens),
... "days" : np.hstack(df['days'])
... })
days name
0 1 John
1 3 John
2 5 John
3 7 John
4 2 Eric
5 4 Eric
import pandas as pd #import
x2 = x.days.apply(lambda x: pd.Series(x)).unstack()  # expand the lists into columns, then unstack them into a Series, x2
x.drop('days', axis=1).join(pd.DataFrame(x2.reset_index(level=0, drop=True)))  # drop the days column, then join x2 back on the index
In [139]: (df.apply(lambda x: pd.Series(x.days), axis=1)
.....: .stack()
.....: .reset_index(level=1, drop=1)
.....: .to_frame('day')
.....: .join(df['name'])
.....: )
Out[139]:
day name
0 1 John
0 3 John
0 5 John
0 7 John
8 answers
vatpfxk51#
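You can use df.itertuples to iterate over each row, and a list comprehension to reshape the data into the desired form. This yields one row per element of each days list.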
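Divakar's solution, using_repeat, is the fastest in the timings above. The setup used for that benchmark is the code that defines N, df, and the four using_* functions.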
ebdffaop2#
New in pandas 0.25: you can use the explode() function, which turns each element of a list-like column into its own row.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html
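A minimal sketch of the explode() approach, assuming pandas 0.25 or later and the two-row John/Eric frame from the question. The variable names exploded and flat are illustrative only and are not taken from the answer.

```python
import pandas as pd

# The example frame from the question: one list of days per person.
df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# explode() emits one row per list element and repeats the original index label.
exploded = df.explode("days")

# Reset the index if a fresh 0..n-1 index is wanted.
flat = exploded.reset_index(drop=True)
print(flat)
```

The exploded column comes back with object dtype, so a cast such as flat["days"].astype(int) may be needed before doing arithmetic on it.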
nnsrf1az3#
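Here's an approach using NumPy.
As pointed out in @unutbu's solution, np.concatenate(df['days'].values) would be faster than np.hstack(df['days']). It uses a list comprehension to extract the length of each 'days' element, which should add minimal runtime. Sample run: the >>> session above, which builds lens and passes it to np.repeat.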
ds97pgxw4#
A 'native' pandas solution: we unstack the column into a Series, then join it back on the index:
gijlo24d5#
Another solution:
mxg2im7a6#
Something like this:
qvk1mo1f7#
Thanks to Divakar's solution, I wrote this as a wrapper function to flatten a column, handling np.nan and DataFrames with multiple columns.
xkftehaa8#
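If you ended up here searching for a multi-column solution:
The command:
will give the output:
Note that the "exploded" columns must have the same length.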
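A minimal sketch of exploding several columns at once, assuming pandas 1.3 or later, where DataFrame.explode accepts a list of column names. The frame and its hours column are made up for illustration, and the call may differ from the command this answer used.

```python
import pandas as pd

# Hypothetical frame: days and hours hold lists of equal length within each row.
df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3], [2, 4, 6]],
                   "hours": [[8, 9], [5, 6, 7]]})

# Explode both list columns together so their elements stay paired row by row.
flat = df.explode(["days", "hours"]).reset_index(drop=True)
print(flat)
```

The lists in a given row need the same number of elements in every exploded column, which is the length requirement noted above.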