pandas: How do I put values from one DataFrame into another DataFrame?

Posted by ni65a41a on 2023-01-15 in Other

The input DataFrame looks like this:

    options  text
1   Stem     I am good
2   Option1  Yes
3   Option2  No
4   Option3  Maybe
5   Option4  ok

The output DataFrame should look like this:

Stem   Stem_Value  All_4_Options_Appended
Stem1  I am Good   'Yes','No','Maybe','ok'

I have more than a million items like this, so I am looking for something function-like that I can apply to all of them.

Answer #1 (wnavrhmk)

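The main trick here is to build g, a grouping column/Series that marks which rows belong together. Once you have it, you can collect all the text values in each group into a list and then combine them with the vectorized method .str.join(' ').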
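To make the grouping step concrete, here is a minimal sketch of how the key g comes out of `eq` plus `cumsum`. The toy frame and the column names `is_stem` and `group` are made up for illustration; only `options` and `text` come from the question.

```python
import pandas as pd

# Toy data mirroring the question's layout (two items, two options each).
df = pd.DataFrame({
    "options": ["Stem", "Option1", "Option2", "Stem", "Option1", "Option2"],
    "text": ["I am good", "Yes", "No", "Not great", "No", "Yes"],
})

is_stem = df["options"].eq("Stem")  # True on every row that starts a new item
g = is_stem.cumsum()                # running count of stems = group id per row

# Show the flag and the resulting group id side by side.
print(pd.DataFrame({"options": df["options"], "is_stem": is_stem, "group": g}))
```

Every row between one Stem and the next gets the same group id, which is what lets a plain groupby collect an item's rows together.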

    • On smaller datasets, this approach should be faster than the .agg / .apply approaches (edit)

Try the following steps (the step-by-step explanation is in the code comments):

s = 'Stem'                                            # Start group for string
g = df['options'].eq(s).cumsum()                      # Create groups based cumsum
o = df.groupby(g)['text'].apply(list).str.join(' ')   # Groupby and combine text to list of texts
o = o.reset_index()                                   # Reset index to get group column
o['options'] = s + o['options'].astype(str)           # Prefix column with Stem
o.columns = ['Stems','All_4_options_Appended']        # Change column names
print(o)
   Stems                             All_4_options_Appended
0  Stem1  It's the beginning of the quarter, and you're ...
1  Stem2  It's the beginning of the quarter, and you're ...

Benchmarks

Solution by @Akshay Sehgal

%%timeit

s = 'Stem'
g = df['options'].eq(s).cumsum()
o = df.groupby(g)['text'].apply(list).str.join(' ')
o = o.reset_index()
o['options'] = s + o['options'].astype(str)
o.columns = ['Stems','All_4_options_Appended']
o

#686 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Solution by @Mozway

%%timeit

m = df['options'].eq('Stem')

out = (df.groupby(m.cumsum().astype(str).radd('Stem'))
         .agg(All_4_Options_Appended=('text', ';'.join))
         .rename_axis('Stems').reset_index()
       )

out

#1.44 ms ± 8.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
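The timings above were taken on a very small frame. A rough way to re-check them on something closer to the real workload is sketched below. The synthetic data, the value of `n_items`, and the helper names `list_then_join` and `agg_join` are assumptions for illustration, and both helpers use the same separator so their results are comparable.

```python
import timeit
import pandas as pd

n_items = 2_000  # hypothetical size; raise it to approximate the real workload
df_big = pd.DataFrame({
    "options": ["Stem", "Option1", "Option2", "Option3", "Option4"] * n_items,
    "text": ["I am good", "Yes", "No", "Maybe", "ok"] * n_items,
})

def list_then_join(df, sep=" "):
    # Collect each group's texts into a list, then join with the .str accessor.
    g = df["options"].eq("Stem").cumsum()
    return df.groupby(g)["text"].apply(list).str.join(sep)

def agg_join(df, sep=" "):
    # Join each group's texts directly inside the aggregation.
    g = df["options"].eq("Stem").cumsum()
    return df.groupby(g)["text"].agg(sep.join)

for fn in (list_then_join, agg_join):
    seconds = timeit.timeit(lambda: fn(df_big), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per run")
```

Which variant wins may depend on the pandas version and the data size, so it is worth measuring on a sample of the actual data before committing to one.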

EDIT: Setup for running the code above:

import pandas as pd

d = {
  'options': ['Stem', 'Option1', 'Option2', 'Option3', 'Option4', 'Stem', 'Option1', 'Option2', 'Option3', 'Option4' ],
  'text': ['I am good', 'Yes', 'No', 'Maybe', 'ok', 'Not great', 'No', 'No', 'Yes', 'No',]
 }
df = pd.DataFrame(d)

s = 'Stem'
g = df['options'].eq(s).cumsum()
o = df.groupby(g)['text'].apply(list).str.join(' ')
o = o.reset_index()
o['options'] = s + o['options'].astype(str)
o.columns = ['Stems','All_4_options_Appended']
print(o)
   Stems     All_4_options_Appended
0  Stem1  I am good Yes No Maybe ok
1  Stem2     Not great No No Yes No

Judging by your expected output, it looks like you want each string kept separate. In that case this version may suit you better.

s = 'Stem'
g = df['options'].eq(s).cumsum()
o = df.groupby(g)['text'].apply(list)         #<--- modified here
o = o.reset_index()
o['options'] = s + o['options'].astype(str)
o.columns = ['Stems','All_4_options_Appended']
print(o)
   Stems           All_4_options_Appended
0  Stem1  [I am good, Yes, No, Maybe, ok]
1  Stem2     [Not great, No, No, Yes, No]
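Neither version above reproduces the three-column layout from the question (Stem, Stem_Value, All_4_Options_Appended) exactly. One possible way to get there with the same cumsum grouping is sketched below. It assumes every item has exactly one Stem row followed by its options, and the intermediate names `quoted` and `options` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "options": ["Stem", "Option1", "Option2", "Option3", "Option4"] * 2,
    "text": ["I am good", "Yes", "No", "Maybe", "ok",
             "Not great", "No", "No", "Yes", "No"],
})

m = df["options"].eq("Stem")  # True on stem rows
g = m.cumsum()                # group id per row

# Wrap each option text in single quotes, then join per group with commas.
quoted = "'" + df.loc[~m, "text"] + "'"
options = quoted.groupby(g[~m]).agg(",".join)

# Build one output row per stem row; map() looks up the joined options by group id.
out = pd.DataFrame({
    "Stem": "Stem" + g[m].astype(str),
    "Stem_Value": df.loc[m, "text"],
    "All_4_Options_Appended": g[m].map(options),
}).reset_index(drop=True)

print(out)
```

Keeping the stem text in its own column avoids mixing it into the joined options, which is what the question's expected output shows.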

Answer #2 (b4lqfgs4)

Use groupby.agg with a custom aggregation, and use cumsum to generate the common grouper:

m = df['options'].eq('Stem')

out = (df[~m].groupby(m.cumsum().astype(str).radd('Stem'))
         .agg(All_4_Options_Appended=('text', ','.join))
         .rename_axis('Stems').reset_index()
       )

Output (original example):

   Stems                             All_4_Options_Appended
0  Stem1  Describe the most expensive option first, and ...
1  Stem2  Describe the most expensive option first, and ...

Output (new example):

   Stems  All_4_Options_Appended
0  Stem1         Yes,No,Maybe,ok
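The `df[~m]` filter in this answer is what keeps the stem text out of the joined string. The grouper is still built from the full-length mask, and pandas aligns a Series key to the filtered frame by index. A small sketch of the difference follows, using made-up toy data and the illustrative names `key`, `with_stem`, and `without_stem`.

```python
import pandas as pd

df = pd.DataFrame({
    "options": ["Stem", "Option1", "Option2", "Stem", "Option1", "Option2"],
    "text": ["I am good", "Yes", "No", "Not great", "No", "Yes"],
})

m = df["options"].eq("Stem")
key = m.cumsum().astype(str).radd("Stem")  # 'Stem1' for the first item, 'Stem2' for the second

# Grouping the full frame keeps the stem text as the first joined element.
with_stem = df.groupby(key)["text"].agg(",".join)

# Dropping stem rows first leaves only the option texts in each group.
without_stem = df[~m].groupby(key)["text"].agg(",".join)

print(with_stem)
print(without_stem)
```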
