pandas: how to select the columns of a DataFrame whose names are contained in another Series?

gcuhipw9  posted on 2023-09-29  in Other

I have a Series A (the Animal column of a one-column DataFrame) that looks like:

import pandas as pd

data = {'Animal': ['a.Bear', 'b.Elephant', '123.Giraffe', 'Kangaroo']}
A = pd.DataFrame(data)

    Animal
0   a.Bear
1   b.Elephant
2   123.Giraffe
3   Kangaroo

and a DataFrame df:

import random

column_names = ['Lion', 'Tiger', 'Bear', 'Elephant', 'Giraffe', 'Kangaroo', 'Rhino', 'Cat', 'Dog']
data = {animal: [random.random() for _ in range(10)] for animal in column_names}
df = pd.DataFrame(data)

       Lion     Tiger      Bear  Elephant   Giraffe  Kangaroo     Rhino  \
0  0.435419  0.139088  0.799243  0.095464  0.252427  0.300750  0.537184   
1  0.536742  0.798354  0.359454  0.962717  0.900115  0.192034  0.255388   
2  0.400937  0.999050  0.464974  0.082873  0.807442  0.152231  0.888681   
3  0.962247  0.585496  0.826572  0.964859  0.061535  0.661318  0.626811   
4  0.315054  0.241821  0.183458  0.767684  0.932423  0.605995  0.121704   
5  0.975635  0.321856  0.640700  0.269786  0.603920  0.451022  0.202050   
6  0.281994  0.790526  0.074202  0.318642  0.825572  0.006433  0.376935   
7  0.002314  0.599871  0.883832  0.838671  0.193689  0.983202  0.365913   
8  0.488496  0.226901  0.318186  0.527369  0.722069  0.152814  0.181855   
9  0.059592  0.483801  0.419581  0.378362  0.064484  0.263958  0.183479   

        Cat       Dog  
0  0.457674  0.930943  
1  0.171235  0.465397  
2  0.230023  0.732982  
3  0.094517  0.373322  
4  0.885030  0.852047  
5  0.759202  0.521539  
6  0.683882  0.520186  
7  0.635325  0.832302  
8  0.950867  0.395677  
9  0.929706  0.858686

I want to select only the columns of df whose names are contained in the strings of Series A.
I tried:

df.loc[:, A['Animal'].str.contains(df.columns)]

But I get the error:

TypeError: unhashable type: 'Index'

jdg4fx2g  #1

Code:

df.loc[:, df.columns.map(lambda x: x in ' '.join(A['Animal']))]

Output [truncated to 4 elements per Series]:

       Bear  Elephant   Giraffe  Kangaroo
0  0.794328  0.112836  0.357502  0.156082
1  0.840482  0.025965  0.600463  0.408251
2  0.319205  0.239732  0.890557  0.589371
3  0.616569  0.495843  0.244707  0.748728

Intermediate steps:

' '.join(A['Animal'])

'a.Bear b.Elephant 123.Giraffe Kangaroo'

df.columns.map(lambda x: x in ' '.join(A['Animal']))

Index([False, False, True, True, True, True, False, False, False], dtype='bool')
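The steps above can be put together into one self-contained run. This is only a sketch: the fixed seed and the names joined, mask and selected are illustrative assumptions, and the joined string is built once rather than once per column.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed, only so that repeated runs give the same numbers

A = pd.DataFrame({'Animal': ['a.Bear', 'b.Elephant', '123.Giraffe', 'Kangaroo']})
column_names = ['Lion', 'Tiger', 'Bear', 'Elephant', 'Giraffe', 'Kangaroo', 'Rhino', 'Cat', 'Dog']
df = pd.DataFrame({name: [random.random() for _ in range(10)] for name in column_names})

# Build the search string once, then test every column name as a substring of it.
joined = ' '.join(A['Animal'])
mask = [name in joined for name in df.columns]

# A boolean list with one entry per column can be used directly in .loc.
selected = df.loc[:, mask]
print(list(selected.columns))  # ['Bear', 'Elephant', 'Giraffe', 'Kangaroo']
```

Because this is plain substring matching, a column named Cat would also be selected by an entry such as Catfish.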

xriantvc  #2

One option is to preprocess the Series and then select the columns with the result:

df.loc[:, A.Animal.str.split('.').str[-1]]

       Bear  Elephant   Giraffe  Kangaroo
0  0.700352  0.205612  0.102616  0.944342
1  0.890737  0.820959  0.651497  0.479565
2  0.699564  0.531335  0.872938  0.091374
3  0.330110  0.106390  0.612813  0.023788
4  0.438814  0.673884  0.332209  0.858403
5  0.275314  0.225742  0.274267  0.019163
6  0.382985  0.269667  0.412339  0.248712
7  0.803773  0.580038  0.634080  0.859197
8  0.000672  0.231498  0.454456  0.035016
9  0.072687  0.342957  0.300143  0.052512
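Selecting by label like this requires every extracted name to be a column; in recent pandas versions a missing label raises a KeyError. A possible variation, sketched here with hypothetical data in which Zebra is not a column of df, is to turn the extracted names into a boolean mask over the columns with Index.isin:

```python
import pandas as pd

# Hypothetical data: 'Zebra' appears in A but is not a column of df.
A = pd.DataFrame({'Animal': ['a.Bear', 'b.Zebra', 'Kangaroo']})
df = pd.DataFrame({'Lion': [1, 2], 'Bear': [3, 4], 'Kangaroo': [5, 6]})

# Keep the part after the last '.', as in the answer above.
names = A['Animal'].str.split('.').str[-1]

# Boolean mask over the columns: names that are not in df are simply ignored.
selected = df.loc[:, df.columns.isin(names)]
print(list(selected.columns))  # ['Bear', 'Kangaroo']
```

The mask also keeps the columns in the order they have in df rather than the order they have in A.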

The solution above makes some assumptions about the delimiter. A more general approach is to use a list comprehension:

filters = [word for word in df 
           for wording in A.Animal.array 
           if word in wording]
df.loc[:, filters]

       Bear  Elephant   Giraffe  Kangaroo
0  0.700352  0.205612  0.102616  0.944342
1  0.890737  0.820959  0.651497  0.479565
2  0.699564  0.531335  0.872938  0.091374
3  0.330110  0.106390  0.612813  0.023788
4  0.438814  0.673884  0.332209  0.858403
5  0.275314  0.225742  0.274267  0.019163
6  0.382985  0.269667  0.412339  0.248712
7  0.803773  0.580038  0.634080  0.859197
8  0.000672  0.231498  0.454456  0.035016
9  0.072687  0.342957  0.300143  0.052512
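One thing to watch with the nested comprehension: a column name that occurs in more than one entry of A is listed once per match, so the selection would repeat that column. A small sketch with hypothetical data, using any() so each column is kept at most once:

```python
import pandas as pd

# Hypothetical data: 'Bear' occurs in two entries of A.
A = pd.DataFrame({'Animal': ['a.Bear', 'b.Bear', 'Kangaroo']})
df = pd.DataFrame({'Lion': [1, 2], 'Bear': [3, 4], 'Kangaroo': [5, 6]})

# Nested comprehension as in the answer: one entry per (column, match) pair.
nested = [word for word in df
          for wording in A.Animal.array
          if word in wording]
print(nested)  # ['Bear', 'Bear', 'Kangaroo']

# any() stops at the first match, so each column appears at most once.
filters = [word for word in df
           if any(word in wording for wording in A.Animal.array)]
print(filters)  # ['Bear', 'Kangaroo']

selected = df.loc[:, filters]
```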

lztngnrs  #3

I would use a regex for this:

import re

# craft a regex from existing columns
target = '|'.join(map(re.escape, sorted(df.columns, key=len, reverse=True)))
# 'Elephant|Kangaroo|Giraffe|Tiger|Rhino|Lion|Bear|Cat|Dog'

# extract the names
keep = A['Animal'].str.extract(f'({target})', expand=False)

# slice columns in original order
out = df[df.columns.intersection(keep)]
Note: if you only want to match whole words (e.g. Cat should not match Catfish), use fr'\b({target})\b' in extract.

Output:

       Bear  Elephant   Giraffe  Kangaroo
0  0.586369  0.150831  0.016725  0.204938
1  0.941386  0.098238  0.769691  0.117735
2  0.985639  0.528000  0.076075  0.122066
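To illustrate the difference between the plain pattern and the whole-word variant, here is a minimal sketch on hypothetical data in which A contains Catfish and df has a Cat column. The names loose and strict are illustrative only.

```python
import re

import pandas as pd

# Hypothetical data: 'Catfish' contains the column name 'Cat' as a prefix.
A = pd.DataFrame({'Animal': ['a.Bear', 'x.Catfish', 'Kangaroo']})
df = pd.DataFrame({'Bear': [1, 2], 'Cat': [3, 4], 'Kangaroo': [5, 6]})

# Longest names first, so a longer name wins over a shorter name it contains.
target = '|'.join(map(re.escape, sorted(df.columns, key=len, reverse=True)))

# Plain pattern: 'Cat' is found inside 'Catfish'.
loose = A['Animal'].str.extract(f'({target})', expand=False)
# Word boundaries: 'Catfish' no longer matches, so that row becomes NaN.
strict = A['Animal'].str.extract(fr'\b({target})\b', expand=False)

print(list(df.columns.intersection(loose.dropna())))   # ['Bear', 'Cat', 'Kangaroo']
print(list(df.columns.intersection(strict.dropna())))  # ['Bear', 'Kangaroo']
```

Rows with no match come back as NaN from extract, which is why they are dropped before the intersection here.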
