python-3.x: Check whether elements of a list appear in a DataFrame column and, if so, create a new column with a label

Posted by r1zhe5dt on 2023-06-25 in Python

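I have the DataFrame below and two lists. I want to check whether any item from list1 appears in the Description column and, if so, create a new column with the label "weather". For list2, the label should be "equipment".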

list1 = ['wind','air']
list2 = ['crane','machine']

df

Description
There was a heavy wind due to cyclone.
Pollution hamper the air quality.
The machine failure was due to short circuit.
The game was called off due to wind.
Players played the game very well.
the crane operator took the crane to wrong side

Expected output

Description                                            Label
There was a heavy wind due to cyclone.                 weather
Pollution hamper the air quality.                      weather
The machine failure was due to short circuit.          equipment
The game was called off due to wind.                   weather
Players played the game very well.                     Other
the crane operator took the crane to wrong side.       equipment

I tried the code below, but in the final data every description ends up with the label "Other".

df['Description'] = np.where(df['Description'].str.contains('|'.join(list1)),'weather','Other')
df['Description'] = np.where(df['Description'].str.contains('|'.join(list2)),'equipment','Other')
#1 zfycwa2u

A quick and simple fix for your code is to use numpy.select:

df['Label'] = np.select([df['Description'].str.contains('|'.join(list1)),
                         df['Description'].str.contains('|'.join(list2))],
                        ['weather', 'equipment'], 'Other')
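
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption about how `df` was built from the sample data in the question. Keep in mind that `np.select` uses the first condition that is true, so a row matching both lists would get 'weather'.

```python
import numpy as np
import pandas as pd

list1 = ['wind', 'air']
list2 = ['crane', 'machine']

# Assumed reconstruction of the question's sample data
df = pd.DataFrame({'Description': [
    'There was a heavy wind due to cyclone.',
    'Pollution hamper the air quality.',
    'The machine failure was due to short circuit.',
    'The game was called off due to wind.',
    'Players played the game very well.',
    'the crane operator took the crane to wrong side',
]})

# One boolean mask per label, checked in order
conditions = [
    df['Description'].str.contains('|'.join(list1)),
    df['Description'].str.contains('|'.join(list2)),
]
df['Label'] = np.select(conditions, ['weather', 'equipment'], default='Other')
print(df)
```

Writing the result to a new `Label` column, rather than back into `Description`, is what keeps the second check from running against already-overwritten text.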

However, you can do better.
You can use a regex built automatically from a mapping dictionary:

import re

d = {'weather': ['wind','air'], 'equipment': ['crane','machine']}

# reverse d
d2 = {k: v for v,l in d.items() for k in l}
# {'wind': 'weather', 'air': 'weather',
#  'crane': 'equipment', 'machine': 'equipment'}

pattern = f"({'|'.join(map(re.escape, sorted(d2, key=len, reverse=True)))})"
# '(machine|crane|wind|air)'

df['Label'] = (df['Description'].str.extract(pattern, expand=False)
               .map(d2).fillna('Other')
               )

If a sentence can have several matches:

df['Label'] = (df['Description'].str.extractall(pattern)[0]
               .map(d2)  # matched word -> label
               .groupby(level=0).agg(lambda x: ','.join(x.drop_duplicates()))
               .reindex(df.index, fill_value='Other')  # rows without any match
              )

Output:

                                       Description      Label
0           There was a heavy wind due to cyclone.    weather
1                Pollution hamper the air quality.    weather
2    The machine failure was due to short circuit.  equipment
3             The game was called off due to wind.    weather
4               Players played the game very well.      Other
5  the crane operator took the crane to wrong side  equipment
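
One caveat that applies to any of the substring patterns above: 'wind' also matches inside 'window', and 'air' inside 'repair'. If that matters for your data, a possible variation is to wrap the pattern in `\b` word boundaries. The sketch below uses made-up rows (not from the question) to compare the two; the names `loose` and `strict` are just for illustration.

```python
import re
import pandas as pd

d2 = {'wind': 'weather', 'air': 'weather',
      'crane': 'equipment', 'machine': 'equipment'}

# Hypothetical rows where plain substring matching misfires
s = pd.Series(['He opened the window.',
               'The repair took two days.',
               'Strong wind all day.'])

words = '|'.join(map(re.escape, sorted(d2, key=len, reverse=True)))
loose = f"({words})"          # matches anywhere, even inside longer words
strict = rf"\b({words})\b"    # matches whole words only

loose_labels = s.str.extract(loose, expand=False).map(d2).fillna('Other')
strict_labels = s.str.extract(strict, expand=False).map(d2).fillna('Other')

print(pd.DataFrame({'text': s, 'loose': loose_labels, 'strict': strict_labels}))
```

With the strict pattern only the third row is labelled 'weather'; the first two fall back to 'Other'.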
#2 cetgtptt

It seems to work like this:

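df["Label"] = "Other"
# use .loc rather than chained indexing, which triggers SettingWithCopyWarning
# and may not modify df at all under copy-on-write
df.loc[df["Description"].str.contains('|'.join(list1)), "Label"] = "weather"
df.loc[df["Description"].str.contains('|'.join(list2)), "Label"] = "equipment"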
#3 ccrfmcuu

I would use .apply() for this, which gives you more control over the logic used to fill the label column:

df['label'] = df['Description'].apply(lambda x: 'weather' if any(word in x for word in list1) else ('equipment' if any(word in x for word in list2) else 'other'))
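
To illustrate the "more control" point, the same logic can be moved into a named function, which is easier to extend than a nested lambda. This is only a sketch: the function name `label_description` and the case-insensitive matching are assumptions added for illustration, not part of the question.

```python
import pandas as pd

list1 = ['wind', 'air']
list2 = ['crane', 'machine']

def label_description(text):
    """Return 'weather', 'equipment' or 'other' for one description."""
    lowered = text.lower()  # assumption: matching should ignore case
    if any(word in lowered for word in list1):
        return 'weather'
    if any(word in lowered for word in list2):
        return 'equipment'
    return 'other'

df = pd.DataFrame({'Description': [
    'There was a heavy Wind due to cyclone.',  # capitalised on purpose
    'The machine failure was due to short circuit.',
    'Players played the game very well.',
]})
df['label'] = df['Description'].apply(label_description)
print(df)
```

Note that .apply() calls the function once per row in Python, so on large frames the vectorised str.contains approaches are likely to be faster.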
