Regex capture group extracting elements of a list in a sentence

Posted by sbdsn5lh on 2023-10-22 in Other


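I have a list of sentences, some of which contain elements written as a list within the sentence:
| index | sentence |
| --|--|
| 0 | You can get cars, trucks, planes, and boats. |
| 1 | You can get the car, truck, and plane. |
| 2 | You should ignore this sentence. |
I only want to extract elements from sentences that start with "You can get" or "You can get the". I would like to use the pandas extractall method, so that each element of the list in the sentence is extracted as its own match.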
Desired output:
| index | match | object |
| --|--|--|
| 0 | 0 | car |
| | 1 | truck |
| | 2 | plane |
| | 3 | boat |
| 1 | 0 | car |
| | 1 | truck |
| | 2 | plane |
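I have three main questions:
1. How do I use a lookbehind such as (?<=[Y|y]ou can get ) so that it does not capture "the"?
2. How do I include a lookahead such as \w+(?=s)? so that both the plural and singular forms of the elements are captured?
3. Is it possible to write a capture group that also extracts each word as a separate element, or should I first extract the list from the sentence (e.g. cars, trucks, planes, and boats) and then run another regex?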

regex

Source: https://stackoverflow.com/questions/77036056/regex-capture-group-extracting-elements-of-a-list-in-a-sentence

3 Answers


l0oc07j2 #1

How about using:

df.loc[df['sentence'].str.startswith('You can get '),
       'sentence'].str.extractall(r'(?P<object>\S+?)s?\b(?:,|\.$)')

Output:

        object
  match       
0 0        car
  1      truck
  2      plane
  3       boat
1 0        car
  1      truck
  2      plane
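
For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample DataFrame is reconstructed from the question's table and the column name `sentence` is taken from the snippet above, so both are assumptions about the asker's real data. It uses `\w+?` rather than `\S+?` and escapes the final period, which is a small variation on the answer's pattern, not the answer's exact regex.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real data).
df = pd.DataFrame({
    "sentence": [
        "You can get cars, trucks, planes, and boats.",
        "You can get the car, truck, and plane.",
        "You should ignore this sentence.",
    ]
})

# Keep only rows with the required prefix; this also covers "You can get the".
mask = df["sentence"].str.startswith("You can get ")

# Lazily capture a word, drop one optional trailing "s", and require the word
# to be followed by a comma or by the sentence-final period.
pattern = r"(?P<object>\w+?)s?\b(?:,|\.$)"

# extractall returns one row per match, indexed by (original index, match number).
result = df.loc[mask, "sentence"].str.extractall(pattern)
print(result)
```

Note that the prefix filter matters: without it, "sentence." in the last row would also match. The `s?` trick is also naive, since a word that legitimately ends in "s" (for example "bus") would lose its final letter.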


nbnkbykc #2

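A more reliable approach is to filter the rows that start with "You can get ", then remove the optional article "the" and any and/or conjunctions, and extract the remaining words: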

(df[df['sentence'].str.startswith('You can get ')]['sentence']
 .str.replace(r'\b(You can get |the|and|or)\b', '', regex=True)
 .str.extractall(r'(\w+)'))

              0
  match        
0 0        cars
  1      trucks
  2      planes
  3       boats
1 0         car
  1       truck
  2       plane
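
On the question's first and third points, a two-step approach can be sketched with the standard `re` module. Python's `re` (which pandas uses under the hood) only accepts fixed-width lookbehind, so an optional "the " cannot go inside `(?<=...)`; the sketch below consumes the prefix instead and captures the rest. The helper name `extract_items` and the trailing-"s" stripping are illustrative assumptions, not part of the answer above.

```python
import re

# Step 1: consume the prefix (with an optional "the ") and capture the list part.
# A variable-width lookbehind such as (?<=You can get (?:the )?) is rejected by re.
PREFIX = re.compile(r"^You can get (?:the )?(?P<items>.+?)\.?$")

# Step 2: split the captured list on commas and on the conjunctions "and"/"or".
SEPARATOR = re.compile(r",\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def extract_items(sentence):
    """Return the list elements of a matching sentence, or [] if it does not match."""
    m = PREFIX.match(sentence)
    if m is None:
        return []
    parts = SEPARATOR.split(m.group("items"))
    # Naive singularization: strip one trailing "s" (wrong for words like "bus").
    return [re.sub(r"s$", "", p.strip()) for p in parts if p.strip()]


for s in [
    "You can get cars, trucks, planes, and boats.",
    "You can get the car, truck, and plane.",
    "You should ignore this sentence.",
]:
    print(extract_items(s))
```

Splitting in a second step keeps each regex simple, at the cost of not being a single `extractall` call.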


x6h2sr28 #3

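I think you should first filter all sentences that satisfy the first condition ([Y|y]ou can get), and then use a regular expression to extract all the parts you need.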

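((?:car|truck|plane|boat|box|basis)(?:\w+(?=s|es)?)?)

With this you can try to capture both the singular and plural forms, but you must predefine the list of words to capture. Test sentence: *You can get cars trucks planes boats boxes bases*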
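
The predefined-word idea can be sketched as follows. This is a simplified variant, not the regex above: the word list, the test string, and the choice to match an optional "s"/"es" suffix outside the capture group are all assumptions made for illustration. Irregular plurals such as basis/bases are left out because they would need explicit handling.

```python
import re

# Hypothetical vocabulary, loosely based on the answer's word list.
words = ["car", "truck", "plane", "boat", "box"]

# The capture group holds the singular stem; the optional (?:es|s) suffix is
# matched but not captured, and \b keeps e.g. "cart" from matching as "car".
pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")(?:es|s)?\b")

text = "You can get cars trucks plane boats boxes"
# With a single capture group, findall returns the captured stems only.
print(pattern.findall(text))
```

The upside of a fixed vocabulary is that unrelated words never match; the downside is that every new item has to be added to the list by hand.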

