regex: How to extract different patterns from a string in R?

Posted by hgb9j2n6 on 2023-01-18 in Other

I want to extract a phrase pattern from the sentences below.

text1 <- "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year."

text2 <- "There is no confirmed audited number of subscribers in the Netflix's earnings report."

text3 <- "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter."

The patterns are "number of subscribers", "audited number of subscribers", and "unaudited number of subscribers".
I used the pattern \\bnumber\\s+of\\s+subscribers?\\b from a previous question (thanks to @wiktor-stribizew) to extract the phrase.

library(stringr)

find_words <- function(text) {
  pattern <- "\\bnumber\\s+of\\s+subscribers?\\b" # something like this

  str_extract(text, pattern)
}

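However, this extracts only the exact phrase "number of subscribers", not the other two patterns.
Expected output:
find_words(text1)
'number of subscribers'
find_words(text2)
'audited number of subscribers'
find_words(text3)
'unaudited number of subscribers'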

50pmv0ei #1

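See whether this works. The optional group (audited |unaudited )? picks up either qualifier when it directly precedes the phrase: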

find_words <- function(text) {
  # Optional "audited " or "unaudited " prefix, then the base phrase
  pattern <- "(audited |unaudited )?number\\s+of\\s+subscribers"

  str_extract(text, pattern)
}

You can test it with the sample texts you provided:

find_words(text1)
# 'number of subscribers'
find_words(text2)
# 'audited number of subscribers'
find_words(text3)
# 'unaudited number of subscribers'
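
The pattern above drops the word boundaries and the optional plural from the original question, so it would also match inside a longer word and would miss the singular "subscriber". A slightly stricter variant might look like the sketch below. It is only an illustration: the name find_words_strict is hypothetical, and it uses base R (regexpr and regmatches with perl = TRUE) instead of stringr so that it runs without extra packages.

```r
text1 <- "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year."
text2 <- "There is no confirmed audited number of subscribers in the Netflix's earnings report."
text3 <- "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter."

# Hypothetical helper, not part of the original answer.
find_words_strict <- function(text) {
  # \\b            : match only at word boundaries
  # (?:un)?audited : "audited" or "unaudited", as an optional prefix
  # subscribers?   : accept singular or plural, as in the question's pattern
  pattern <- "\\b(?:(?:un)?audited\\s+)?number\\s+of\\s+subscribers?\\b"

  m <- regexpr(pattern, text, perl = TRUE)

  # Return NA where nothing matched, mirroring str_extract()
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m != -1
  out[hit] <- regmatches(text, m)
  out
}

find_words_strict(text1)
# "number of subscribers"
find_words_strict(text2)
# "audited number of subscribers"
find_words_strict(text3)
# "unaudited number of subscribers"
```

The same pattern string should also work with str_extract() if you prefer to stay with stringr.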
