regex: How to extract a pattern together with a date from a string in R?

Posted by cfh9epnr on 2023-01-21 in Other

I want to extract a regex pattern together with a date from a sentence (the date comes after the pattern).

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."

text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

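The pattern is "number of subscribers", followed by a date in the form Month Day, Year. Sometimes "as of" or "in" sits between the pattern and the date, and sometimes there are no characters between them.
I tried the script below.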

library(stringr)

find_dates <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # phrase plus the next 4 whitespace-separated words
  str_extract(text, pattern)
}

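However, this also extracts the words in between, which I want to ignore.
Expected output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'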

Answer 1 (soat7uwm)

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

library(stringr)

find_dates <- function(text){
  # Group 1 captures the phrase, group 2 the next 3 words (the date).
  # The connector ("as of" or "in") is optional and not captured.
  pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(?:as\\s+of\\s+|in\\s+)?(\\S+(?:\\s+\\S+){2})"
  # group = 1:2 returns both capture groups; the group argument needs stringr 1.5.0 or later
  str_extract(text, pattern, group = 1:2)
}

find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"    

find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
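
The answer above depends on the group argument of str_extract, which older stringr installations do not have. As a rough sketch of the same idea in base R, the version below reads the capture groups with regexec and regmatches and joins them into the single string the question asks for. The helper name find_dates_base is made up for illustration, and the date part assumes a capitalized month name followed by a one- or two-digit day and a four-digit year.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

# Hypothetical helper (not part of the answer above): returns
# "<phrase> <date>" per input string, or NA when nothing matches.
find_dates_base <- function(text) {
  pattern <- paste0(
    "\\b(number\\s+of\\s+subscribers)\\s+",  # group 1: the fixed phrase
    "(?:as\\s+of\\s+|in\\s+)?",              # optional connector, not captured
    "([A-Z][a-z]+\\s+\\d{1,2},\\s+\\d{4})"   # group 2: Month Day, Year (assumed shape)
  )
  # regexec() yields the full match followed by one element per capture group
  hits <- regmatches(text, regexec(pattern, text, perl = TRUE))
  vapply(hits, function(m) {
    if (length(m) == 0) NA_character_ else paste(m[2], m[3])
  }, character(1), USE.NAMES = FALSE)
}

find_dates_base(c(text1, text2))
# [1] "number of subscribers December 31, 2022"
# [2] "number of subscribers January 10, 2023"
```

Because the connector is optional, an input such as "number of subscribers December 31, 2022" should come back unchanged, and a string without the phrase gives NA.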

Answer 2 (dphi5xsq)

An approach using stringr

library(stringr)

# First alternative: "number" plus the next two words.
# Second alternative: any word followed by a two-digit day and a four-digit year.
find_Dates <- function(x) paste0(str_extract_all(x,
  "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse="")
find_Dates(text1)
# [1] "number of subscribers December 31, 2022"

# all texts
lapply(c(text1, text2), find_Dates)
# [[1]]
# [1] "number of subscribers December 31, 2022"
#
# [[2]]
# [1] "number of subscribers January 10, 2023"
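
The date part of the pattern above accepts any word in front of the day and requires a two-digit day, so a date like "March 5, 2023" would be skipped. If that matters for your data, one possible tightening is to build the month alternation from R's built-in month.name constant and allow one or two digits for the day. This is only a sketch in base R; the helper name extract_month_dates and the third example sentence are invented for illustration.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
text3 <- "The plan was renewed on March 5, 2023."  # made-up example with a one-digit day

# Hypothetical helper (not part of the answer above): returns a list with
# the "Month Day, Year" dates found in each input string.
extract_month_dates <- function(text) {
  months <- paste(month.name, collapse = "|")  # "January|February|...|December"
  date_pattern <- paste0("\\b(?:", months, ")\\s+\\d{1,2},\\s+\\d{4}\\b")
  regmatches(text, gregexpr(date_pattern, text, perl = TRUE))
}

unlist(extract_month_dates(c(text1, text2, text3)))
# [1] "December 31, 2022" "January 10, 2023"  "March 5, 2023"
```

Restricting the match to full English month names is a deliberate trade-off: it avoids false hits on arbitrary words, but abbreviated months such as "Dec" would need to be added to the alternation.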
