regex: How to extract patterns along with dates in a string using R?

Posted by cfh9epnr on 2023-01-21 in Other


I want to extract a fixed phrase (matched by a regex pattern) together with the date that follows it in a sentence.

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."

text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

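The pattern is the phrase "number of subscribers", followed by a date in the form Month Day, Year. Between the phrase and the date there is sometimes "as of", sometimes "in", and sometimes nothing at all.
I tried the script below.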

find_dates <- function(text){
  
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words

  str_extract(text, pattern)

}

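However, this also extracts the words in between, which I want to ignore.
Expected output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'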

regex

Source: https://stackoverflow.com/questions/75166032/how-to-extract-patterns-along-with-dates-in-string-using-r

2 Answers


Answer #1 (soat7uwm)

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

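library(stringr)

find_dates <- function(text){
  # pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  # group 1: the phrase; an optional "as of" / "in" is skipped; group 2: the next 3 words (the date)
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of|\\s+in)?\\s+(\\S+(?:\\s+\\S+){2})"
  # the `group` argument requires stringr >= 1.5.0
  str_extract(text, pattern, group = 1:2)
}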

find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"    

find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
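
The question asks for a single string rather than two separate pieces. A minimal sketch of one way to get that, assuming stringr is installed: `str_match()` returns the full match plus each capture group as matrix columns, and the two groups are then pasted together. The name `find_dates_joined` is hypothetical and used here only for illustration.

```r
library(stringr)

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

# Hypothetical helper: returns "number of subscribers <date>", or NA when nothing matches.
find_dates_joined <- function(text) {
  # group 1: the phrase; optional connector is not captured; group 2: the next three words
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of|\\s+in)?\\s+(\\S+(?:\\s+\\S+){2})"
  m <- str_match(text, pattern)  # column 1: full match, columns 2 and 3: capture groups
  ifelse(is.na(m[, 1]), NA_character_, paste(m[, 2], m[, 3]))
}

find_dates_joined(c(text1, text2))
# [1] "number of subscribers December 31, 2022"
# [2] "number of subscribers January 10, 2023"
```

Because the connector is optional and the whitespace before the date is matched outside it, the same helper should also handle the case where the date follows the phrase directly, for example "number of subscribers March 3, 2021".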

Posted 2023-01-21

Answer #2 (dphi5xsq)

An approach using stringr

library(stringr)

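# First alternative: "number" plus the next two words (with trailing space);
# second alternative: a word followed by "DD, YYYY". \\d{1,2} also accepts single-digit days.
find_Dates <- function(x) paste0(str_extract_all(x, 
  "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{1,2}, \\d{4}")[[1]], collapse="")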

find_Dates(text1)
[1] "number of subscribers December 31, 2022"

# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"

[[2]]
[1] "number of subscribers January 10, 2023"
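
Both answers treat the date as "some words" after the phrase. A stricter variant, sketched here in base R under the assumption that dates always use full English month names, builds the month alternation from R's built-in `month.name` constant. The names `date_pattern` and `find_dates_base` are hypothetical and not part of either answer.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

# "January|February|...|December", built from the base R constant month.name
months_alt <- paste(month.name, collapse = "|")

date_pattern <- paste0(
  "\\b(number\\s+of\\s+subscribers)\\b",            # group 1: the fixed phrase
  "(?:\\s+as\\s+of|\\s+in)?\\s+",                   # optional connector, not captured
  "((?:", months_alt, ")\\s+\\d{1,2},\\s+\\d{4})"   # group 2: Month Day, Year
)

# Hypothetical helper: one result per input string, NA where nothing matches.
find_dates_base <- function(text) {
  m <- regmatches(text, regexec(date_pattern, text, perl = TRUE))
  vapply(
    m,
    function(g) if (length(g) == 3) paste(g[2], g[3]) else NA_character_,
    character(1)
  )
}

find_dates_base(c(text1, text2))
# [1] "number of subscribers December 31, 2022"
# [2] "number of subscribers January 10, 2023"
```

Anchoring on month names means a sentence such as "number of subscribers increased by 5%" yields NA instead of three arbitrary words, at the cost of missing abbreviated months like "Dec".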

Posted 2023-01-21
