R: Removing all quoted values in a string

Posted by uqxowvwt on 2023-05-08 in Other


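I am using Twitter data to start my first text analysis project in R, and in the preprocessing stage I am trying to remove all values that appear within quotes. I have found code that removes the quotation marks themselves but not the values inside them (e.g., "Hello World" becomes Hello World), but nothing that consistently removes both the values and the quotation marks (e.g., This is a "quoted text" becomes This is a).
I have anonymized a sample data frame that I am working with (keeping the exact formatting of these particular tweets and changing only the content):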

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                              "Text \"this is a quote.\" More text. https://t.co/"))

For this data frame, the goal is to end up with:

Example: https://t.co/ -  MORE TEXT - example: 

Text More text. https://t.co/

I have tried these:

df$text <- gsub('"[^"]+"', '', df$text)

df$text <- gsub('".*"', '', df$text)

df$text <- gsub("[\"'].*['\"]","", df$text)

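However, I have found that these only succeed in removing the quote from the second observation, not the first. I suspect this may have to do with how the second quote was imported from Twitter, with its quotation marks bracketed by \. I am not sure whether this assumption is correct, and if it is, I do not know how to work around it. Any help would be greatly appreciated!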

regex

Source: https://stackoverflow.com/questions/76169101/r-removing-all-quoted-values-in-a-string

3 answers


hl0ma9xz1#

If there are two levels of nested quotes, you can do this:

Base R

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))

df$text |>
  gsub('(“|")[^”"“]*(”|")', '', x = _) |>
  gsub('(“|")[^”"]*(”|")', '', x = _)
#> [1] "Example:  https://t.co/ -  MORE TEXT - example: "
#> [2] "Text  More text. https://t.co/"
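
The two passes above cover exactly two nesting levels. For deeper nesting, one possible generalization is to keep removing innermost quotes until the text stops changing. The following is a minimal sketch in base R and is not part of the original answer: the function name `strip_nested_quotes` and the `max_passes` cap are hypothetical, and the curly quotes are written as `\u201c` and `\u201d` escapes only so that the snippet does not depend on the file encoding.

```r
# Hypothetical helper (not from the answer): repeatedly delete innermost quotes.
strip_nested_quotes <- function(x, max_passes = 10L) {
  # An opening quote, any run of non-quote characters, then a closing quote.
  innermost <- "[\u201c\"][^\u201c\u201d\"]*[\u201d\"]"
  for (i in seq_len(max_passes)) {
    stripped <- gsub(innermost, "", x)
    if (identical(stripped, x)) break  # nothing left to remove
    x <- stripped
  }
  x
}

# Made-up input with three nesting levels of curly quotes.
nested <- "a \u201cb \u201cc \u201cd\u201d e\u201d f\u201d g"
strip_nested_quotes(nested)
#> [1] "a  g"
```

Each pass removes one nesting level, so the loop stops after as many passes as the deepest quote requires (plus one pass to confirm nothing changed).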

Tidyverse

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))
df$text
#> [1] "Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”"
#> [2] "Text \"this is a quote.\" More text. https://t.co/"

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df %>% 
  mutate(text = str_remove_all(text, '(“|")[^”"“]*(”|")'),
         text = str_remove_all(text, '(“|")[^”"]*(”|")'))
#>                                               text
#> 1 Example:  https://t.co/ -  MORE TEXT - example: 
#> 2                   Text  More text. https://t.co/

Posted 2023-05-08

eqfvzcg82#

Here is a solution that uses a single one-line pattern:

library(tidyverse)
df %>%
  mutate(text = str_remove_all(text, '"[^"]+"|“[^“”]+”|“.+”'))
                                              text
1 Example:  https://t.co/ -  MORE TEXT - example: 
2                   Text  More text. https://t.co/

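The pattern uses three alternatives to handle the variability shown in text:

"[^"]+": first alternative: removes simple quotes between " and "
“[^“”]+”: second alternative: removes simple quotes between “ and ”
“.+”: third alternative: removes the parent quote of quotes nested between “ and ”

If the actual data also contains nested " " quotes, that can be handled with one more alternative.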
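
One caveat worth checking against real data: the third alternative, `“.+”`, is greedy, so when a string contains a nested quote followed later by a separate quote, the match runs to the last `”` and also deletes the unquoted text in between. The sketch below is an illustration rather than part of the answer. It uses base R `gsub()` with `perl = TRUE` as a stand-in for `str_remove_all()` (an assumption made to keep the snippet free of dependencies; both try the alternatives from left to right), and the input string is made up.

```r
# Same three alternatives as above; curly quotes written as \u escapes.
pattern <- "\"[^\"]+\"|\u201c[^\u201c\u201d]+\u201d|\u201c.+\u201d"

# Made-up input: a nested quote, unquoted text ("keep"), then a separate quote.
x <- "x \u201ca \u201cb\u201d c\u201d keep \u201cd\u201d y"

gsub(pattern, "", x, perl = TRUE)
#> [1] "x  y"
```

The word keep is lost because the third alternative matches from the first `“` to the final `”`. A pattern that tracks nesting, such as the recursive one in another answer, avoids this.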

Posted 2023-05-08

jucafojl3#

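You can use recursion, ?1 or ?R, to match balanced/nested constructs of “ and ”.
(“([^“”]|(?R))*”) will match (nested) pairs of “ and ”. As an illustration of recursion, a(?R)?z matches one or more letters a followed by exactly the same number of letters z.
For ", it is hard to tell whether there is one nested string or several separate quoted strings.
".*" assumes they are nested but does not check that they come in pairs,
("([^"]|(?R))*") matches quotes nested in pairs, and
"[^"]*" assumes that " is not nested.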

gsub('("([^"]|(?R))*")|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"

gsub('"[^"]*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"                  

gsub('".*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"
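
To see the recursion mechanism in isolation, here is a small sketch built on the letter example. It is an illustration, not part of the answer: it uses `(?1)` (recurse into group 1) instead of `(?R)` so that the `^` and `$` anchors stay outside the recursion, and the test strings are made up.

```r
# (a(?1)?z): an "a", optionally the whole group again, then a "z".
balanced <- "^(a(?1)?z)$"

grepl(balanced, c("az", "aazz", "aaazzz", "aaz", "azz"), perl = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
```

Only strings with equal counts of a and z match, which is the same balancing idea the quote patterns above apply to “ and ”.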

Posted 2023-05-08
