R: How can I unnest phrases in brackets?

j7dteeu8  posted on 2023-03-27  in Other

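I have some text that I am trying to prepare for text mining with the tidytext library. I have tried setting the token to a regex and supplying a custom pattern, but it returns only the brackets (or nothing at all) instead of the contents of the brackets.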

library(tidytext)
library(stringr)

df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))

un <- unnest_regex(df,elements,text,pattern = "\\[(.*?)\\]")

head(un)
  line                                                                  elements
1    1                                                                          
2    1                                                            mortgage loans
3    2                                                                          
4    2                                                                          
5    2                                                                          
6    2  please indicate the reason(s) you would not purchase this check package.

un2 <- unnest_regex(df,elements,text,pattern = "(?<=\\[).+?(?=\\])")

head(un2)
  line        elements
1    1               [
2    1             ] [
3    1              ][
4    1 ]mortgage loans
5    2               [
6    2             ] [

My end goal is to get this:

line             elements
1    1        [instruction]
2    1           [Mortgage]
3    1       [Show if Q1A5]
4    2         [checkboxes]
5    2              [min 1]
6    2            [max OFF]

Is this possible?

qvk1mo1f #1

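This should work, although it is a bit clumsy. The idea is to use stringr to extract everything inside the brackets and then "explode" the output into one row per item. Because the items are not consistently separated by spaces, the split is done on the closing bracket, which is then added back afterwards.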

library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))

df <- df %>%
    dplyr::mutate(
        text_in_brackets = stringr::str_extract_all(text, "\\[[^()]+\\]")
    ) %>%
    tidyr::separate_rows(text_in_brackets, sep = "]") %>%
    dplyr::filter(text_in_brackets != "") %>%
    dplyr::mutate( # some cleaning
        text_in_brackets = paste0(text_in_brackets, "]"), # add back "]"
        text_in_brackets = stringr::str_trim(text_in_brackets) # remove leading/trailing spaces
    )

Output

# A tibble: 7 × 2
   line text_in_brackets
  <dbl> <chr>           
1     1 [instruction]   
2     1 [Mortgage]      
3     1 [Show if Q1A5]  
4     2 [checkboxes]    
5     2 [min 1]         
6     2 [max OFF]       
7     2 [Show if Q29A2]
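
A more direct variant of the same idea can be sketched as follows. It is only an illustration, assuming brackets are never nested: each bracketed item is matched individually with `"\\[[^\\]]*\\]"`, so there is no need to split on `]` and paste it back. The column name `elements` and the result name `res` are chosen here for illustration and are not part of the answer above.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  text = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans",
           "[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."),
  line = c(1, 2)
)

res <- df %>%
  # list column: one character vector of "[...]" matches per input row
  mutate(elements = str_extract_all(text, "\\[[^\\]]*\\]")) %>%
  select(line, elements) %>%
  # expand the list column so that every match gets its own row
  unnest(elements)

res
```

Because every match already includes both brackets, no clean-up step is needed. Rows without any bracketed item would be dropped by `unnest()` under its default `keep_empty = FALSE`.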

yqlxgs2m #2

With the help of Map, we can use gregexpr (via regmatches) to pull the text out of the brackets and then wrap it in brackets again.

Map(\(x, y, ...) data.frame(line=x, elements=sprintf(y, fmt='[%s]')), 
    df$line, regmatches(df$text, gregexpr(r'{[^[\]]+(?=])}', df$text, perl=TRUE))) |>
  do.call(what=rbind)
#   line        elements
# 1    1   [instruction]
# 2    1      [Mortgage]
# 3    1  [Show if Q1A5]
# 4    2    [checkboxes]
# 5    2         [min 1]
# 6    2       [max OFF]
# 7    2 [Show if Q29A2]
Data:
df <- structure(list(text = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans", 
"[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."
), line = c(1, 2)), class = "data.frame", row.names = c(NA, -2L
))
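
As a side note on why the attempts in the question return the text between the brackets: as far as I can tell from the tidytext documentation, `unnest_regex()` treats `pattern` as the delimiter to split on, not as the thing to extract. The base R sketch below illustrates the difference between splitting on a pattern and extracting its matches. It uses `strsplit()` only as a stand-in for the splitting behaviour (an assumption for illustration, not tidytext's actual internals), and the variable names are hypothetical.

```r
text <- "[instruction] [Mortgage][Show if Q1A5]Mortgage Loans"
pattern <- "\\[(.*?)\\]"

# Splitting: the bracketed matches act as delimiters and are discarded,
# so only the text between and after them is kept
pieces <- strsplit(text, pattern, perl = TRUE)[[1]]

# Extracting: the bracketed matches themselves are returned
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
matches
# [1] "[instruction]"  "[Mortgage]"     "[Show if Q1A5]"
```

This is why both answers extract matches with a regex function instead of passing the pattern to a tokenizer that splits.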
