How to convert str_extract_all output into multiple columns

Posted by ldioqlga on 2023-07-31 in Other

Here is the text:

data$charge[1]
  [1] "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

I'm currently trying to extract statutes from legal data. My code looks like this:

str_extract_all(data$charge[1:3], "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)") 

[[1]]
[1] "21 O.S. 645"      "21 O.S. 1541.1"

[[2]]
[1] "21 O.S. 1435"       "21 O.S. 1760(A)(1)"

[[3]]
[1]   "21 O.S. 1592"


I'd like to add them to the data frame as columns, like this:

id           name           statute1           statute2           statute3
1           BLACK, JOHN     21 O.S. 645        21 O.S. 1541.1     NA
2           DOE, JANE       21 O.S. 1435       21 O.S. 1760(A)(1) NA
3           ROSS, BOB       21 O.S. 1592       NA                 NA


Thanks! Does this make sense?

Answer #1 (djmepvbi)

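Since you haven't included a reproducible example with data or expected output, I can't be sure, but I think what you're looking for is the simplify = TRUE argument of str_extract_all.
Here is the example from ?str_extract_all: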

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")

# without simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b")
[[1]]
[1] "apples"

[[2]]
[1] "bag"   "of"    "flour"

[[3]]
[1] "bag"   "of"    "sugar"

[[4]]
[1] "milk"

# with simplify = TRUE
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
     [,1]     [,2] [,3]   
[1,] "apples" ""   ""     
[2,] "bag"    "of" "flour"
[3,] "bag"    "of" "sugar"
[4,] "milk"   ""   ""

Using the example you added:

dat <- "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

str_extract_all(dat, "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)",
                simplify = TRUE)

     [,1]             
[1,] " 21 O.S. 1541.1"
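
Only one statute comes back above because the lookahead expects ";Count " or ";Docket", while the first statute in this string is followed by "; Count" (with a space after the semicolon). As a rough sketch, assuming each statute simply runs up to the next semicolon, a relaxed pattern picks up both. The relaxed pattern is an assumption, not something from the original post, and may need tuning for real data.

```r
library(stringr)

dat <- "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"

# Original pattern: the first statute is followed by "; Count", which the
# lookahead ";Count " does not accept, so only the second statute matches.
strict <- str_extract_all(
  dat, "(?<=Violation of;)(\\D|\\d){4,20}(?=;Count |;Docket)",
  simplify = TRUE
)

# Assumed relaxed pattern: everything after "Violation of; " up to the next ";".
relaxed <- str_extract_all(dat, "(?<=Violation of; )[^;]+", simplify = TRUE)

print(strict)   # 1 x 1 character matrix
print(relaxed)  # 1 x 2 character matrix, one column per statute
```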

Answer #2 (jei2mxaa)

This isn't the most efficient solution, but unlike the others, it's one I can understand:

df = tribble(
  ~foo,
  "1,2",
  "3,4"
)

df %>% mutate(
  col1 = str_extract_all(foo, "\\d+", simplify = TRUE)[,1],
  col2 = str_extract_all(foo, "\\d+", simplify = TRUE)[,2],
)

Returns:

# A tibble: 2 x 3
  foo   col1  col2 
  <chr> <chr> <chr>
1 1,2   1     2    
2 3,4   3     4
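
If the number of matches isn't known in advance, one possible variation is to call str_extract_all once, name the matrix columns, and bind the whole matrix to the data frame. This is only a sketch: the third row ("5") is a hypothetical addition to show what happens when a row has fewer matches, and the col1, col2 naming just follows the example above.

```r
library(stringr)

# Third row is a hypothetical addition, not part of the original example
df <- data.frame(foo = c("1,2", "3,4", "5"), stringsAsFactors = FALSE)

# One row per input string, one column per match
m <- str_extract_all(df$foo, "\\d+", simplify = TRUE)

# simplify = TRUE pads short rows with "", so turn those into NA
m[m == ""] <- NA

# Name the columns col1, col2, ... however many there are
colnames(m) <- paste0("col", seq_len(ncol(m)))

# data.frame() expands the matrix into one column per matrix column
out <- data.frame(df, m, stringsAsFactors = FALSE)
print(out)
```

This avoids running the regex once per output column, at the cost of being a little less explicit than the mutate version.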

Answer #3 (5uzkadbs)

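You can do this with the tidyverse packages. The regex pattern in your example doesn't work on some of the sample text provided, because it always requires a trailing semicolon. The pattern used below should be simpler, but may need some adjustment depending on the actual text.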

library(tidyverse)

df %>% 
  mutate(charges = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=(;|$))")) %>% # extracts the different charges
  select(-charge) %>%  # dropping the raw text can be skipped
  unnest(charges) %>%  # separates the different charges for each name
  group_by(name) %>%   # in this sample there is only a name, but hopefully the real data has some sort of unique id - there could be lots of Jane Doe's in this data
  mutate(statute = paste0('statute', row_number())) %>% # adds a statute number to each charge
  spread(statute, charges) # shift the data from long to wide

# A tibble: 3 x 3
# Groups:   name [3]
  name       statute1        statute2             
  <chr>      <chr>           <chr>                
1 BLACK,JOHN 21 O.S. 645  21 O.S. 1541.1    
2 DOE, JANE  21 O.S. 1435 21 O.S. 1760(A)(1)
3 ROSS, BOB  21 O.S. 1592 NA

Sample data:

# tibble() replaces data_frame(), which is deprecated
df <- tibble(name = c('BLACK,JOHN', 'DOE, JANE', 'ROSS, BOB'), 
             charge = c('Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1',
                        'Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)',
                        'Count #2 as Filed: In Violation of; 21 O.S. 1592'))
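
For newer versions of tidyr, the same long-to-wide step can probably be written with pivot_wider, which supersedes spread. The sketch below assumes the data has an id column, as in the question's desired output, and groups by that instead of name; the id values and the helper column name key are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(stringr)

# id column assumed here, mirroring the desired output in the question
df <- tibble(
  id = 1:3,
  name = c("BLACK,JOHN", "DOE, JANE", "ROSS, BOB"),
  charge = c(
    "Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1",
    "Count #3 as Filed: In Violation of; 21 O.S. 1435; Count #4 as Filed: In Violation of; 21 O.S. 1760(A)(1)",
    "Count #2 as Filed: In Violation of; 21 O.S. 1592"
  )
)

wide <- df %>%
  # list column with every statute found in each charge string
  mutate(statute = str_extract_all(charge, "(?<=Violation of;\\s).+?(?=;|$)")) %>%
  select(-charge) %>%
  unnest(statute) %>%                               # one row per statute
  group_by(id) %>%
  mutate(key = paste0("statute", row_number())) %>% # statute1, statute2, ...
  ungroup() %>%
  pivot_wider(names_from = key, values_from = statute)

print(wide)
```

Rows with fewer statutes get NA in the unused columns. Note that a row with no match at all would be dropped by unnest with its default settings.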

Answer #4 (guykilcj)

You can use the separate_wider_regex function:

data <- data.frame(
    charge = c("Count #1 as Filed: In Violation of; 21 O.S. 645; Count #2 as Filed: In Violation of; 21 O.S. 1541.1;Docket 1"))

library(tidyr)

separate_wider_regex(
  data, charge,
  patterns = c(
    "Count #1 as Filed: In Violation of; ", statute1 = "[^;]+",
    "; Count #2 as Filed: In Violation of; ", statute2 = "[^;]+",
    "; Count #3 as Filed: In Violation of; ", statute3 = "[^;]+"
  ),
  too_few = "align_start"
)

# Output
# A tibble: 1 × 3
  statute1    statute2       statute3
  <chr>       <chr>          <chr>   
1 21 O.S. 645 21 O.S. 1541.1 NA
