Running strings against multiple lookup data frames to find matches

Posted by jq6vz3qz on 2022-12-25 in Other

I have a dataset of text strings that looks like this:

strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx", 
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA, 
-8L))

I am trying to find matches in these strings against two separate datasets, firstname and lastname, which look like this:

firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie", 
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA, 
-8L))

lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay", 
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA, 
-8L))

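First, I want to drop everything after the first three words of each string, so "Jennifer Rae Hancock Brown" becomes "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" becomes "Lisa Smith Houston".
Next, I want to check whether the first word of each string matches anything in the firstname data frame. If it does, the match goes into a new column named firstname in the final table. If it does not, the result is just "N/A".
After that, I want to check the remaining words against the lastname data frame. There may be multiple matches (as in the "Lisa Smith Houston" example), in which case both results should be stored in the final data frame.
The final data frame should look like this: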

final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", 
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer", 
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike", 
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker", 
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA, 
-9L))

What is the most efficient way to do this?

Answer #1 (lpwwtiir)

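We can use str_extract_all on the substring column string2 (the first three words of each string). The pattern is built by collapsing the vector of first names or last names into a single string with | (OR) as the delimiter. str_extract_all returns a list of vectors, and unnest then expands those list columns into one row per match.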
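To make the pattern step concrete, here is a minimal sketch of the alternation and the list that comes back. The three-name vector and the two test strings are shortened, hypothetical inputs rather than the question's full data. The last two lines show one possible refinement that is not part of the answer: wrapping the alternation in word boundaries so that a name only matches a whole word.

```r
library(stringr)

# Hypothetical, shortened lookup vector (not the full lastname data frame)
lastnames <- c("Smith", "Houston", "Fay")

# Collapse the names into one regex alternation
pattern <- str_c(lastnames, collapse = "|")
pattern
# "Smith|Houston|Fay"

# str_extract_all() returns a list: one character vector per input string,
# and an empty vector where nothing matched
matches <- str_extract_all(c("Lisa Smith Houston", "Tiny Fitness Studio"), pattern)

# Assumption: a whole-word variant, so "Fay" does not match inside "Fayetteville"
bounded <- str_c("\\b(?:", pattern, ")\\b")
bounded_matches <- str_extract_all("Fayetteville Smith", bounded)[[1]]
```

Because the unbounded pattern matches substrings, the whole-word form may be worth considering when the lookup names are short.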

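library(dplyr)
library(stringr)
library(tidyr)

strings %>%
  mutate(
    # keep only the first three whitespace-separated words
    string2 = str_extract(trimws(string), "^\\S+\\s+\\S+\\s+\\S+"),
    # every first-name match, stored as a list column
    firstname = str_extract_all(string2,
      str_c(firstname$firstnames, collapse = "|")),
    # every last-name match, stored as a list column
    lastname = str_extract_all(string2,
      str_c(lastname$lastnames, collapse = "|"))
  ) %>%
  # one row per match; strings without a match are kept as NA
  unnest(where(is.list), keep_empty = TRUE) %>%
  select(-string2) %>%
  # set lastname to NA whenever no first name was found
  mutate(lastname = case_when(complete.cases(firstname) ~ lastname))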
  • Output
# A tibble: 9 × 3
  string                          firstname lastname
  <chr>                           <chr>     <chr>   
1 "Jennifer Rae Hancock Brown"    Jennifer  Hancock 
2 "Lisa Smith Houston Blogger"    Lisa      Smith   
3 "Lisa Smith Houston Blogger"    Lisa      Houston 
4 "Tina Fay Las Cruces"           Tina      Fay     
5 "\t\nJamie Tucker Style Expert" Jamie     Tucker  
6 "Jessica Wright Htx Satx"       Jessica   Wright  
7 "Julie Green Lifestyle Blogger" Julie     Green   
8 "Mike S Thomas Football Player" Mike      Thomas  
9 "Tiny Fitness Houston Studio"   <NA>      <NA>
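The output above contains real NA values, while the question asks for the literal string "N/A". If that exact form is needed, one possible follow-up step is to replace the missing values afterwards. The sketch below uses a hypothetical two-row stand-in named result instead of the full pipeline output, and result_na is likewise just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row stand-in for the pipeline result
result <- tibble(
  string    = c("Tina Fay Las Cruces", "Tiny Fitness Houston Studio"),
  firstname = c("Tina", NA_character_),
  lastname  = c("Fay", NA_character_)
)

# Replace missing matches with the literal string "N/A"
result_na <- result %>%
  mutate(across(c(firstname, lastname), ~ replace_na(.x, "N/A")))
```

Keeping NA until the very end is usually more convenient for filtering and counting, so this conversion is probably best applied as the last step.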

Expected output from the OP

> final
                              string firstname lastname
1         Jennifer Rae Hancock Brown  Jennifer  Hancock
2         Lisa Smith Houston Blogger      Lisa    Smith
3         Lisa Smith Houston Blogger      Lisa  Houston
4                Tina Fay Las Cruces      Tina      Fay
5      \t\nJamie Tucker Style Expert     Jamie   Tucker
6            Jessica Wright Htx Satx   Jessica   Wright
7      Julie Green Lifestyle Blogger     Julie    Green
8      Mike S Thomas Football Player      Mike   Thomas
9 Tiny George Fitness Houston Studio       N/A      N/A
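The answer's pattern matches names anywhere within the first three words. The question's wording is slightly stricter: the first word is checked against the first names, and only the remaining words against the last names. A rough base R sketch of that positional, whole-word reading follows. It is an alternative under stated assumptions, not the answer's method: match_one is a hypothetical helper, the inputs are a shortened subset of the question's data, and returning "N/A" when a first name matches but no last name does is a guess, since the question does not cover that case.

```r
# Hypothetical subset of the question's data
strings_vec <- c("Lisa Smith Houston Blogger",
                 "\t\nJamie Tucker Style Expert",
                 "Tiny Fitness Houston Studio")
firstnames <- c("Lisa", "Jamie")
lastnames  <- c("Smith", "Houston", "Tucker")

# Hypothetical helper: returns one row per last-name match for a single string
match_one <- function(s) {
  # split on whitespace and keep at most the first three words
  words <- head(strsplit(trimws(s), "\\s+")[[1]], 3)
  # the first word must be a known first name
  if (!(words[1] %in% firstnames)) {
    return(data.frame(string = s, firstname = "N/A", lastname = "N/A",
                      stringsAsFactors = FALSE))
  }
  # check the remaining words against the last names
  rest <- words[-1]
  hits <- rest[rest %in% lastnames]
  # assumption: no last-name match is reported as "N/A"
  if (length(hits) == 0) hits <- "N/A"
  data.frame(string = s, firstname = words[1], lastname = hits,
             stringsAsFactors = FALSE)
}

# Stack the per-string results into one data frame
result <- do.call(rbind, lapply(strings_vec, match_one))
```

On this subset, "Lisa Smith Houston Blogger" yields two rows and "Tiny Fitness Houston Studio" yields a single "N/A" row, because "Tiny" is not a known first name.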
