I have a dataset of text strings that looks like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in these strings against two different datasets, firstname and lastname, which look like this:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
The first thing I want to do is remove everything after the first three words of each string, so "Jennifer Rae Hancock Brown" becomes "Jennifer Rae Hancock" and "Lisa Smith Houston Blogger" becomes "Lisa Smith Houston".
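Next, I want to evaluate the first word of each string to see whether it matches anything in the firstname data frame. If it does, the result is stored in a new column named firstname in the final table. If it does not, the result is simply "N/A".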
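After that, I want to evaluate the remaining words against the lastname data frame. There can be more than one match (as in the "Lisa Smith Houston" example); in that case, both results are stored in the final data frame.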
The final data frame should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What is the most efficient way to do this?
1 Answer
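We can use str_extract_all on the "string2" substring, with the pattern built by collapsing the first-name and last-name vectors into a single string that uses | (OR) as the separator. This returns a list of vectors, and unnest then expands that list into a regular vector column, as in the OP's expected output.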
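The code that accompanied this answer is not shown here, so the following is only a minimal sketch of the approach described, not necessarily the answerer's exact code. It assumes the dplyr, stringr and tidyr packages, and that "string2" means the first three words of each whitespace-trimmed string. The helper names first_names, last_names, last_pat, first_word and rest are made up for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

strings <- data.frame(string = c(
  "Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
  "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
  "Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger",
  "Mike S Thomas Football Player", "Tiny Fitness Houston Studio"))
firstname <- data.frame(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
  "Jessica", "Julie", "Mike", "George"))
lastname <- data.frame(lastnames = c("Hancock", "Smith", "Houston", "Fay",
  "Tucker", "Wright", "Green", "Thomas"))

# Plain vectors, so the lookup tables cannot be confused with the
# output columns that are also called firstname and lastname.
first_names <- firstname$firstnames
last_names  <- lastname$lastnames

# Collapse the last names into one alternation pattern: \b(Hancock|Smith|...)\b
last_pat <- str_c("\\b(", str_c(last_names, collapse = "|"), ")\\b")

result <- strings %>%
  mutate(
    # Trim and collapse whitespace, then keep at most the first three words.
    string2    = str_extract(str_squish(string), "^\\S+(?:\\s+\\S+){0,2}"),
    first_word = str_extract(string2, "^\\S+"),
    firstname  = if_else(first_word %in% first_names, first_word, "N/A"),
    # Everything after the first word is checked against the last names.
    rest       = str_remove(string2, "^\\S+\\s*"),
    lastname   = str_extract_all(rest, last_pat)  # list column, one vector per row
  ) %>%
  # One output row per matched last name; rows without a match are kept as NA.
  unnest(lastname, keep_empty = TRUE) %>%
  mutate(lastname = replace_na(lastname, "N/A")) %>%
  select(string, firstname, lastname)

print(result)
```

With the input exactly as posted, the last row ("Tiny Fitness Houston Studio") gets firstname "N/A" but lastname "Houston", because "Houston" falls within its first three words. The expected output in the question lists a different string for that row ("Tiny George Fitness Houston Studio"), for which the same code would give "N/A" in both columns.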