I am working with the following text strings:
text_example <- structure(list(text_string = c(
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON"
)), class = "data.frame", row.names = c(NA, -2L))
I am trying to extract the city and state names from the text strings using a lookup table. Here is the lookup table:
city_example <- structure(list(city_ex = c("DALLAS, TX", "EL PASO, TX"), city = c("DALLAS",
"EL PASO"), State = c("TX", "TX")), class = "data.frame", row.names = c(NA,
-2L))
I would like the final output to look like this:
output_example <- structure(list(text_string = c("BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON",
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", ""), city = c("EL PASO",
"DALLAS", ""), state = c("TX", "TX", "")), class = "data.frame", row.names = c(NA,
-3L))
However, when I run the code below it returns zero results, which should not happen:
output_example <- text_example %>%
separate_rows(text_string) %>%
left_join(city_example, by = c("text_string" = "city_ex")) %>%
filter(!is.na(state)) %>% dplyr::select(text_string, city, state) %>% distinct()
What is wrong with this code, and what is the best way to fix it?
1 Answer
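You can use fuzzyjoin:
Your output_example suggests that unmatched rows should contain empty strings, so you can reproduce that with the following: ... Personally, though, I would prefer NA values (an empty string can sometimes be a legitimate value) or dropping the unmatched rows (using regex_inner_join).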
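As a minimal sketch of the fuzzyjoin approach, not necessarily the exact code the answer had in mind: the original pipeline probably returns nothing because separate_rows() splits each sentence into single words, so no piece ever equals "DALLAS, TX" exactly, and left_join() only matches on exact equality. Note also that the lookup column is named State while the pipeline filters on state. The sketch below instead treats each city_ex value as a regular expression to search for inside text_string. The third input row and the names matched, blanked, and matched_only are hypothetical, added only to show what happens to a row with no match.

```r
library(dplyr)
library(fuzzyjoin)

text_example <- data.frame(
  text_string = c(
    "REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS",
    "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON",
    "NO LOCATION MENTIONED HERE" # hypothetical row, not in the question
  ),
  stringsAsFactors = FALSE
)

city_example <- data.frame(
  city_ex = c("DALLAS, TX", "EL PASO, TX"),
  city = c("DALLAS", "EL PASO"),
  State = c("TX", "TX"),
  stringsAsFactors = FALSE
)

# Keep every text_string row; city and state are NA where no pattern matches.
# Each city_ex value is used as a regex that is searched for in text_string.
matched <- text_example %>%
  regex_left_join(city_example, by = c("text_string" = "city_ex")) %>%
  select(text_string, city, state = State)

# Variant that shows empty strings instead of NA for unmatched rows.
blanked <- matched %>%
  mutate(across(c(city, state), ~ coalesce(.x, "")))

# Variant that drops unmatched rows entirely.
matched_only <- text_example %>%
  regex_inner_join(city_example, by = c("text_string" = "city_ex")) %>%
  select(text_string, city, state = State)
```

Because city_ex is interpreted as a regular expression, lookup values containing characters such as "." or "(" would need escaping before they match literally.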