R: Matching text strings against a source file

Posted by uemypmqf on 2022-12-20 in Other

I am working with the following text strings:

text_example <- structure(list(text_string = c( 
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON"
)), class = "data.frame", row.names = c(NA, -2L))

I am trying to use a source file to extract the city and state names from the text strings. Here is the source file:

city_example <- structure(list(city_ex = c("DALLAS, TX", "EL PASO, TX"), city = c("DALLAS", 
"EL PASO"), State = c("TX", "TX")), class = "data.frame", row.names = c(NA, 
-2L))

I would like the final output to look like this:

output_example <- structure(list(text_string = c("BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON", 
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", ""), city = c("EL PASO", 
"DALLAS", ""), state = c("TX", "TX", "")), class = "data.frame", row.names = c(NA, 
-3L))

But when I run the code below, it returns zero results, which should not happen:

output_example <- text_example %>%
  separate_rows(text_string) %>%
  left_join(city_example, by = c("text_string" = "city_ex")) %>%
  filter(!is.na(state)) %>% dplyr::select(text_string, city, state) %>% distinct()

What is going wrong with this code, and what is the best way to fix it?

Answer #1 (fhg3lkii)

You can use fuzzyjoin:

fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex"))
#                                           text_string     city_ex    city State
# 1                   COLLEGE STUDENT LIVING IN HOUSTON        <NA>    <NA>  <NA>
# 2        REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS  DALLAS, TX  DALLAS    TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO, TX EL PASO    TX
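
As for why the original attempt finds nothing: one likely reason is that separate_rows(), with its default separator, breaks each sentence into single words, while left_join() only matches whole values exactly, so a key such as "DALLAS, TX" never lines up with any row. Note also that the column is named State in city_example, whereas the filter refers to state, and R is case-sensitive. The base R sketch below illustrates the first point. The split pattern is assumed to be the tidyr default, and the variable names (tokens, hits) are illustrative only.

```r
# Example data copied from the question
text_string <- c(
  "REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS",
  "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON"
)
city_ex <- c("DALLAS, TX", "EL PASO, TX")

# Split on the pattern assumed to be separate_rows()'s default separator:
# every run of characters that is neither alphanumeric nor a period
tokens <- unlist(strsplit(text_string, "[^[:alnum:].]+"))

# An exact join compares whole values, and no single word
# can equal a key that contains a comma and a space
any(tokens %in% city_ex)

# Substring matching, by contrast, finds each key inside its sentence.
# Rows correspond to text_string, columns to city_ex.
hits <- sapply(city_ex, grepl, x = text_string, fixed = TRUE)
hits
```

This is why a join that tests for containment (such as the regex join above) succeeds where the exact join does not.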

Your output_example suggests that unmatched rows should contain empty strings, so you can reproduce that with the following:

library(dplyr)
fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex")) %>%
  mutate(across(everything(), ~ if_else(is.na(city_ex), "", .))) %>%
  select(-city_ex)
#                                           text_string    city State
# 1                                                                  
# 2        REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS  DALLAS    TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO    TX

...although I personally prefer to keep the NA values (an empty string is sometimes a legitimate value) or to drop the rows (using regex_inner_join).
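
For comparison, here is a minimal base R sketch of the two preferred variants (keep NA, or drop unmatched rows) without fuzzyjoin. It is not the fuzzyjoin implementation: it uses fixed-string matching rather than regular expressions and keeps only the first matching city per row. The helper first_match is hypothetical, and the third text string is taken from the printed output above, since the posted text_example lists only two strings.

```r
text_example <- data.frame(
  text_string = c(
    "REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS",
    "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON",
    "COLLEGE STUDENT LIVING IN HOUSTON"  # assumed third row, see lead-in
  ),
  stringsAsFactors = FALSE
)
city_example <- data.frame(
  city_ex = c("DALLAS, TX", "EL PASO, TX"),
  city    = c("DALLAS", "EL PASO"),
  State   = c("TX", "TX"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first key contained in s, or NA if none
first_match <- function(s, keys) {
  hit <- which(vapply(keys, grepl, logical(1), x = s, fixed = TRUE))
  if (length(hit) > 0) hit[[1]] else NA_integer_
}

idx <- vapply(text_example$text_string, first_match, integer(1),
              keys = city_example$city_ex, USE.NAMES = FALSE)

# Left-join-like result: indexing with NA yields an all-NA row
left_result <- cbind(text_example, city_example[idx, c("city", "State")])
rownames(left_result) <- NULL

# Inner-join-like result: drop the rows that found no city
inner_result <- left_result[!is.na(left_result$city), ]
```

Fixed matching sidesteps one pitfall of the regex join: a key containing regex metacharacters (a period, for instance) is compared literally instead of being interpreted as a pattern.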
