大 Dataframe 的R模糊连接

qmb5sa22 于 2023-02-06 发布在其他

关注(0)|答案(2)|浏览(144)

我想做一个基于模糊匹配的left_join(df1, df2)。我的df1是100k行，我的df2是25k行。基本上我想用jaro winkler方法计算两个 Dataframe 的join_colum之间的字符串相似度。所以这将是100k * 25k的比较。
我希望只有一个匹配而不是多个匹配，所以我的目标是选择字符串相似度最高的匹配，除此之外，我希望只有当jaro winkler字符串相似度至少为0.8时才加入（假设这个条件在我编的例子中得到满足）。
我试着展示输入的样子和我想要的输出，但是由于 Dataframe 的大小，我在实际实现上很挣扎。我已经试过fuzzyjoin库中的函数，但是我遇到了内存问题。我也试过strindist包中的amatch函数，但是我不能让它工作。
这两个解决方案我无法调整，以解决我的问题：fuzzy LEFT join with R，fuzzy outer join/merge in R

library(tidyverse)

df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  join_column = c("alice123burgerstorechicago", "alicewonderland",
                  "bubbletea45london", "blueonion",
                  "chandle34song", "crazyjoeohio",
                  "donaldduckshop123", "dartcommunitygermany",
                  "evergreen78hall", "exittheroom15florida"))

df2 <- data.frame(
  id = c(15, 16, 18, 20),
  join_column = c("aliceburgerstorechicag", 
                  "bubbletealndon",
                  "crazyjoeohio178",
                  "exittheroom25florid"))

pre_target <- df1
target <- pre_target %>% 
  mutate(joined_with_id = c(15, NA, 16, NA, NA, 18, NA, NA, NA, 20),
         joined_with_string = c("aliceburgerstorechicag", NA, 
                                "bubbletealndon", NA, NA, 
                                "crazyjoeohio178", NA, NA, NA, 
                                "exittheroom25florid"))

来源：https://stackoverflow.com/questions/69930176/r-fuzzy-join-with-big-dataframes

2条答案

按热度按时间

ldioqlga1#

无论你如何处理这个问题，比较的数量是巨大的，而且总是很慢。没有模糊连接这样的东西可以阻止这一点。模糊匹配需要进行，而连接是容易的部分，因为你不尝试在模糊部分连接，而是在基于匹配找到的匹配上连接。
下面是一个关于您的小样本集的演示，它至少显示了如何进行连接。

library(data.table)
library(stringdist)

setDT(df1)
setDT(df2)

df1_options_to_join <- df1$join_column

my_best_match_function <- function(x) {
  z <- stringdist(x, df1_options_to_join, method = "jw")
  z <- setNames(df1_options_to_join, z)
  if(as.numeric(min(names(z))) <= 0.20) { # equals your 0.80 similarity
    return(z[min(names(z))])
  } else {
    return(NA)
  }
}

df2[, match_to_join := my_best_match_function(join_column), by = join_column]

target <- merge(df1, df2, by.x = "join_column", by.y = "match_to_join", all = T)

赞(0）回复(0）举报 2023-02-06

x33g5p2x2#

完全的自我提升，但是我已经写了一个R包zoomerjoin，它使用MinHashing，允许你模糊地连接大型数据集，而不必比较两个 Dataframe 之间的所有行对，这意味着你可以在现代数据科学笔记本电脑上在几秒钟或几分钟内合并中等大小（数百万行）的 Dataframe ，而不会耗尽内存。
最小散列算法是随机算法，意味着存在一些应该连接的对将不被识别的小概率，但是可以通过调整超参数使该概率任意小。
下面是我将如何使用包来连接这些 Dataframe ：

library(tidyverse)
library(fuzzyjoin)

# Must have the Rust compiler installed to install this package
# see https://github.com/beniaminogreen/zoomerjoin for more details 
devtools::install_github("beniaminogreen/zoomerjoin")
library(zoomerjoin)

df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  join_column = c("alice123burgerstorechicago", "alicewonderland",
                  "bubbletea45london", "blueonion",
                  "chandle34song", "crazyjoeohio",
                  "donaldduckshop123", "dartcommunitygermany",
                  "evergreen78hall", "exittheroom15florida"))

df2 <- data.frame(
  id = c(15, 16, 18, 20),
  join_column = c("aliceburgerstorechicag",
                  "bubbletealndon",
                  "crazyjoeohio178",
                  "exittheroom25florid"))

## Stringdist Command
stringdist_out <- stringdist_inner_join(df1, df2,
                             by = "join_column",
                             max_dist =.4,
                             method = "jaccard")
# id.x              join_column.x id.y          join_column.y
# 1    1 alice123burgerstorechicago   15 aliceburgerstorechicag
# 2    3          bubbletea45london   16         bubbletealndon
# 3    4                  blueonion   16         bubbletealndon
# 4    6               crazyjoeohio   18        crazyjoeohio178
# 5   10       exittheroom15florida   20    exittheroom25florid


## Zoomerjoin Analogue
zoomerjoin_out <- lsh_inner_join(df1,df2,
                      by = "join_column",
                      threshold =.6,
                      n_gram_width = 1)
# id.x              join_column.x id.y          join_column.y
#   3          bubbletea45london   16         bubbletealndon
#   1 alice123burgerstorechicago   15 aliceburgerstorechicag
#   6               crazyjoeohio   18        crazyjoeohio178
#   4                  blueonion   16         bubbletealndon
#  10       exittheroom15florida   20    exittheroom25florid

该软件包已经过fuzzyjoin软件包的测试，以验证其是否正常工作，但仍在开发中，因此，如果文档中有任何错误/问题，请随时提交问题。

赞(0）回复(0）举报 2023-02-06

我来回答

大 Dataframe 的R模糊连接

2条答案

相关问题

热门标签

最新问答