How can I fuzzy match strings from two datasets in R?

Posted by bihw5rsg on 2023-09-27 in Other


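I've been working on a way to join two datasets on an imperfect string, such as a company name. In the past I had to match two very dirty lists: one had names and financial information, the other had names and addresses. Neither had a unique ID to match on! Assume that cleaning has already been applied and that typos and insertions may remain.
So far agrep is the closest tool I've found. It uses a generalized Levenshtein distance, which counts the deletions, insertions and substitutions needed to turn one string into another, and it returns the strings that fall within the allowed distance (the most similar ones).
However, I'm having trouble going from matching a single value to applying this across an entire data frame. I've crudely used a for loop to repeat the agrep call, but there has to be an easier way.
See the code below: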

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))

for (i in 1:6){
    a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
    a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
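
For reference, here is a minimal base R sketch of the two ideas in the question. It assumes nothing beyond base R; the names lev and hits and the max.distance value of 0.1 are illustrative choices, not part of the original post. It prints the Levenshtein distance for every pair of names, then shows that agrep can return zero, one or several matches per pattern, which is why assigning its result to a single cell inside a loop is fragile.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Levenshtein distance for every (a, b) pair: rows are a, columns are b
lev <- adist(a$name, b$name)
dimnames(lev) <- list(a$name, b$name)
print(lev)

# agrep may return zero, one or several matches per pattern,
# so the results are kept in a list instead of a single column cell
hits <- lapply(a$name, agrep, x = b$name, value = TRUE, max.distance = 0.1)
names(hits) <- a$name
print(lengths(hits))
```

Keeping the matches in a list avoids the "replacement has length zero" error that a scalar assignment raises when a pattern has no match.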

Source: https://stackoverflow.com/questions/26405895/how-can-i-fuzzy-match-strings-from-two-datasets

7 Answers


crcmnpdw1#

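Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and offers stringdist as one of its fuzzy matching types.
As suggested by @C8H10N4O2, the stringdist method = "jw" creates the best matches for your example.
As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then dplyr::group_by and dplyr::slice_min to keep only the best match, the one with the minimum distance. (slice_min replaces the older top_n. If the original order matters and is not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1) instead.)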

a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))

library(fuzzyjoin); library(dplyr);

stringdist_join(a, b, 
                by = "name",
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(name.x) %>%
  slice_min(order_by = dist, n = 1)

#> # A tibble: 6 x 5
#> # Groups:   name.x [6]
#>   name.x price     name.y   qty       dist
#>   <fctr> <dbl>     <fctr> <dbl>      <dbl>
#> 1 Ace Co    10    Ace Co.     9 0.04761905
#> 2  Bayes    13 Bayes Inc.    99 0.16666667
#> 3    asd     2       asdf    10 0.08333333
#> 4    Bcy     1 Bayes Inc.    99 0.37777778
#> 5   Baes    15 Bayes Inc.    99 0.20000000
#> 6   Bays     1 Bayes Inc.    99 0.20000000


l3zydbqr2#

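The solution depends on the cardinality you want when matching a to b. If it is one-to-one, you get the three closest matches above. If it is many-to-one, you get six.

One-to-one case (requires an assignment algorithm):

When I've had to do this before, I treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you would be better off with optim.
I'm not familiar with agrep, but here is an example that uses stringdist for the distance matrix.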

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment state: 0 for unassigned but assignable, 
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

This produces the assignment:

a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there is a more elegant way to write the greedy assignment heuristic, but the above works for me.
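
For contrast with the greedy heuristic, the following is a rough sketch of an exhaustive search for the one-to-one assignment with the smallest total distance. It is not the answer's method: it swaps in base R's adist (Levenshtein) instead of stringdist's "jw", the names dist_mat, cand, total and optimal are made up for illustration, and brute force is only workable for tiny inputs like this one (6 * 5 * 4 = 120 candidates). Larger problems would need a proper linear assignment solver.

```r
# Toy names from the question
a_name <- c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays")
b_name <- c("Ace Co.", "Bayes Inc.", "asdf")

# Levenshtein distance matrix: rows are a, columns are b
dist_mat <- adist(a_name, b_name)

# Every way to pick one index into a_name for each b ...
cand <- expand.grid(rep(list(seq_along(a_name)), length(b_name)))
# ... keeping only the picks that use each a at most once
cand <- cand[apply(cand, 1, function(r) !anyDuplicated(r)), , drop = FALSE]

# Total distance of each candidate assignment
total <- apply(cand, 1, function(r) sum(dist_mat[cbind(r, seq_along(b_name))]))

# Candidate with the smallest total (first one if there are ties)
best <- unlist(cand[which.min(total), ])
optimal <- data.frame(a = a_name[best],
                      b = b_name,
                      d = dist_mat[cbind(best, seq_along(b_name))],
                      stringsAsFactors = FALSE)
print(optimal)
```

On this data the exhaustive search picks the same three pairs as the greedy heuristic, which is common when the best matches are clear-cut.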

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

This produces the result:

a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

**Edit:** use method="jw" to produce the desired results. See help("stringdist-package")


bz4sfanl3#

I'm not sure whether this is a useful direction for you, John Andrews, but it gives you another tool (from the RecordLinkage package) that might help.

install.packages("ipred")
install.packages("evd")
install.packages("RSQLite")
install.packages("ff")
install.packages("ffbase")
install.packages("ada")
install.packages("~/RecordLinkage_0.4-1.tar.gz", repos = NULL, type = "source")

require(RecordLinkage) # it is not on CRAN so you must load source from Github, and there are 7 dependent packages, as per above

compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))
a$name <- as.character(a$name)
b$name <- as.character(b$name)

test <- compareJW(string = a$name, vec = b$name, cutoff = 0.8)  # pick your level of cutoff, of course
data.frame(name = a$name, price = a$price, test = test)

> data.frame(name = a$name, price = a$price, test = test)
    name price  test
1 Ace Co    10  TRUE
2  Bayes    13  TRUE
3    asd     2  TRUE
4    Bcy     1 FALSE
5   Baes    15  TRUE
6   Bays     1 FALSE
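
One thing to watch in the output above: the two name vectors appear to be compared element by element, with the shorter b$name recycled along a$name, which would explain why Bays (paired with asdf) comes out FALSE. The sketch below compares every name in a against every name in b instead. It is an assumption-laden stand-in, not the RecordLinkage API: it uses base R's adist to build a normalised Levenshtein similarity rather than Jaro-Winkler, and the cutoff of 0.35 and the columns test and best are illustrative.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                price = c(10, 13, 2, 1, 15, 1), stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                qty = c(9, 99, 10), stringsAsFactors = FALSE)

# Stand-in similarity in [0, 1]:
# 1 - Levenshtein distance / length of the longer string of each pair
sim <- 1 - adist(a$name, b$name) / outer(nchar(a$name), nchar(b$name), pmax)

cutoff <- 0.35  # hypothetical threshold; not comparable to a Jaro-Winkler cutoff

# TRUE when a name is similar enough to at least one name in b
a$test <- apply(sim >= cutoff, 1, any)
# Most similar name in b for each row of a
a$best <- b$name[apply(sim, 1, which.max)]
print(a)
```

Because each row is checked against all of b, the result no longer depends on the order or length of b$name.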


8fsztsew4#

Fuzzy Matching

Approximate String Matching means approximately matching one string to another, for example banana and bananas.
Fuzzy Matching means finding an approximate pattern inside a string, for example banana within bananas in pyjamas.

| | Algorithm | R Implementation |
| --|--|--|
| Basic | Bitap ≈ Levenshtein | b$name <- lapply(b$name, agrep, a$name, value=TRUE); merge(a,b) |
| Advanced | ?stringdist::stringdist-metrics | fuzzyjoin::stringdist_join(a, b, mode='full', by=c('name'), method='lv') |
| Fuzzy Match | TRE | agrep2 <- function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))]; b$name <- lapply(b$name, agrep2, a$name); merge(a, b) |
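
The difference between the two definitions can be checked on the banana example with base R's adist. This is a small illustration only; the variable names whole, full and part are made up here.

```r
pattern <- "banana"

# Approximate string matching: whole string against whole string
whole <- drop(adist(pattern, "bananas"))              # one insertion apart

# Against a longer string the whole-string distance grows with the extra text
full <- drop(adist(pattern, "bananas in pyjamas"))

# Fuzzy matching: distance to the best-matching substring instead
part <- drop(adist(pattern, "bananas in pyjamas", partial = TRUE))

print(c(whole = whole, full = full, part = part))
```

With partial = TRUE the pattern is found exactly inside the longer string, so the distance drops to zero even though the two strings differ in length by twelve characters.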

Run it yourself

# Data
a <- data.frame(name=c('Ace Co.', 'Bayes Inc.', 'asdf'), qty=c(9,99,10))
b <- data.frame(name=c('Ace Company', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'), price=c(10,13,2,1,15,1))

# Basic
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, agrep, a$name, value=TRUE)
merge(a, c, all.x=TRUE)

# Advanced
fuzzyjoin::stringdist_join(a, b, mode='full')

# Fuzzy Match
c <- b
c$name.b <- c$name
c$name <- lapply(c$name, function(pattern, x) x[which.min(adist(pattern, x, partial=TRUE))], a$name)
merge(a, c)


ecbunoof5#

I used lapply for this situation:

yournewvector <- lapply(yourvector$yourvariable, agrep, yourothervector$yourothervariable, max.distance=0.01)

Writing the result out as a csv is then not so straightforward:

write.csv(matrix(yournewvector, ncol=1), file="yournewvector.csv", row.names=FALSE)
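
The CSV step is awkward because lapply returns a list whose elements can hold zero, one or several matches. One possible way around it, sketched below in base R, is to collapse each element to a single string before writing. The sketch is an assumption, not the answer's code: it uses value = TRUE so names are stored instead of indices, a max.distance of 0.1, a "; " separator, and a temporary file path.

```r
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                stringsAsFactors = FALSE)

# One list element per name in a, each holding zero or more matching names from b
matches <- lapply(a$name, agrep, x = b$name, value = TRUE, max.distance = 0.1)

# Collapse every element to one string so the result is an ordinary column;
# an element with no matches becomes an empty string
out <- data.frame(name = a$name,
                  matches = vapply(matches, paste, character(1), collapse = "; "),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")  # hypothetical output location
write.csv(out, csv_path, row.names = FALSE)
print(out)
```

The resulting file has one row per input name, which is easier to inspect in a spreadsheet than a one-column matrix of list elements.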


lrpiutwd6#

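I agree with the answer above ("Not familiar with AGREP but here's example using stringdist for your distance matrix."), but adding the signature function below, from Merging Data Sets Based on Partially Matched Data Elements, will make the matching more accurate, since the LV (Levenshtein) distance is calculated from positions, additions and deletions.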

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces.
##So for example, United Kingdom would become kingdomunited
##We might also remove stopwords such as 'the' and 'of'.
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}
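
A short usage sketch of the signature idea follows. The function is repeated so the block runs on its own; the vectors left and right are hypothetical data. Two names that differ only in word order or letter case end up with the same signature, so they can be matched exactly before any distance is computed.

```r
# Same function as above, repeated so this block is self-contained
signature <- function(x) {
  paste(sort(unlist(strsplit(tolower(x), " "))), collapse = "")
}

# Hypothetical name lists that differ only in word order and case
left  <- c("United Kingdom", "Korea South")
right <- c("kingdom united", "South Korea")

left_sig  <- vapply(left, signature, character(1), USE.NAMES = FALSE)
right_sig <- vapply(right, signature, character(1), USE.NAMES = FALSE)
print(left_sig)

# Position in right of the entry whose signature equals each left signature
print(match(left_sig, right_sig))
```

The signatures can also be passed to a string distance function in place of the raw names, so that word order no longer inflates the distance.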


0h4hbjxa7#

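Here is what I did to count how many times a company appears in a list even though the company names are not exact matches:

step.1 Install the phonics package
step.2 Create a new column called "soundexcodes" in "mylistofcompanynames"
step.3 Use the soundex function to fill "soundexcodes" with the Soundex code of each company name
step.4 Copy the company names and their Soundex codes into a new file called "companysoundexcodestrainingfile" with two columns, "companynames" and "soundexcodes"
step.5 Remove the duplicated soundexcodes in companysoundexcodestrainingfile
step.6 Go through the remaining company names and change each one to the way you want it to appear in your original list

Example: Amazon Inc A625 can become Amazon A625, and Accenture Limited A455 can become Accenture A455
step.7 Perform a left_join (or a simple vlookup) by "soundexcodes" between companysoundexcodestrainingfile$soundexcodes and mylistofcompanynames$soundexcodes
step.8 The result should be the original list with a new column called "co.y" that holds each company name the way you left it in the training file
step.9 Sort by "co.y" and check that most of the company names are matched correctly. If they are, replace the old company names with the new ones given by the vlookup on the Soundex code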
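
The lookup part of these steps can be sketched in base R as follows. This is a rough illustration under stated assumptions: make_key is a hypothetical stand-in that uses the lower-cased first word of each name and is NOT a real Soundex code (in practice the key would come from the phonics package as described above), and the company list and canonical names are made up.

```r
# Hypothetical raw list of company names
mylist <- data.frame(co = c("Amazon Inc", "Amazon", "Accenture Limited",
                            "Accenture", "Amazon Inc"),
                     stringsAsFactors = FALSE)

# Stand-in for the Soundex step: lower-cased first word (not a real Soundex code)
make_key <- function(x) tolower(sub(" .*$", "", x))
mylist$key <- make_key(mylist$co)

# Training file: one row per distinct key, with the hand-edited canonical name
training <- data.frame(key = unique(mylist$key), stringsAsFactors = FALSE)
training$co.y <- c("Amazon", "Accenture")  # edited by hand, in the order of training$key

# Left join by key (the "vlookup"), keeping the original row order
mylist$co.y <- training$co.y[match(mylist$key, training$key)]

# Count how often each company appears once names are standardised
counts <- table(mylist$co.y)
print(counts)
```

Using match keeps every row of the original list, so names whose key is missing from the training file show up as NA and can be reviewed by hand.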
