Set operation diff on two files

gk7wooem posted on 2021-05-30 in Hadoop


I want to get the difference between two flat/CSV files. The source and target will have the same schema.
For example,
source.txt:
empid | regionid | sales
001 | r01 | $10000
002 | r02 | $20000
003 | r03 | $30000
target.txt:
empid | regionid | sales
001 | r01 | $10000
002 | r02 | $10000
004 | r04 | $40000
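The result should be:
empid1 | regionid1 | sales1 | empid2 | regionid2 | sales2 | result status
001 | r01 | $10000 | 001 | r01 | $10000 | matched
002 | r02 | $20000 | 002 | r02 | $10000 | unmatched
003 | r03 | $30000 | NULL | NULL | NULL | unmatched
NULL | NULL | NULL | 004 | r04 | $40000 | unmatched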
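Any help would be appreciated!
Edit:
Assume the two files are large. The problem may look simple, but I am trying to find the best approach, and performance is the main criterion here. The technology can be anything, even Hadoop MapReduce. I tried using Hive, but it was somewhat slow.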

hadoop Hive Algorithm Set

Source: https://stackoverflow.com/questions/27035358/set-operation-diff-on-two-files

1 answer


jm2pwxwz1#

Here is a map-reduce approach to the problem (in high-level pseudocode):

map(source):
   for each line x|y|z:
     emitIntermediate(x,(1,y|z))
map(target):
   for each line x|y|z:
     emitIntermediate(x,(2,y|z))

// Make sure each list is sorted (or sort it yourself) so that tag 1 comes before tag 2 when both exist.
reduce(x, list):
   if list.size() == 1:
      (idx,y|z) <- list.first() // each list element has this form
      if idx == 1:
            emit(x|y|z|NULL|NULL|NULL|unmatched)
      else:
            emit(NULL|NULL|NULL|x|y|z|unmatched)
   else:
       (1,y1|z1) <- list.first()
       (2,y2|z2) <- list.last()
       m = (y1|z1 matches y2|z2 ? "matched" : "unmatched")
       emit(x|y1|z1|x|y2|z2|m)

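The idea is to partition the data by id in the map phase, so that each reducer receives every record for a given id and can check whether region and sales match.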
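To make the idea concrete, here is a minimal single-process sketch in Python that mimics the map, shuffle, and reduce steps on the example data from the question. The function names (`map_file`, `reduce_key`, `diff`) are made up for illustration, the dictionary only stands in for the framework's shuffle phase, and the sketch assumes each empid occurs at most once per file.

```python
from collections import defaultdict

SOURCE = """empid|regionid|sales
001|r01|$10000
002|r02|$20000
003|r03|$30000"""

TARGET = """empid|regionid|sales
001|r01|$10000
002|r02|$10000
004|r04|$40000"""


def map_file(text, tag):
    """Map step: emit (empid, (tag, 'regionid|sales')) for each data line."""
    for line in text.splitlines()[1:]:  # skip the header row
        key, rest = line.split("|", 1)
        yield key.strip(), (tag, rest.strip())


def reduce_key(key, values):
    """Reduce step: compare the source (tag 1) and target (tag 2) records."""
    by_tag = dict(values)  # assumption: at most one record per tag and key
    src, tgt = by_tag.get(1), by_tag.get(2)
    left = f"{key}|{src}" if src is not None else "NULL|NULL|NULL"
    right = f"{key}|{tgt}" if tgt is not None else "NULL|NULL|NULL"
    status = "matched" if src is not None and src == tgt else "unmatched"
    return f"{left}|{right}|{status}"


def diff(source_text, target_text):
    """Run map, group by key (the 'shuffle'), then reduce each key."""
    groups = defaultdict(list)
    for key, value in map_file(source_text, 1):
        groups[key].append(value)
    for key, value in map_file(target_text, 2):
        groups[key].append(value)
    return [reduce_key(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    for row in diff(SOURCE, TARGET):
        print(row)
```

Looking records up by tag instead of relying on list order means the reducer does not depend on the values arriving sorted.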
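Implementing it on a large cluster (with the data in a distributed file system) can significantly improve performance, because the map-reduce framework distributes the work across the cluster.
For example, Hadoop can be used as the implementation framework.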
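For a rough idea of how the reducer could look when the framework delivers records as sorted text lines (the style used by streaming-type jobs), here is a hedged Python sketch. The tab-separated `empid<TAB>tag<TAB>regionid|sales` line layout and the names `map_lines` and `reduce_sorted` are assumptions made for this example, not something a particular framework prescribes. It again assumes at most one record per empid in each file.

```python
from itertools import groupby

NULLS = "NULL|NULL|NULL"


def map_lines(lines, tag):
    """Turn 'empid|regionid|sales' lines into 'empid<TAB>tag<TAB>regionid|sales'."""
    for line in lines:
        key, rest = line.strip().split("|", 1)
        yield f"{key}\t{tag}\t{rest}"


def reduce_sorted(lines):
    """Consume tagged lines sorted by key, holding one key group at a time."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in lines)
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        by_tag = {tag: rest for _, tag, rest in group}
        src, tgt = by_tag.get("1"), by_tag.get("2")
        left = f"{key}|{src}" if src is not None else NULLS
        right = f"{key}|{tgt}" if tgt is not None else NULLS
        status = "matched" if src is not None and src == tgt else "unmatched"
        yield f"{left}|{right}|{status}"


if __name__ == "__main__":
    source = ["001|r01|$10000", "002|r02|$20000", "003|r03|$30000"]
    target = ["001|r01|$10000", "002|r02|$10000", "004|r04|$40000"]
    # sorted() stands in for the framework's sort-by-key between map and reduce.
    tagged = sorted(list(map_lines(source, 1)) + list(map_lines(target, 2)))
    for row in reduce_sorted(tagged):
        print(row)
```

Because `groupby` only needs equal keys to be adjacent, the reducer keeps a single key group in memory, which matters when both files are large.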

2021-05-30
