Java: Cascading join of two files is very slow

Posted by dkqlctbz on 2021-06-04 in Hadoop


I am using Cascading to do a HashJoin of two 300 MB files. I run the following Cascading workflow:

// select the field which I need from the first file
Fields f1 = new Fields("id_1");
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );   

// select the fields which I need from the second file 
Fields f2 = new Fields("id_2","category");
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2), Fields.RESULTS ); 

// hashJoin
Pipe tokenPipe = new HashJoin( docPipe1, new Fields("id_1"), 
                     docPipe2, new Fields("id_2"), new LeftJoin());

// count the number of each "category" based on the id_1 matching id_2
Pipe pipe = new Pipe(tokenPipe );
pipe = new GroupBy( pipe , new Fields("category"));
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );

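I run this Cascading program on a Hadoop cluster with 3 datanodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096 MB), but it takes about 30 minutes to get the final result. That seems too slow to me, yet I cannot see anything wrong with either my program or the cluster. How can I make this Cascading join faster?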

Java hadoop cascading

Source: https://stackoverflow.com/questions/20433003/cascading-join-two-files-very-slow

2 Answers


xurqigkl #1

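Your Hadoop cluster may be busy or occupied with other jobs, which would explain the time taken. I don't think replacing HashJoin with CoGroup will help: CoGroup is a reduce-side join, whereas HashJoin is a map-side join, so HashJoin should be more efficient than CoGroup. Your code looks fine, so I suggest trying again on a less busy cluster.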
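To make the "reduce-side" half of that comparison concrete, here is a minimal plain-Java sketch of the idea. It is not Cascading's CoGroup implementation; the class name `ReduceSideJoinSketch`, the `Group` helper and the row layout (`String[]{key, value}`) are assumptions made only for illustration. The point it shows is that every row of both inputs has to be routed to its key's group before any joining happens, which on a real cluster means a shuffle and sort of both files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: an in-memory imitation of a reduce-side (grouping) inner join. */
final class ReduceSideJoinSketch {

    /** Values from both inputs that share one join key (hypothetical helper). */
    static final class Group {
        final List<String> left = new ArrayList<>();
        final List<String> right = new ArrayList<>();
    }

    /** Inner-joins rows of the form {key, value}; returns "key,leftValue,rightValue" strings. */
    static List<String> join(List<String[]> leftRows, List<String[]> rightRows) {
        // "Shuffle" phase: every row from BOTH inputs is moved into the group for its key.
        Map<String, Group> groups = new TreeMap<>();
        for (String[] row : leftRows) {
            groups.computeIfAbsent(row[0], k -> new Group()).left.add(row[1]);
        }
        for (String[] row : rightRows) {
            groups.computeIfAbsent(row[0], k -> new Group()).right.add(row[1]);
        }

        // "Reduce" phase: within each key, pair every left value with every right value.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, Group> entry : groups.entrySet()) {
            for (String leftValue : entry.getValue().left) {
                for (String rightValue : entry.getValue().right) {
                    joined.add(entry.getKey() + "," + leftValue + "," + rightValue);
                }
            }
        }
        return joined;
    }
}
```

A map-side join avoids the grouping step for the left input, which is why it is usually cheaper as long as the other input fits in memory.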

2021-06-04

pkmbmrz7 #2

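As the Cascading User Guide states:
HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, since no grouping is performed for a HashJoin). A very large tuple stream on the right-hand side may therefore exceed a configurable spill-to-disk threshold, reducing performance and potentially causing a memory error. For this reason, it is advisable to put the smaller stream on the right-hand side.
Alternatively, use CoGroup, which may help.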
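The memory behaviour described in the guide can be seen in a small plain-Java sketch of a hash join. This is not Cascading's HashJoin code; the class `HashJoinSketch`, the method `countByCategory` and the `NO_MATCH` label are assumptions introduced for illustration. It mirrors the question's pipeline (left join on id, then count per category): the right-hand rows are all loaded into a map first, while the left-hand ids are only streamed, so memory use grows with the size of the right-hand side.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: an in-memory left hash join followed by a count per category. */
final class HashJoinSketch {

    /** Label used by this sketch for left ids with no right-hand match (an assumption). */
    static final String NO_MATCH = "(no match)";

    /**
     * @param leftIds   streamed side, one id per element (like "id_1")
     * @param rightRows in-memory side, rows of the form {id, category} (like "id_2", "category")
     * @return number of joined rows per category, sorted by category name
     */
    static Map<String, Long> countByCategory(Iterable<String> leftIds, Iterable<String[]> rightRows) {
        // Build phase: the ENTIRE right-hand side is held in memory, keyed by id.
        Map<String, List<String>> categoriesById = new HashMap<>();
        for (String[] row : rightRows) {
            categoriesById.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Probe phase: the left-hand side is read once, one id at a time.
        Map<String, Long> counts = new TreeMap<>();
        for (String id : leftIds) {
            List<String> matches = categoriesById.get(id);
            if (matches == null) {
                // Left join: keep the unmatched left row, counted under a placeholder label.
                counts.merge(NO_MATCH, 1L, Long::sum);
            } else {
                for (String category : matches) {
                    counts.merge(category, 1L, Long::sum);
                }
            }
        }
        return counts;
    }
}
```

Because only the right-hand input ends up in the map, swapping the inputs so that the smaller one is on the right reduces the memory needed. When both inputs are about the same size, as with the two 300 MB files in the question, the swap gains little.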

2021-06-04