I have a MapReduce program. When it runs on 1% of the dataset, this is how long it takes:
Job Counters
Launched map tasks=3
Launched reduce tasks=45
Data-local map tasks=1
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=29338
Total time spent by all reduces in occupied slots (ms)=200225
Total time spent by all map tasks (ms)=29338
Total time spent by all reduce tasks (ms)=200225
Total vcore-seconds taken by all map tasks=29338
Total vcore-seconds taken by all reduce tasks=200225
Total megabyte-seconds taken by all map tasks=30042112
Total megabyte-seconds taken by all reduce tasks=205030400
How can I extrapolate from this the time needed to process 100% of the data? My reasoning was that since the 1% sample is a single block, the full run should take roughly 100 times as long, but when run on 100% it actually took about 134 times as long.
Timings for 100% of the data:
Job Counters
Launched map tasks=2113
Launched reduce tasks=45
Data-local map tasks=1996
Rack-local map tasks=117
Total time spent by all maps in occupied slots (ms)=26800451
Total time spent by all reduces in occupied slots (ms)=3607607
Total time spent by all map tasks (ms)=26800451
Total time spent by all reduce tasks (ms)=3607607
Total vcore-seconds taken by all map tasks=26800451
Total vcore-seconds taken by all reduce tasks=3607607
Total megabyte-seconds taken by all map tasks=27443661824
Total megabyte-seconds taken by all reduce tasks=3694189568
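Working the numbers directly from the two counter sets above shows why a flat 100x extrapolation fails: the two phases scale very differently. A quick sketch (all figures are taken verbatim from the counters; note that summed task time only tracks wall-clock time when the cluster offers comparable parallelism in both runs):

```python
# Back-of-the-envelope check of the ~134x observation, using only the
# counter values quoted above.

map_ms_1pct, reduce_ms_1pct = 29_338, 200_225        # 1% run
map_ms_full, reduce_ms_full = 26_800_451, 3_607_607  # 100% run

map_ratio = map_ms_full / map_ms_1pct          # ~913x: map time grows far beyond 100x
reduce_ratio = reduce_ms_full / reduce_ms_1pct # ~18x: reduce time grows far below 100x

total_ratio = (map_ms_full + reduce_ms_full) / (map_ms_1pct + reduce_ms_1pct)
print(f"map: {map_ratio:.0f}x, reduce: {reduce_ratio:.0f}x, combined: {total_ratio:.0f}x")
# combined ~132x, which is in line with the ~134x slowdown observed in practice
```

So the per-phase ratios (roughly 913x for maps, 18x for reduces) average out to about 132x of total task time, which matches the observed ~134x much better than any single-phase linear estimate.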
2条答案
relj7zay1#
Predicting job performance from a run over a small fraction of the data is not trivial. If you look at the logs of the 1% run, it used 45 reducers, and the same number of reducers is used for 100% of the data. That means the time the reducers spend processing the full output of the shuffle and sort phase does not scale linearly with input size.
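You can see that non-linearity in the counters themselves: both runs use 45 reducers, so each reducer's share of the input grows roughly 100x, yet the measured reduce time grows only about 18x (plausibly because fixed per-task startup overhead dominated the tiny run). A quick sketch using the figures quoted in the question:

```python
# Per-reducer view of the reduce phase: the reducer count is fixed at 45 in
# both runs, so per-reducer input grows ~100x, but measured time grows ~18x.

reducers = 45
reduce_ms_1pct, reduce_ms_full = 200_225, 3_607_607

per_reducer_1pct = reduce_ms_1pct / reducers  # ~4.4 s of reduce work each
per_reducer_full = reduce_ms_full / reducers  # ~80 s of reduce work each
print(f"avg per reducer: {per_reducer_1pct/1000:.1f}s -> {per_reducer_full/1000:.1f}s "
      f"({per_reducer_full/per_reducer_1pct:.0f}x for ~100x the data)")
```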
There are mathematical models that can be used to predict MapReduce performance.
Here is one research paper that gives deeper insight into MapReduce performance:
http://personal.denison.edu/~bressoud/graybressoudmcurcsm2012.pdf
Hope this information is useful.
7y4bm7vi2#
As stated previously, predicting the runtime of a MapReduce job is not trivial. The problem is that a job's execution time is defined by the completion time of its last parallel task, and a task's execution time depends on the hardware it runs on, concurrent workloads, data skew, and so on.
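To make the "last task defines the job" point concrete, here is a toy sketch with invented task durations (the 30 s mean and the 5x straggler are illustrative, not taken from the question); it assumes all tasks run in a single wave with free slots available:

```python
# Toy illustration: wall-clock time of a parallel phase is bounded by its
# slowest task, so skew on a single task stretches the whole job.
import random

random.seed(0)
task_ms = [random.gauss(30_000, 3_000) for _ in range(100)]  # 100 well-behaved tasks
task_ms[0] *= 5  # one skewed task (hot key, slow node, ...)

avg = sum(task_ms) / len(task_ms)
makespan = max(task_ms)  # with enough free slots, this bounds the phase's wall clock
print(f"mean task: {avg/1000:.1f}s, job-defining task: {makespan/1000:.1f}s")
```

The mean task time barely moves, but the job still waits on the one straggler, which is exactly why simple linear extrapolation from aggregate counters tends to miss.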
The Starfish project from Duke University may be worth a look. It includes a performance model for Hadoop jobs, can tune job configurations, and offers some visualization features that make debugging easier.