Spark v3.0.0 - WARN DAGScheduler: broadcasting large task binary with size xx

33qvvth1 posted on 2021-05-27 in Spark

I'm new to Spark. I'm writing a machine learning algorithm on Spark standalone (v3.0.0) with the following configuration:

import org.apache.spark.SparkConf;

// Run locally, using all available cores.
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");
conf.set("spark.memory.fraction", "0.6");
conf.set("spark.memory.storageFraction", "0.5");
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");
conf.set("spark.shuffle.file.buffer", "256k");
conf.set("spark.sql.debug.maxToStringFields", "100");
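
For context, a conf like this only takes effect once it is handed to the session. A minimal sketch of that wiring (the variable name spark is illustrative, not from the question):

import org.apache.spark.sql.SparkSession;

// Build the session from the conf above. Note: in local mode,
// spark.driver.memory set programmatically may not take effect,
// because the driver JVM has already started; passing it via
// spark-submit is the reliable route.
SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();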

This is how I create the CrossValidator:

import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.ParamGridBuilder;

ParamMap[] paramGrid = new ParamGridBuilder()
        .addGrid(gbt.maxBins(), new int[]{50})
        .addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
        .addGrid(gbt.maxIter(), new int[]{5, 20, 40})
        .addGrid(gbt.minInfoGain(), new double[]{0.0d, 0.1d, 0.5d})
        .build();

CrossValidator gbcv = new CrossValidator()
        .setEstimator(gbt)
        .setEstimatorParamMaps(paramGrid)
        .setEvaluator(gbevaluator)
        .setNumFolds(5)
        .setParallelism(8)
        .setSeed(session.getArguments().getTrainingRandom());

The problem: when the grid only has maxDepth {2, 5} and maxIter {5, 20}, everything works fine. But with the full grid above (1 × 3 × 3 × 3 = 27 parameter combinations, i.e. 135 fits across 5 folds), it keeps logging

WARN DAGScheduler: broadcasting large task binary with size xx

where xx grows from 1000 KiB to 2.9 MiB, and this frequently ends in a timeout exception. Which Spark parameters should I change to avoid this?

i7uq4tfw (answer #1)

For the timeout issue, consider changing the following configuration:
set spark.sql.autoBroadcastJoinThreshold to -1.
This disables automatic broadcast joins altogether (by default Spark only broadcasts join sides up to 10 MB), so Spark stops trying to ship tables to the executors as broadcasts.
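
In the asker's SparkConf style, that change would look like this (a minimal sketch; conf is the SparkConf from the question):

// Disable auto broadcast joins; by default Spark broadcasts join sides
// smaller than 10 MB, and -1 turns this behavior off entirely.
conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");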
