I'm new here. I'm writing a machine learning algorithm on Spark standalone (v3.0.0) with the following configuration:
SparkConf conf = new SparkConf();
conf.setMaster("local[*]");                           // local mode on all available cores
conf.set("spark.driver.memory", "8g");
conf.set("spark.driver.maxResultSize", "8g");         // cap on serialized results collected to the driver
conf.set("spark.memory.fraction", "0.6");             // share of heap for execution + storage
conf.set("spark.memory.storageFraction", "0.5");      // part of that share protected from eviction
conf.set("spark.sql.shuffle.partitions", "5");
conf.set("spark.memory.offHeap.enabled", "false");
conf.set("spark.reducer.maxSizeInFlight", "96m");     // shuffle data fetched concurrently per reduce task
conf.set("spark.shuffle.file.buffer", "256k");        // in-memory buffer per shuffle file writer
conf.set("spark.sql.debug.maxToStringFields", "100");
And this is how I create the CrossValidator:
ParamMap[] paramGrid = new ParamGridBuilder()
.addGrid(gbt.maxBins(), new int[]{50})
.addGrid(gbt.maxDepth(), new int[]{2, 5, 10})
.addGrid(gbt.maxIter(), new int[]{5, 20, 40})
.addGrid(gbt.minInfoGain(), new double[]{0.0d, .1d, .5d})
.build();
CrossValidator gbcv = new CrossValidator()
.setEstimator(gbt)
.setEstimatorParamMaps(paramGrid)
.setEvaluator(gbevaluator)
.setNumFolds(5)
.setParallelism(8)
.setSeed(session.getArguments().getTrainingRandom());
The problem: when the param grid only has maxDepth {2, 5} and maxIter {5, 20}, everything works fine, but with the grid shown above it keeps logging:
WARN DAGScheduler: Broadcasting large task binary with size xx
where xx grows from about 1000 KiB to 2.9 MiB, and this frequently ends in timeout exceptions. Which Spark parameters should I change to avoid this?
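For scale, the full grid above expands to 1 × 3 × 3 × 3 = 27 parameter combinations, so 5-fold cross-validation fits 27 × 5 = 135 models, versus 1 × 2 × 2 × 3 = 12 combinations (60 fits) for the smaller grid that works.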
1 Answer
For the timeout issue, consider changing the following configuration:
Set spark.sql.autoBroadcastJoinThreshold to -1.
This removes the 10 MB limit on broadcast size.
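A minimal sketch of applying that suggestion to the SparkConf from the question (the property name and the -1 sentinel come from the Spark configuration reference; nothing else is assumed beyond the asker's existing conf object):
// Disable the automatic broadcast join threshold entirely
// (the default is 10 MB; -1 turns the optimization off).
conf.set("spark.sql.autoBroadcastJoinThreshold", "-1");
The same setting can be passed at launch time with --conf spark.sql.autoBroadcastJoinThreshold=-1. Keep the trade-off in mind: with the threshold disabled, joins that would have been broadcast fall back to other strategies such as sort-merge joins.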