
2wnc66cl  posted on 2021-05-27  in Spark

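After reading the files in a directory of users, I apply some transformations and then need to process the results as a whole (for example, dropping some duplicates based on the data). To do that I call collect() on the RDD.
With 10 directories it runs fine, but with 1000 directories it gets stuck at the collect() call.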

private lazy val sparkSession = SparkSession
  .builder()
  .appName("Custom Job")
  .getOrCreate()


val allDirs: Seq[String] = fs.getAllDirInPath(Configuration.inputDir)
val paths: RDD[String] = SessionWrapper.getSparkContext.parallelize(allDirs)

The transformation and the collect call:

paths.foreachPartition { partition =>
      partition.foreach { dir =>
        val dirData = readDataByDir(dir) // RDD[String]
        val transformed = doTranform(dirData) // RDD[CustomObject]
        val collectedData = transformed.collect()
        // Do something on collected data
      }
}


20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO FileInputFormat: Total input paths to process : 3
20/09/09 19:24:40 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 65935
20/09/09 19:24:40 INFO DAGScheduler: Got job 2 (collect at MyCustomHelperWithCollectCall.scala:18) with 2 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 2 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[102] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 7.2 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO SparkContext: Starting job: collect at MyCustomHelperWithCollectCall.scala:18
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 3.5 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on (size: 3.5 KiB, free: 2004.3 MiB)
20/09/09 19:24:40 INFO SparkContext: Created broadcast 14 from broadcast at DAGScheduler.scala:1200
20/09/09 19:24:40 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[102] at map at MyCustomHelperWithCollectCall.scala:18) (first 15 tasks are for partitions Vector(0, 1))
20/09/09 19:24:40 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
20/09/09 19:24:40 INFO DAGScheduler: Got job 3 (collect at MyCustomHelperWithCollectCall.scala:18) with 1 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 3 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[96] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 7.2 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 3.5 KiB, free 2001.0 MiB)
20/09/09 19:24:40 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on (size: 3.5 KiB, free: 2004.3 MiB)
20/09/09 19:24:40 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1200
20/09/09 19:24:40 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[96] at map at MyCustomHelperWithCollectCall.scala:18) (first 15 tasks are for partitions Vector(0))
20/09/09 19:24:40 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
20/09/09 19:24:40 INFO DAGScheduler: Got job 4 (collect at MyCustomHelperWithCollectCall.scala:18) with 1 output partitions
20/09/09 19:24:40 INFO DAGScheduler: Final stage: ResultStage 4 (collect at MyCustomHelperWithCollectCall.scala:18)
20/09/09 19:24:40 INFO DAGScheduler: Parents of final stage: List()
20/09/09 19:24:40 INFO DAGScheduler: Missing parents: List()
20/09/09 19:24:40 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[101] at map at MyCustomHelperWithCollectCall.scala:18), which has no missing parents
20/09/09 19:24:40 INFO FileInputFormat: Total input paths to process : 5
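
A note on the pattern above: the loop creates RDDs and calls collect() from inside foreachPartition, that is, from inside a task. Spark only supports RDD operations and SparkContext use on the driver. One plausible reading of the log (many "Starting job: collect" lines submitted at once) is that the outer tasks hold the available task slots while the nested jobs wait for free ones, which would explain why a small input finishes and a large one stalls. The sketch below shows one possible restructuring under that assumption: iterate over the directory list on the driver, so each collect() is an ordinary driver-side job. `Record`, `readAndTransform`, and `dedupeByLine` are hypothetical stand-ins for `CustomObject`, `readDataByDir` plus `doTranform`, and the "do something" step; the `local[*]` master and taking directories from `args` are also assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.mutable

object DriverSideCollectSketch {
  // Hypothetical stand-in for the post's CustomObject.
  final case class Record(dir: String, line: String)

  // Hypothetical stand-in for readDataByDir followed by doTranform.
  // textFile accepts a directory and reads the files inside it.
  def readAndTransform(spark: SparkSession, dir: String): RDD[Record] =
    spark.sparkContext.textFile(dir).map(line => Record(dir, line))

  // Hypothetical "do something on collected data" step: keep the first
  // record seen for each distinct line, preserving input order.
  def dedupeByLine(records: Seq[Record]): Seq[Record] = {
    val seen = mutable.HashSet.empty[String]
    records.filter(r => seen.add(r.line)) // add returns false for repeats
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Custom Job")
      .master("local[*]") // assumption: local run; drop this when using spark-submit
      .getOrCreate()

    try {
      // Assumption: directories arrive as program arguments; the post
      // obtains them from fs.getAllDirInPath(Configuration.inputDir).
      val allDirs: Seq[String] = args.toSeq

      // A plain Scala loop on the driver, not an RDD of paths, so every
      // collect() below is submitted from the driver.
      allDirs.foreach { dir =>
        val collected: Array[Record] = readAndTransform(spark, dir).collect()
        val unique = dedupeByLine(collected.toSeq)
        println(s"$dir: ${collected.length} records, ${unique.size} after dedup")
      }
    } finally {
      spark.stop()
    }
  }
}
```

If the deduplication has to span all directories rather than one at a time, another option is to build a single RDD over every directory (textFile also accepts a comma-separated list of paths) and use distinct() or a keyed reduce, which avoids pulling everything onto the driver.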







