Java Spark Dataset can select, but cannot groupBy, filter, or aggregate

d7v8vwbk  posted on 2021-05-24  in Spark

I want to aggregate my data in Java using the Spark SQL Dataset/DataFrame API. However, it throws an error:

Job aborted due to stage failure: Task serialization failed: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.storage.StorageUtils$
    at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:118)
    at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:295)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1163)
    at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1071)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1014)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2069)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

My code is as follows:

Dataset<Row> dataset = sparkSession.createDataFrame(rdd, MyPojo.class); // where rdd has type JavaRDD<MyPojo>
dataset.collectAsList();

Why is this error thrown?
