Spark select statement causes the job to shut down without logging any error

vsnjm48y · asked 2021-05-27 · in Spark
Follow (0) | Answers (0) | Views (239)

I'm new here and new to Spark, just trying my hand at it. In my project I read data from an S3 bucket, create a DataFrame, and filter it on a condition. At the line with the select, the Spark job shuts down without any error and I never get the filtered frame. printSchema works fine, and the DataFrame count even prints, but the moment it hits the select query (or a write-to-S3 request) the shutdown hook is invoked and the whole process closes. My command is:

spark-submit --master "local[*]" --class com.test.action.myjob test_job.jar
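
One thing worth ruling out (my speculation, not something the logs establish): in local mode everything runs in a single JVM, and merging a 400-plus-field Parquet schema can exhaust driver memory, which can kill the JVM without any Spark-level error. A hedged variant of the same command that simply raises the driver memory; the 4g value is illustrative:

spark-submit --master "local[*]" --driver-memory 4g --class com.test.action.myjob test_job.jar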

The logs around the failing step are:

2020-07-17 14:23:27 INFO  FileSourceStrategy:54 - Pruning directories with: 
2020-07-17 14:23:27 INFO  FileSourceStrategy:54 - Post-Scan Filters: isnotnull(fieldname#11),(name#11 = my_filter_condition)
2020-07-17 14:23:27 INFO  FileSourceStrategy:54 - Output Data Schema: struct<metadata-sdk_timestamp: string, name: string, object_data-_scheduled_rental_requests_enabled: boolean, object_data-account_group: string, object_data-action: string ... 423 more fields>
2020-07-17 14:23:27 INFO  FileSourceScanExec:54 - Pushed Filters: IsNotNull(name),EqualTo(fieldname,my_filter_condition)
2020-07-17 14:23:27 INFO  SparkContext:54 - Invoking stop() from shutdown hook
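
That last line means stop() was triggered from a JVM shutdown hook, i.e. the process had already begun exiting on its own rather than Spark reporting a failure. A minimal diagnostic sketch (my own addition, not part of the original job) that registers another shutdown hook to dump thread states when the mystery exit begins:

import scala.collection.JavaConverters._

sys.addShutdownHook {
  // runs alongside Spark's own hook; shows what every thread was doing at exit
  println("JVM shutting down; thread states:")
  Thread.getAllStackTraces.keySet.asScala.foreach { t =>
    println(s"${t.getName}: ${t.getState}")
  }
}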

When the same job runs successfully against a different dataset, it produces logs like:

2020-07-17 15:24:13 INFO  FileSourceStrategy:54 - Pruning directories with: 
2020-07-17 15:24:13 INFO  FileSourceStrategy:54 - Post-Scan Filters: isnotnull(name#11),(fieldname#11 = my_filter_condition)
2020-07-17 15:24:13 INFO  FileSourceStrategy:54 - Output Data Schema: struct<metadata-sdk_timestamp: string, name: string, object_data-account_group: string, object_data-action: string, object_data-additional_mileage_charge: double ... 289 more fields>
2020-07-17 15:24:13 INFO  FileSourceScanExec:54 - Pushed Filters: IsNotNull(name),EqualTo(fieldname,my_filter_condition)
2020-07-17 15:24:13 DEBUG S3AFileSystem:60 - op_exists += 1  ->  3
2020-07-17 15:24:13 DEBUG S3AFileSystem:60 - op_get_file_status += 1  ->  48
2020-07-17 15:24:13 DEBUG S3AFileSystem:1782 - Getting path status for s3a:

From the logs we can see that in the first case the job dies right where the filter is applied.
The code snippet is as follows:

// read parquet from S3, merging schemas across files
val dataFrame = sparkSession.read
  .option("mergeSchema", "true")
  .format("parquet")
  .load("s3a://path")        // placeholder path; real bucket redacted
println(dataFrame.count())   // works: the count is printed
dataFrame.printSchema()      // works: the schema is printed
val print_data_frame = dataFrame.select("name")
print_data_frame.show(100)   // job shuts down here with no error
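
A minimal sketch of how one might instrument the failing action, assuming the standard Spark Scala API; scala.util.Try only catches non-fatal errors, so a plain try/catch on Throwable is used to surface even fatal ones (e.g. OutOfMemoryError) before the shutdown hook runs:

try {
  dataFrame.select("name").show(100)
} catch {
  case t: Throwable =>
    // fatal errors normally bypass Spark's logging; print before rethrowing
    t.printStackTrace()
    throw t
}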

Reading from S3 works fine and the count is printed, but the job shuts down at the select statement. I've tried many things with no luck. What could be the cause of the failure? I turned logging up to debug level, but it doesn't say why the filter step fails. Any clue or hint would be great.
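
Since the failing run merges a much wider schema ("423 more fields" vs. "289 more fields" in the successful run), one way to test whether schema merging is the trigger is to read only the needed column with an explicit schema, skipping mergeSchema entirely. A hedged sketch; the path is the same placeholder as above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// only the column the select needs; no mergeSchema, so no wide merged struct
val nameOnly = StructType(Seq(StructField("name", StringType, nullable = true)))
val slim = sparkSession.read
  .schema(nameOnly)
  .parquet("s3a://path")   // placeholder path
slim.select("name").show(100)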

No answers yet!
