I need help figuring out why my Spark job, which uses the spark-bigquery connector, fails at the step that writes to the temporary GCS bucket. Below is the Spark code that tries to write the data to a BigQuery table:
import org.apache.spark.sql.SaveMode

outDF.write
  .format("bigquery")
  .option("temporaryGcsBucket", "bq_temporary_folder")
  .option("parentProject", "user_project")
  .option("table", "user.destination_table")
  .mode(SaveMode.Overwrite)
  .save()
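For reference, the SparkSession is created roughly like this, assuming the standard gcs-connector filesystem registration keys (a sketch; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bq-write")
  // Standard gcs-connector filesystem registration (assumed setup).
  .config("spark.hadoop.fs.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
    "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .getOrCreate()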
Here is the error log:
Caused by: java.lang.IllegalArgumentException: Wrong bucket: bq_temporary_folder, in path: gs://bq_temporary_folder/.spark-bigquery-application_1605725797163_1966-6f6cfc35-543b-4138-ae0e-649ce7c2ae56, expected bucket: null
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:581)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:454)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:144)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:103)
at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:43)
at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:442)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:175)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:405)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I am using the Spark BigQuery connector spark-bigquery-with-dependencies_2.12-0.18.0.jar and the GCS connector gcs-connector-hadoop2-2.0.0-shaded.jar.
I don't think this is a permissions issue, because the same Spark job can write to the same GCS bucket by calling rdd.saveAsTextFile(), as in the sketch below.
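For example, a plain RDD write to the same bucket succeeds (illustrative only; sc is the SparkContext and the output path is a placeholder):

// A direct RDD write to the same bucket works, which is why
// I don't suspect permissions. Path and contents are placeholders.
sc.parallelize(Seq("sanity", "check"))
  .saveAsTextFile("gs://bq_temporary_folder/sanity_check_output")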
When I looked at the GoogleHadoopFileSystem code, the "expected bucket: null" part of the message appears because the rootBucket field in the GoogleHadoopFileSystemBase class was never set. But I don't understand how that field is supposed to be initialized in the first place; as far as I can tell, it should be populated from the path's bucket when the filesystem is initialized, which can be exercised by hand as in the sketch below.
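As a sanity check, this is roughly how the normal initialization path can be triggered manually (assumes the cluster's default GCS credentials; makeQualified is the same call that throws in the stack trace):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// FileSystem.get() invokes GoogleHadoopFileSystem.initialize(uri, conf),
// which is where rootBucket should get populated from the URI.
val conf = new Configuration()
val fs = FileSystem.get(new URI("gs://bq_temporary_folder/"), conf)
// The same makeQualified call that fails in the stack trace:
println(fs.makeQualified(new Path("gs://bq_temporary_folder/some/key")))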
Someone suggested that there might be a GCS connector version mismatch between my Spark job and the Spark cluster. But even after confirming that the Spark job is using the GCS connector library configured for the cluster (checked roughly as in the sketch below), the job still fails.
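This is roughly how I checked which gcs-connector jar the class is actually loaded from, on the driver and on an executor (spark here is the SparkSession; illustrative only):

import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

// Jar that provides GoogleHadoopFileSystem on the driver.
val driverJar = classOf[GoogleHadoopFileSystem]
  .getProtectionDomain.getCodeSource.getLocation
println(s"driver gcs-connector jar: $driverJar")

// Same check on an executor, to rule out a driver/executor mismatch.
val executorJar = spark.sparkContext.parallelize(Seq(1), 1).map { _ =>
  classOf[GoogleHadoopFileSystem]
    .getProtectionDomain.getCodeSource.getLocation.toString
}.collect().head
println(s"executor gcs-connector jar: $executorJar")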
At this point I'm out of ideas. Thanks in advance for any advice or help.