I'm trying to save my Spark DataFrame (from a Zeppelin notebook running on EMR) to the Glue Catalog in the same AWS account. The saveAsTable() method works when I don't use bucketBy(). When I do use it, I get an UnknownHostException for a hostname that is not part of my EMR cluster. When I change the database name, a different hostname is reported.
My questions: where is that hostname configured, what is it used for, and why does bucketBy need it?
Thanks for your help. Averell
spark.sql("use my_database_1")
my_df.write.partitionBy("dt").mode("overwrite").bucketBy(10, "id").option("path","s3://my-bucket/").saveAsTable("my_table")
java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-10-10-10-71.ourdc.local
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:132)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:285)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:496)
at org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$createDataSourceTable(HiveExternalCatalog.scala:399)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:263)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:324)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:185)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:474)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:453)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:409)
... 47 elided
Caused by: java.net.UnknownHostException: ip-10-10-10-71.ourdc.local
... 87 more
1 Answer
My question turned out to involve two separate issues:
(1) where the hostname comes from;
(2) why the problem only shows up when using bucketBy.
For (1): our Glue database had been created with
spark.sql("create database mydb")
This creates a Glue database whose location is set to an HDFS path that, by default, points at the EMR master's IP address. 10.10.10.71 was the IP address of our old EMR cluster, which has since been terminated.
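To check where that default location actually points (a quick sketch; my_database_1 is just the database from my example above), you can ask Spark SQL for the database metadata:
// Shows the database metadata stored in the Glue catalog, including its location URI;
// that location is where the ip-10-10-10-71 host in the stack trace above comes from.
spark.sql("DESCRIBE DATABASE EXTENDED my_database_1").show(truncate = false)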
For (2): it seems that with bucketBy (and also sortBy), Spark needs some temporary space before writing to the final destination. The location of that temporary space is the database's default location, and the full path is <db_location>-<table_name>-__PLACEHOLDER__
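A rough sketch of why that surfaces as an UnknownHostException (the warehouse path below is made up, only the host matters; spark is the session already available in Zeppelin): resolving a FileSystem for the placeholder path under the database's stale HDFS location fails in exactly the same frames as the stack trace above.
import org.apache.hadoop.fs.Path
// Stale database location as stored in Glue: it points at the old, terminated EMR master.
val dbLocation = "hdfs://ip-10-10-10-71.ourdc.local:8020/user/spark/warehouse/my_database_1.db"
// Spark puts the temporary/placeholder table path under the database location.
val placeholder = new Path(dbLocation, "my_table-__PLACEHOLDER__")
// Getting a FileSystem for that path has to resolve the HDFS NameNode host,
// which fails with java.lang.IllegalArgumentException: java.net.UnknownHostException.
placeholder.getFileSystem(spark.sparkContext.hadoopConfiguration)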
The fix: for (1), the database's location in Glue needs to be updated. For (2), nothing needs to be done (and nothing can be done).
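One way to avoid this going forward (just a sketch; the database and bucket names below are placeholders, and an existing database can instead have its LocationUri changed in the Glue console or via the Glue UpdateDatabase API) is to give the database an explicit S3 location when it is created, so its default location never points at an EMR master:
// Create the Glue database with an explicit S3 location instead of the HDFS default,
// so bucketBy's temporary/placeholder path resolves to S3 rather than a dead NameNode.
spark.sql("CREATE DATABASE IF NOT EXISTS my_database_1 LOCATION 's3://my-bucket/my_database_1/'")
After that, the bucketBy write from the question should no longer need to reach the old host.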