Apache Spark: Apache Iceberg table not working with AWS Glue in AWS EMR

jhkqcmku  posted on 2023-01-02  in  Apache

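I am trying to load a table into a Spark EMR cluster from a Glue catalog. The table is in Apache Iceberg format and is stored in S3. The table was created correctly, because I can query it from AWS Athena. When creating the cluster, I set the following configuration: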

[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]

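I have tried running SQL queries from Spark against tables in other formats (CSV), and that works fine, but when I try to read the Iceberg table I get this error: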

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

This is the code in the notebook:

%%configure -f
{
"conf":{
    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.dev.type":"hadoop",
    "spark.sql.catalog.dev.warehouse":"s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

spark = SparkSession.builder.getOrCreate()

# This query works and shows the Iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)

# This query raises the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)

How can I read Apache Iceberg tables in an EMR cluster using Spark and the Glue catalog?
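For reference, here is a minimal sketch of the kind of setup the question is asking about. It is not a confirmed fix. It assumes the `dev` catalog should be backed by Glue through Iceberg's `org.apache.iceberg.aws.glue.GlueCatalog` and `org.apache.iceberg.aws.s3.S3FileIO` classes instead of `"type": "hadoop"`, and that the table is then addressed with the catalog prefix (`dev.iceberg_test.table_name`). The helper name `glue_iceberg_conf` is hypothetical; it only builds the dictionary that would go into the `%%configure -f` cell.

```python
import json


def glue_iceberg_conf(catalog: str, warehouse: str) -> dict:
    """Build Spark conf entries for an Iceberg catalog backed by AWS Glue.

    Hypothetical helper for illustration; the property values are an
    assumption about what this cluster needs, not a verified fix.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumption: use the Glue catalog implementation instead of "type": "hadoop"
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


conf = glue_iceberg_conf("dev", "s3://pyramid-streetfiles-sbx/iceberg_test/")

# JSON body for the notebook's %%configure -f cell
print(json.dumps({"conf": conf}, indent=4))

# Assumption: the query names the Iceberg catalog explicitly, so it is not
# resolved through the default session catalog
query = "select * from dev.iceberg_test.table_name limit 10"
```

With this configuration in place, the notebook would run `spark.sql(query).show(truncate=False)`. The unqualified name `iceberg_test.table_name` in the original code is resolved by Spark's default session catalog, which may be why the Hive `InputFormat` error appears.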
