如何使用sparksql识别配置单元表中的分区列

6yjfywim 于 2021-06-27 发布在 Hive

关注(0)|答案(1)|浏览(502)

我正在尝试使用spark识别配置单元表中的分区列名。我可以使用show partitions然后解析resultset来提取分区列。但是，缺点是，如果某些故事中没有分区，show分区失败了。有没有更有机的方法来标识配置单元表中的分区列名。任何帮助都将不胜感激

v_query="show partitions {}".format(table_name)
a=self.spark.sql(v_query)
val=a.rdd.map(list).first()
val1=''.join(val)
partition_list=[l.split('=')[0] for l in val1.split('/')]

Hive python apache-spark pyspark pyspark-sql

来源：https://stackoverflow.com/questions/58071279/how-to-identify-the-partition-columns-in-hive-table-using-spark-sql

1条答案

按热度按时间

9gm1akwq1#

如果表未分区，上述代码将失败。它会给你一个错误信息，如 "pyspark.sql.utils.AnalysisException: u'SHOW PARTITIONS is not allowed on a table that is not partitioned" 您可以在上使用Map操作 desc 命令获取分区列信息。

>>> spark.sql("""desc db.newpartitiontable""").show(truncate=False)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|city                   |string   |null   |
|state                  |string   |null   |
|country                |string   |null   |
|tran_date              |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|tran_date              |string   |null   |
+-----------------------+---------+-------+

partition_exists=spark.sql("""desc db.newpartitiontable""").rdd.map(lambda x:x[0]).filter(lambda x: x.startswith("# col_name")).collect()

if len(partition_exists)>0:
    print("table is partitioned")
    partition_col=spark.sql("""show partitions test_dev_db.newpartitiontable""").rdd.map(lambda x:x[0]).map(lambda x : [l.split('=')[0] for l in x.split('/')]).first()
    print("Partition column is:",partition_col)
else:
    print("table is not partitioned")

赞(0）回复(0）举报 2021-06-27

我来回答

如何使用sparksql识别配置单元表中的分区列

1条答案

相关问题

热门标签

最新问答