Spark SQL not able to read HDFS subfolders recursively of a Hive table (Spark 2.4.6)

Posted by mwyxok5s on 2022-12-09 in HDFS

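We are trying to read a Hive table with Spark SQL, but the query returns no records (the output shows 0 records). On checking, we found that the table's HDFS files are stored in multiple subdirectories, as shown below: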

hive> [hadoop@ip-10-37-195-106 CDPJobs]$ hdfs dfs -ls /its/cdp/refn/cot_tbl_cnt_hive/     
Found 18 items     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/1     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/10     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/11     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/12     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/13     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/14     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/15

We tried setting the following properties in the spark-defaults.conf file, but the problem persists.

set spark.hadoop.hive.supports.subdirectories = true;    
set spark.hadoop.hive.mapred.supports.subdirectories = true;     
set spark.hadoop.hive.input.dir.recursive=true;     
set mapreduce.input.fileinputformat.input.dir.recursive=true;          
set recursiveFileLookup=true;            
set spark.hive.mapred.supports.subdirectories=true;         
set spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true;
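One thing worth checking, if the statements above were pasted into spark-defaults.conf literally: that file expects one `key value` pair per line rather than SQL-style `set key=value;` statements, and spark-submit ignores keys that do not start with `spark.`. The sketch below converts such statements into spark-defaults.conf lines and reports the keys that would be ignored. The helper name `to_spark_defaults` is hypothetical and only illustrates the conversion.

```python
import re

# Matches statements such as "set spark.foo.bar = true;"
_SET_STMT = re.compile(r"^\s*set\s+([^\s=]+)\s*=\s*(.*?)\s*;?\s*$", re.IGNORECASE)


def to_spark_defaults(statements):
    """Convert `set key=value;` statements to spark-defaults.conf lines.

    Returns (lines, ignored): `lines` holds "key value" entries for keys
    starting with "spark.", and `ignored` holds the remaining keys, which
    spark-submit would not pick up from spark-defaults.conf.
    """
    lines, ignored = [], []
    for stmt in statements:
        match = _SET_STMT.match(stmt)
        if match is None:
            continue  # not a `set` statement; skip it
        key, value = match.group(1), match.group(2)
        if key.startswith("spark."):
            lines.append(f"{key} {value}")
        else:
            ignored.append(key)
    return lines, ignored
```

Applied to the list above, `mapreduce.input.fileinputformat.input.dir.recursive` and `recursiveFileLookup` end up in `ignored`; the `spark.hadoop.`-prefixed variants are the ones Spark copies into the Hadoop configuration.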

Does anyone know how to solve this? We are using Spark 2.4.6.

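Update (solution found):

I changed this property to false, and Spark can now read data from the subdirectories.
set [property name not recoverable from the source] = false;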
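The update does not legibly name the property that was set to false. As an assumption only, two real Spark SQL settings that are often turned off in this situation are `spark.sql.hive.convertMetastoreOrc` and `spark.sql.hive.convertMetastoreParquet`: when they are false, Spark reads ORC or Parquet Hive tables through the Hive SerDe path instead of its built-in reader, and the Hive path can honor the recursive-directory settings. A minimal sketch follows; the helper name `use_hive_serde_reader` is hypothetical.

```python
# ASSUMPTION: these keys are candidates, not confirmed by the thread.
# Only the one matching the table's storage format has any effect.
HIVE_SERDE_OVERRIDES = {
    "spark.sql.hive.convertMetastoreOrc": "false",
    "spark.sql.hive.convertMetastoreParquet": "false",
}


def use_hive_serde_reader(spark, overrides=HIVE_SERDE_OVERRIDES):
    """Apply the overrides to an existing SparkSession and return it.

    `spark` only needs to expose `spark.conf.set(key, value)`, which is
    the runtime configuration API of a SparkSession.
    """
    for key, value in overrides.items():
        spark.conf.set(key, value)
    return spark
```

In a spark-sql shell the equivalent would be a `set key=false;` statement for whichever key matches the table's file format.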

Source: https://stackoverflow.com/questions/70446178/spark-sql-not-able-to-read-hdfs-subfolders-recursively-of-a-hive-table-spark

1 Answer

#1 zxlwwiss

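from pyspark.sql import SparkSession

# Each setting gets its own .config(key, value) call. Passing conf= together
# with a key/value pair makes PySpark ignore the pair, so conf= is dropped here.
sparkSession = (
    SparkSession.builder
    .appName('USS - Unified Scheme of Sells')
    .config("hive.metastore.uris", "thrift://probighhwm001:9083")
    # Ask the Hive/MapReduce input formats to descend into subdirectories.
    .config("hive.input.dir.recursive", "true")
    .config("hive.mapred.supports.subdirectories", "true")
    .config("hive.supports.subdirectories", "true")
    .config("mapred.input.dir.recursive", "true")
    .enableHiveSupport()
    .getOrCreate()
)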
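If the Hive-level settings still return no rows, one possible fallback is to bypass the metastore and load the files directly with a glob path that matches every subdirectory. This is a sketch under assumptions: the helper names are hypothetical, and the file format is a parameter because the thread does not state how the table is stored. Note that the `recursiveFileLookup` reader option was only added in Spark 3.0, so it is not available on 2.4.6.

```python
def subdirectory_glob(table_location):
    """Return a glob matching everything one level below table_location."""
    return table_location.rstrip("/") + "/*"


def read_subdirectories(spark, table_location, file_format):
    """Load the files under each immediate subdirectory of table_location.

    `spark` is an existing SparkSession. `file_format` (for example "orc"
    or "parquet") must match how the table's files were written, which is
    an assumption the caller has to supply.
    """
    return spark.read.format(file_format).load(subdirectory_glob(table_location))
```

For the table in the question this would be called as `read_subdirectories(sparkSession, "/its/cdp/refn/cot_tbl_cnt_hive", "orc")`, with `"orc"` replaced by the table's actual format.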

Answered on 2022-12-09
