How to run a Hive query on multiple directories with multiple files

Posted by fbcarpbf on 2021-06-02 in Hadoop


I want to run a count Hive query over multiple directories, each containing multiple files. The file paths look like this:

2011/01/01/file20110101_01.csv
2011/01/01/file20110101_02.csv
2011/01/02/file20110201_01.csv
2011/01/02/file20110201_02.csv

and so on.
I created an external table with the following partitions:

create external table table1(col1,col2...)
  partitioned by (year string,month string)
  STORED AS TEXTFILE;

I added partitions only down to the month level:

ALTER TABLE partition_test_production1 ADD PARTITION(year='2011', month='01')
LOCATION 'blob path/2011/01/*/file201101*.csv';

I tried this query:

select count(1) from table1 where year='2011' AND month='01';

But the count comes back as zero. Any suggestions?

sql hadoop Hive hiveql azure-hdinsight

Source: https://stackoverflow.com/questions/23550115/how-to-run-a-hive-query-on-multiple-dirs-with-multiple-files

2 Answers


h6my8fg2 #1

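You don't have to add every file individually, but you do have to add every leaf directory individually. When you add a directory, Hive reads all the files in that directory but none of the files in its subdirectories. For example: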

create external table table1(col1,col2...)
partitioned by (year string, month string, day string)
STORED AS TEXTFILE;

ALTER TABLE table1 ADD PARTITION(year='2011', month='01', day='01')
LOCATION 'hdfs:///path/2011/01/01/';

ALTER TABLE table1 ADD PARTITION(year='2011', month='01', day='02')
LOCATION 'hdfs:///path/2011/01/02/';

and so on for each remaining day.

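Typically you'd have a bash script or something similar do this: loop over all the directories in HDFS and generate the Hive statement that adds each partition. I'm no bash guru, but as an example: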

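# List the day-level directories (path/YYYY/MM/DD). Keep only directory entries,
# which start with "d" in the listing, so "Found N items" headers are skipped.
hadoop fs -ls 'hdfs:///path/*/*' | grep '^d' | while read -r line; do
  # The path is the last field, so its final three /-separated parts are year, month, day.
  year="$(echo "$line" | awk -F/ '{print $(NF-2)}')"
  month="$(echo "$line" | awk -F/ '{print $(NF-1)}')"
  day="$(echo "$line" | awk -F/ '{print $(NF)}')"
  hive -e "alter table table1 add partition(year='$year', month='$month', day='$day') location 'hdfs:///path/$year/$month/$day'"
done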
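
Starting a separate `hive` process for every directory can be slow when there are many days. One possible variation, sketched here under the same assumptions as the loop above (a table named `table1` and data under `hdfs:///path`), is to generate all the statements first and run them in one session with `hive -f`. The function name `gen_add_partitions` and the file name `add_partitions.hql` are made up for illustration, and `IF NOT EXISTS` is added only so that reruns do not fail on partitions that are already registered.

```shell
#!/usr/bin/env bash
# Sketch only: gen_add_partitions is a hypothetical helper, not part of Hive or Hadoop.
# It reads `hadoop fs -ls` output on stdin and prints one ALTER TABLE statement
# for each directory line (lines starting with "d"), taking year/month/day from
# the last three path components.
gen_add_partitions() {
  local table="$1" base="$2"
  awk -F/ -v t="$table" -v b="$base" -v q="'" '
    /^d/ {
      y = $(NF-2); m = $(NF-1); d = $NF
      printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION(year=%s%s%s, month=%s%s%s, day=%s%s%s) LOCATION %s%s/%s/%s/%s%s;\n",
             t, q, y, q, q, m, q, q, d, q, q, b, y, m, d, q
    }'
}

# Usage (needs a cluster, so it is left commented out here):
# hadoop fs -ls 'hdfs:///path/*/*' | gen_add_partitions table1 hdfs:///path > add_partitions.hql
# hive -f add_partitions.hql
```

Writing the statements to a file first also makes it easy to inspect them before anything touches the metastore.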

There seem to be some old JIRA tickets about making table/partition locations more flexible, but none of them have been resolved.

2021-06-03

drkbr07n #2

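You don't actually need to create the partitions manually. If you have already created an external table and the data resides in its directory, you can run msck repair table table_name and it will load all the partitions automatically.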
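
One caveat worth checking before relying on this: as far as I know, `MSCK REPAIR TABLE` only discovers partitions whose directories follow Hive's `key=value` naming (for example `year=2011/month=01/day=01`), so the plain `2011/01/01` layout from the question would probably not be picked up as is. A rough sketch, where `to_hive_layout` is a hypothetical helper that only prints the directory name Hive would expect and does not move any data:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Hive or Hadoop command): map a plain YYYY/MM/DD
# path (no trailing slash) to the key=value directory name that
# MSCK REPAIR TABLE can recognize, assuming partition columns named
# year, month and day.
to_hive_layout() {
  echo "$1" | awk -F/ '{ printf "year=%s/month=%s/day=%s\n", $(NF-2), $(NF-1), $NF }'
}

to_hive_layout "2011/01/01"

# Once the directories under the table location use that naming, the repair
# can be run from the shell (needs a working Hive install, so commented out):
# hive -e "MSCK REPAIR TABLE table1;"
```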

2021-06-03