Hive GROUP BY with partitions

Posted by l5tcr1uw on 2021-06-27 in Hive

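I partition data in Hive by a column value (the date), so each date has its own directory under /warehouse. I currently have about 240 dates, with 70 million records in total, evenly distributed across the dates.
I also created another table containing the same data without partitioning.
When I run the same query against both tables, the partitioned table does not always perform better than the non-partitioned one. More specifically, the partitioned table is slower when the query uses GROUP BY.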

select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27';

This took 22.146 seconds, and the count was 7427366.

select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27';

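This took 22.723 seconds and also returned a count of 7427366.
However, once GROUP BY is added, the partitioned table performs worse than the non-partitioned one.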

select count(*) from not_partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;

This took 39.733 seconds and returned 26724 rows.

select count(*) from partitioned_table where date > '2018-07-27' and date < '2018-08-27' group by col_name;

This took 76.648 seconds and returned 26724 rows.
Why is the partitioned table slower in this case?
Edit
This is how the partitioned table was created:

CREATE TABLE all_ads_from_csv_partitioned3(
id STRING,
...
)
PARTITIONED BY(datedecoded STRING)
STORED AS ORC;

And under 2018-10-08 15:34 /warehouse/tablespace/managed/hive/partitioned_table/ there are 240 directories (240 partitions), each of the form /warehouse/tablespace/managed/hive/partitioned_table/dated='the partitioned date', and each partition contains about 10 buckets.
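
One possible way to start narrowing this down, offered as a sketch and not as a diagnosis, is to check how many partitions Hive sees and to compare the plans of the two GROUP BY queries. The table names and col_name come from the question. The column name dated is an assumption taken from the partition directory names (the queries write it as date and the CREATE TABLE statement shows datedecoded), so substitute the real partition column.

```sql
-- Sketch only. `dated` is an ASSUMED column name (from the directory names
-- dated=...); replace it with the actual partition column of your table.

-- 1. List the partitions Hive has registered (the question reports 240).
SHOW PARTITIONS partitioned_table;

-- 2. Show which tables and partitions the query would read, to check
--    whether the date filter prunes partitions as expected.
EXPLAIN DEPENDENCY
SELECT count(*)
FROM partitioned_table
WHERE dated > '2018-07-27' AND dated < '2018-08-27'
GROUP BY col_name;

-- 3. Print the full plan for each table and compare the stages,
--    especially the scan and the reducers feeding the GROUP BY.
EXPLAIN
SELECT count(*)
FROM partitioned_table
WHERE dated > '2018-07-27' AND dated < '2018-08-27'
GROUP BY col_name;

EXPLAIN
SELECT count(*)
FROM not_partitioned_table
WHERE dated > '2018-07-27' AND dated < '2018-08-27'
GROUP BY col_name;
```

Going by the figures in the question, 70 million rows over 240 partitions is roughly 291,667 rows per partition, and with about 10 buckets each that is about 2,400 files of roughly 29,000 rows. The number and size of the files each plan reads is therefore one thing worth comparing, though the plans alone may not explain the difference.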

Hive Partition

Source: https://stackoverflow.com/questions/52698462/hive-groupby-slower-with-partition
