How to optimize a join?

Posted by egdjgwm8 on 2021-06-26 in Hive


I have a query that joins these tables. How can I optimize it to run faster?

val q = """
          | select a.value as viewedid,b.other as otherids
          | from bm.distinct_viewed_2610 a, bm.tets_2610 b
          | where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
          |   select value from bm.distinct_viewed_2610)
          |""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)

Table descriptions:

hive> desc distinct_viewed_2610;
OK
value                   string

hive> desc tets_2610;
OK
id                      int                                         
other                   string

The data looks like this:

hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989

and

hive> select * from tets_2610 limit 2;
OK

1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
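The `distinct_viewed_2610` table has 1.1 million records. I am trying to get the matching ids from the `tets_2610` table, which has 200,000 rows, by splitting its second column.
For 100,000 records, the job takes 8.5 hours on two machines: one with 16 GB RAM and 16 cores, the other with 8 GB RAM and 8 cores.
Is there a way to optimize the query?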
![](https://i.stack.imgur.com/1Rv6u.png)

Hive apache-spark apache-spark-sql query-optimization

Source: https://stackoverflow.com/questions/46970552/how-to-optimize-a-join

2 answers


hc2pp10m1#

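Right now you are doing a Cartesian join. A Cartesian join gives you 1.1m × 200k = 220 billion rows, and only after that are the rows filtered by `where FIND_IN_SET(a.value, b.other) != 0`. Analyze your data. If the `other` string contains 10 elements on average, exploding it gives about 2 million rows from table b (200k × 10). If we suppose only 1/10 of those rows join, the inner join leaves you with 2m / 10 = 200k rows.
If these assumptions are correct, exploding the array and joining will perform better than the Cartesian join plus filter: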

select distinct a.value as viewedid, b.otherids
  from bm.distinct_viewed_2610 a
       inner join (select e.otherid, b.other as otherids 
                     from bm.tets_2610 b
                          lateral view explode (split(b.other ,',')) e as otherid
                  )b on a.value=b.otherid

You do not need this:

and a.value in (select value from bm.distinct_viewed_2610)

Sorry, I cannot test the query; please test it yourself.
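Since the rewritten query above was not tested, one way to build confidence in it is to compare both strategies on tiny in-memory data. The following is a minimal sketch in plain Scala collections, not Spark or Hive code: the `Viewed` and `Tets` case classes, the object name, and the sample rows are hypothetical stand-ins for the two tables, and the `split(",").contains(...)` check only approximates `FIND_IN_SET(...) != 0` for non-null values.

```scala
object ExplodeJoinSketch {
  // Hypothetical stand-ins for rows of distinct_viewed_2610 and tets_2610.
  final case class Viewed(value: String)
  final case class Tets(id: Int, other: String)

  // Original approach: pair every a-row with every b-row (Cartesian product),
  // then keep pairs where a.value is one of the comma-separated ids in b.other.
  def crossJoinFilter(a: Seq[Viewed], b: Seq[Tets]): Set[(String, String)] =
    (for {
      x <- a
      y <- b
      if y.other.split(",").contains(x.value) // approximates FIND_IN_SET(...) != 0
    } yield (x.value, y.other)).toSet

  // Rewritten approach: explode b.other into (otherid, other) pairs once,
  // then match each otherid against a hash set of a.value (an equi-join).
  def explodeJoin(a: Seq[Viewed], b: Seq[Tets]): Set[(String, String)] = {
    val viewedIds: Set[String] = a.map(_.value).toSet
    val exploded: Seq[(String, String)] =
      b.flatMap(row => row.other.split(",").map(id => (id, row.other)))
    exploded.collect { case (id, other) if viewedIds(id) => (id, other) }.toSet
  }

  def main(args: Array[String]): Unit = {
    val a = Seq(Viewed("1033346511"), Viewed("1013425393"), Viewed("1014099507"))
    val b = Seq(
      Tets(1, "103973207,1013425393,1013812066,1014099507"),
      Tets(2, "1033759023")
    )
    // Both strategies should produce the same distinct (viewedid, otherids) pairs.
    println(crossJoinFilter(a, b) == explodeJoin(a, b))
    explodeJoin(a, b).foreach(println)
  }
}
```

The nested loop does work proportional to |a| × |b|, while the exploded version does one set lookup per id in `other`. That difference is what the row-count estimate in this answer relies on; in Spark or Hive the actual gain also depends on how the equi-join is planned and shuffled.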

2021-06-26

5uzkadbs2#

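If you are using the ORC format, consider switching to Parquet. Based on your data, I would suggest range partitioning.
Choose an appropriate level of parallelism so the job executes quickly.
I have answered a related question, listed below, which may help you.
Exchanging partitions that are already correctly distributed
Please also read: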
http://dev.sortable.com/spark-repartition/

2021-06-26
