From a collect_set to a wide table with 1s and 0s

Posted by 8cdiaqws on 2022-10-21 in Hive


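I have a large table with 1.7 million rows. One column is a list generated by collect_set. I would like to explode this list into a 1/0 boolean table.
Hive does not support PIVOT yet, so answers that rely on that function will not be accepted.
I have this table: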

id   | list_center                              |
-----|------------------------------------------|
0788 | []                                       |
0568 | ["Lorem"]                                |
0879 | ["Lorem","ipsum"]                        |
0025 | ["who", "exercise", "train"]             |
0365 | ["ipsum", "airplane", "tariff", "lorem"] |

Expected result:

id   | lorem | ipsum  | who | exercise | train | airplane | tariff |
-----|-------|--------|-----|----------|-------|----------|--------|
0788 |   0   |   0    |  0  |    0     |   0   |    0     |    0   |
0568 |   1   |   0    |  0  |    0     |   0   |    0     |    0   |
0879 |   1   |   1    |  0  |    0     |   0   |    0     |    0   |
0025 |   0   |   0    |  1  |    1     |   1   |    0     |    0   |
0365 |   1   |   1    |  0  |    0     |   0   |    1     |    1   |

Hive

Source: https://stackoverflow.com/questions/74057613/from-a-collect-set-to-a-wide-table-with-1-and-0

1 Answer


vmdwslir1#

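I'm not sure this fully answers the question, but I'll try to explain. I recreated your input table and worked on it with Spark SQL instead of HiveQL. Syntax is similar across the SQL family, so I hope you will find some useful ideas.
Basically, I had to "undo" your collect_set result (using explode). So you may prefer to pivot the dataset as it was before the collect_set transformation.
This first version does not create a row for id = 0788, because explode emits no rows for an empty array, but it is shorter.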

SELECT *
FROM (SELECT id, explode(list_center) list_center FROM Table)
PIVOT (
    count(1)
    FOR list_center IN ('lorem', 'ipsum', 'who', 'exercise', 'train', 'airplane', 'tariff')
)

+----+-----+-----+----+--------+-----+--------+------+
|  id|lorem|ipsum| who|exercise|train|airplane|tariff|
+----+-----+-----+----+--------+-----+--------+------+
|0365|    1|    1|null|    null| null|       1|     1|
|0568| null| null|null|    null| null|    null|  null|
|0879| null|    1|null|    null| null|    null|  null|
|0025| null| null|   1|       1|    1|    null|  null|
+----+-----+-----+----+--------+-----+--------+------+

To get the missing row back, I think you need a cross join.

WITH exploded AS (SELECT id, explode(list_center) list_center, 1 cnt FROM Table)
SELECT *
FROM (SELECT id from Table)
CROSS JOIN (SELECT DISTINCT list_center FROM exploded)
FULL JOIN exploded
USING (id, list_center)
PIVOT (
    coalesce(first(cnt), 0)
    FOR list_center IN ('lorem', 'ipsum', 'who', 'exercise', 'train', 'airplane', 'tariff')
)

+----+-----+-----+---+--------+-----+--------+------+
|  id|lorem|ipsum|who|exercise|train|airplane|tariff|
+----+-----+-----+---+--------+-----+--------+------+
|0365|    1|    1|  0|       0|    0|       1|     1|
|0788|    0|    0|  0|       0|    0|       0|     0|
|0568|    0|    0|  0|       0|    0|       0|     0|
|0879|    0|    1|  0|       0|    0|       0|     0|
|0025|    0|    0|  1|       1|    1|       0|     0|
+----+-----+-----+---+--------+-----+--------+------+
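
Since the question rules out PIVOT, one possible alternative is conditional aggregation over an exploded view, which uses only constructs available in HiveQL. The following is a minimal sketch, not something run against the asker's data: the table name `my_table` and the aliases `t`, `e`, and `item` are assumptions, while `id` and `list_center` come from the question. `LATERAL VIEW OUTER` keeps ids whose array is empty, and `lower()` folds "Lorem" and "lorem" into the same column, which the case-sensitive comparison in the pivot queries above does not do.

```sql
-- Assumption: my_table(id STRING, list_center ARRAY<STRING>); the name is hypothetical.
SELECT
    t.id,
    -- IF() returns 0 when the condition is false or NULL, so empty arrays yield all zeros.
    MAX(IF(lower(e.item) = 'lorem',    1, 0)) AS lorem,
    MAX(IF(lower(e.item) = 'ipsum',    1, 0)) AS ipsum,
    MAX(IF(lower(e.item) = 'who',      1, 0)) AS who,
    MAX(IF(lower(e.item) = 'exercise', 1, 0)) AS exercise,
    MAX(IF(lower(e.item) = 'train',    1, 0)) AS train,
    MAX(IF(lower(e.item) = 'airplane', 1, 0)) AS airplane,
    MAX(IF(lower(e.item) = 'tariff',   1, 0)) AS tariff
FROM my_table t
-- OUTER emits one row with item = NULL for an empty array instead of dropping the id.
LATERAL VIEW OUTER explode(t.list_center) e AS item
GROUP BY t.id;
```

Under these assumptions the query should return one row per id, including 0788, with 1/0 values rather than nulls. As with the pivot, the column list still has to be written out by hand.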

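In Oracle, when using pivot, we don't necessarily need to provide all the values; we can just write FOR list_center IN (). In Spark SQL this is not possible. Hopefully HiveQL is more flexible on this point.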
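
When the values have to be listed explicitly, a helper query can at least enumerate them so the column list can be written or generated from its output. This is only a sketch under the same naming assumptions as before (`my_table`, `t`, `e`, and `item` are hypothetical; `list_center` is from the question):

```sql
-- Lists every distinct value found in any list_center array, lower-cased
-- so that "Lorem" and "lorem" count as one value.
SELECT DISTINCT lower(e.item) AS item
FROM my_table t
LATERAL VIEW explode(t.list_center) e AS item;
```

Each returned value then becomes one entry in the IN (...) list of the pivot, or one output column in a hand-written query.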

2022-10-21
