Select a single random sample from each group in Hive

Posted by rnmwe5a2 on 2021-06-27 in Hive


I have a table like this:

Name      Age       Num_Hobbies     Num Shoes
Jane      31        10              2
Bob       23        3               4
Jane      60        2               200
Jane      31        100             6
Bob       10        8               7
etc etc

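I want to group this table by Name and Age and pick one random row per group, keeping the values of the remaining columns from that row.
In pandas, I would do the following: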

df.groupby(['Name', 'Age']).apply(lambda x: x.sample(n=1))

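In Hive, I know how to create the groups, but not how to select a single random sample from each group.
I saw this question on Stack Overflow: "How to sample for each group in Hive?"
However, I don't understand how to apply dynamic partitioning or Hive bucketing to select a single sample from a group.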

Hive group-by Random

Source: https://stackoverflow.com/questions/55463232/select-single-random-sample-from-group-by-in-hive

1 Answer


bogh5gae1#

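You can use rank() or row_number() together with rand(). The window function needs the `over` keyword, and row_number() guarantees exactly one row per group even if two rand() values tie:

```
select *
from (
  -- number the rows of each (name, age) group in random order;
  -- your_table is a placeholder for the real table name
  select s.*, row_number() over (partition by name, age order by rand()) as rn
  from your_table s
) t
where rn = 1
```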
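
To try the window-function approach without a Hive cluster, here is a minimal sketch in Python that runs the same query shape against an in-memory SQLite database. This is only a stand-in, under several assumptions: the table name `people`, the column names `num_hobbies` and `num_shoes`, and the helper `sample_one_per_group` are hypothetical, SQLite's `random()` takes the place of Hive's `rand()`, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows from the question: (name, age, num_hobbies, num_shoes).
rows = [
    ("Jane", 31, 10, 2),
    ("Bob", 23, 3, 4),
    ("Jane", 60, 2, 200),
    ("Jane", 31, 100, 6),
    ("Bob", 10, 8, 7),
]

# Same shape as the Hive query: number the rows of each (name, age) group
# in random order, then keep only the first row of every group.
# `people` is a hypothetical table name; random() is SQLite's analogue of rand().
QUERY = """
SELECT name, age, num_hobbies, num_shoes
FROM (
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY name, age ORDER BY random()) AS rn
    FROM people AS p
) AS t
WHERE rn = 1
ORDER BY name, age
"""


def sample_one_per_group(data):
    """Load `data` into an in-memory table and return one random row per group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE people "
            "(name TEXT, age INTEGER, num_hobbies INTEGER, num_shoes INTEGER)"
        )
        conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", data)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in sample_one_per_group(rows):
        print(row)
```

With the sample data, the (Jane, 31) group has two candidate rows, so repeated runs may return either one; every other group has a single row and always returns it.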

Answered on 2021-06-27
