hive group by contain repeat结果

nsc4cvqm  于 2021-06-02  发布在  Hadoop
关注(0)|答案(2)|浏览(353)

我的hql是:

select day,app_id,platform,count(1) as dau from (select ad.day,ads.app_id,ads.platform,ad.deviceid from mobile_ad_space ads inner join (select day,deviceid,flightid from mobile_day_adlog where day='$STAT_DAY') ad on cast(ads.space_id as string)=ad.flightid   group by ad.day,ads.app_id,ads.platform,ad.deviceid) day_active group by day,app_id,platform having dau>5;

但结果包含重复数据:

| day        | appid | platform | dau|

| 2016-06-29 |   1 | ios      | 70533 |

| 2016-06-29 |   1 | android  | 49307 |

| 2016-06-29 |   1 | android  | 49307 |

| 2016-06-29 |   1 | android  | 49307 |

我也尝试过其他hql:

SELECT day_active.day  ,day_active.app_id  ,day_active.platform  ,count(1) AS dau FROM (  SELECT day_device.day AS day   ,day_device.app_id AS app_id   ,day_device.platform AS platform   ,day_device.deviceid AS deviceid  FROM     (SELECT ad.day AS day   ,ads.app_id AS app_id   ,ads.platform AS platform   ,ad.deviceid AS deviceid  FROM mobile_ad_space ads  INNER JOIN (   SELECT day    ,deviceid    ,flightid   FROM mobile_day_adlog   WHERE day = '$STAT_DAY'   ) ad ON cast(ads.space_id AS string) = ad.flightid) day_device                GROUP BY day_device.day   ,day_device.app_id   ,day_device.platform   ,day_device.deviceid  ) day_active GROUP BY day_active.day  ,day_active.app_id  ,day_active.platform HAVING dau>5 ORDER BY day_active.day  ,day_active.app_id  ,day_active.platform;

但它仍然有这个问题
有人能帮我吗?

juud5qan

juud5qan1#

最后,我找到了解决办法。
在我的hql中,我使用 insert overwrite directory '$HQL_OUT_PATH' 覆盖输出。但是它看起来不稳定,导致重复项。因此,我在hql之前清理了输出路径,结果现在就出来了~!

mlnl4t2r

mlnl4t2r2#

尝试检查值是否包含空格,请使用group by中的trim()。还要按列检查每组有多少不同的值: select distinct platform ... 等等

相关问题