Hive: filtering differently written names

fykwrbwg  posted on 2021-05-29  in Hadoop

I'm stuck filtering some movie titles. My problem is that the same movie appears under many differently written titles, for example:

Movies:                                       Visitors:
    Breaking Dawn Part 1+2                    100
    Breaking Dawn 1+2                         40
    Breaking Dawn 1 + 2                       30
    Dark Knight trilogy 3D                    100
    Dark Knight trilogy  3D                   40
    Dark Knight Trilogy HDF                   30
    Dark Knight Trilogy -HDF                  100
    Dark Knight trilogy_ (blank)              44
    etc. +10000

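So there are many movie titles that are not named uniquely, and some also have trailing whitespace. I can solve the problem to some extent, but there are still a lot of titles that mean the same thing and are written differently. Currently my query is: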

SELECT  regexp_replace(rtrim(allcinemadata.title)," - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)", "")
as clearTitle
FROM default.allcinemadata
group by
regexp_replace(rtrim(allcinemadata.title)," - 3D | - 3D |3D |3D| 3D| - (3D) | - (3D) |(3D)", "")

I'm not sure whether that is the best way to solve this problem. I hope you can help me! :)
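As a point of comparison, the regex cleanup described above could be written as one normalization step followed by an aggregation. This is only a minimal sketch: `visitors` is an assumed column name (the question shows a "Visitors" column but not its identifier), and the pattern covers only the trailing 3D/HDF tags, stray punctuation, case, and whitespace differences visible in the sample.

```sql
-- Sketch only: `visitors` is an assumed column name, and the pattern
-- handles just the suffix variants shown in the sample data.
SELECT clean_title,
       sum(visitors) AS total_visitors
FROM (
  SELECT trim(
           regexp_replace(
             -- lower-case, then drop a trailing 3D/HDF tag together with
             -- any dashes, underscores, parentheses, or spaces around it
             regexp_replace(lower(title), '[-_ (]*(3d|hdf)?[) ]*$', ''),
             '\\s+', ' '            -- collapse repeated whitespace
           )
         ) AS clean_title,
         visitors
  FROM default.allcinemadata
) t
GROUP BY clean_title;
```

Variants such as "Part 1+2" versus "1 + 2" are not merged by this pattern and would need extra rules of their own.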

ffscu2ro  #1

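I could not test this against a larger dataset, but it works for the sample data you provided in the question. It uses the soundex function on the movie name to group similar titles and get their total views; the rest of the SQL is self-explanatory.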

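WITH movviews
     AS (SELECT moviename,
                totalviews,
                -- rank 1 = the row carrying the largest sum in its Soundex group
                Rank()
                  OVER (
                    partition BY Soundex(moviename)
                    ORDER BY totalviews DESC) rnk
         FROM   (SELECT moviename,
                        -- with ORDER BY this is a running sum per Soundex group;
                        -- the last title in moviename order holds the group total
                        Sum(views)
                          OVER (
                            partition BY Soundex(moviename)
                            ORDER BY moviename) AS totalviews,
                        views
                 FROM   movieviews
                 ORDER  BY moviename) vv)
-- keep one representative title per Soundex group
SELECT movviews.moviename,
       movviews.totalviews
FROM   movviews
WHERE  rnk = 1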

Output

movviews.moviename             movviews.totalviews
Breaking Dawn Part 1+2          170
Dark Knight trilogy_ (blank)    314
Time taken: 62.257 seconds, Fetched: 2 row(s)
hive (default)>
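
Before relying on the Soundex grouping for the full dataset, it may help to look at which titles end up sharing a code. The query below is a small sketch against the same `movieviews` table and columns used in the answer; it assumes a Hive version that provides `soundex` (1.2.0 or later) and uses the built-in `collect_set` aggregate.

```sql
-- Sketch: list every Soundex code with the titles it groups together,
-- so unintended merges are easy to spot.
SELECT Soundex(moviename)     AS code,
       collect_set(moviename) AS titles,
       Sum(views)             AS totalviews
FROM   movieviews
GROUP  BY Soundex(moviename);
```

Soundex keeps only the first letter plus three digits derived from the following consonants, so titles that begin alike collapse into one group. That is what merges the "Breaking Dawn" variants here, but a hypothetical unrelated title such as "Breaking Bad" should receive the same code and would be merged as well.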
