How to generate all n-grams in Hive

vmjh9lq9 · posted 2021-06-01 in Hadoop

I want to create a list of n-grams using HiveQL. My idea was to use a regular expression with a lookahead together with the split function, but this does not work (split discards the text matched by the delimiter pattern, so the first token of each pair is lost):

select split('This is my sentence', '(\\S+) +(?=(\\S+))');

The input is a column of the form:

|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

Hive does have a built-in n-grams UDF, but that function directly estimates n-gram frequencies, whereas I want a list of all n-grams.
Thanks in advance!
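For contrast, the built-in UDAF is called like this; it aggregates the k most frequent n-grams (with estimated frequencies) over the whole column, so it cannot produce the per-row lists shown above. A minimal sketch, assuming a table named mytable (hypothetical) with a string column sentence:

```sql
-- Returns an array of structs {ngram: array<string>, estfrequency: double}
-- for the 10 most frequent bigrams across ALL rows -- not a list per row.
select ngrams(sentences(lower(sentence)), 2, 10)
  from mytable;
```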

hjqgdpho

This may not be the most elegant solution, but it works well. Split the sentence on a delimiter (in my case one or more spaces or commas), explode the words with their positions, and self-join on consecutive positions to build the n-grams; then use collect_set (if you need unique n-grams) or collect_list:

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split and explode words
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos              
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
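For n > 2, the same self-join pattern extends with one extra join per additional token. A sketch for trigrams (n = 3), reusing the same src CTE; the third alias s3 joins on pos + 2:

```sql
with src as
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence',
                     'This is another sentence'
                     ) as sentence
      ) source_data
        --split and explode words with their positions
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence,
       collect_set(concat_ws(' ', s1.word, s2.word, s3.word)) as trigrams
  from src s1
       inner join src s2 on s1.sentence = s2.sentence and s1.pos + 1 = s2.pos
       inner join src s3 on s1.sentence = s3.sentence and s1.pos + 2 = s3.pos
group by s1.sentence;
```

Note that joining on the sentence text assumes sentences are distinct per row; if duplicates are possible, join on a row id instead.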
