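I am analyzing my own tweets and using hivejson-serde to load the data into a Hive table. I want to find the frequency of every two-word phrase in my tweets, in tabular form. The output should look something like this: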
phrase frequency
["the","room"] 1248.0
["a","boy"] 1039.0
["rt","to"] 1032.0
["to","ct"] 986.0
Right now I can do this for all one-word phrases, which gives output like this:
phrase frequency
["the"] 1248.0
["a"] 1039.0
["rt"] 1032.0
["to"] 986.0
["you"] 828.0
My code for the one-word phrase output is:
create table ng(new_ar array<struct<ngram:array<string>,estfrequency:double>>);
INSERT OVERWRITE TABLE ng
SELECT context_ngrams(sentences(lower(text)),array(null),100) as word
FROM tweets;
create table wordFreq (ngram array<string>, estfrequency double);
INSERT OVERWRITE TABLE wordFreq
SELECT X.ngram, X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z as X;
select * from wordFreq;
How can I modify the code above to get the desired output?
2 Answers
7rfyedvj1#
The modification below gives the two words in separate columns. You can then concatenate them.
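The code for this answer is not reproduced in the thread, so the following is only a sketch of what such a query might look like, not the answerer's actual code. It assumes the `ng` table from the question has already been populated with 2-grams, so that each `ngram` array holds two words. The column aliases `word1`, `word2`, and `phrase` are hypothetical names chosen for illustration.

```sql
-- Sketch only: assumes ng.new_ar already contains 2-grams.
-- word1, word2 and phrase are illustrative aliases, not from the original post.
SELECT X.ngram[0]              AS word1,   -- first word of the pair
       X.ngram[1]              AS word2,   -- second word of the pair
       concat_ws(' ', X.ngram) AS phrase,  -- both words joined with a space
       X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z AS X;
```

Here `concat_ws` takes the separator followed by the `array<string>`, so the two words come back as a single string such as `the room`.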
wlwcrazw2#
To change the code from 1-grams to 2-grams, replace
array(null)
with
array(null,null)
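Applied to the statements in the question, that change might look like the sketch below. It reuses the `ng` and `wordFreq` tables exactly as the question defines them. The only difference is the context array, which now has two empty slots, so `context_ngrams` estimates the top 100 two-word sequences.

```sql
-- Same pipeline as in the question, with a two-slot context array
-- so that the top 100 two-word sequences are estimated.
INSERT OVERWRITE TABLE ng
SELECT context_ngrams(sentences(lower(text)), array(null, null), 100) AS word
FROM tweets;

-- Flatten the array of structs into one row per 2-gram.
INSERT OVERWRITE TABLE wordFreq
SELECT X.ngram, X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z AS X;

SELECT * FROM wordFreq;
```

Because every slot in the context is null, this should behave much like Hive's built-in `ngrams(sentences(lower(text)), 2, 100)`, which may be a simpler alternative if no fixed context word is needed.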