How can I use HiveQL to get the ngram string array and estfrequency as separate elements in a Hive table?

Posted by xqk2d5yq on 2021-05-30 in Hadoop

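I am analyzing my own tweets and have loaded the data into a Hive table using the Hive JSON SerDe. I want to find the frequency of every two-word phrase in my tweets, in tabular form. The output should look something like this: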

phrase            frequency
["the","room"]    1248.0
["a","boy"]       1039.0
["rt","to"]       1032.0
["to","ct"]       986.0

Right now I can do this for all single-word phrases, which gives output like this:

phrase     frequency
["the"]    1248.0
["a"]      1039.0
["rt"]     1032.0
["to"]     986.0
["you"]    828.0

My code for the single-word output is:

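-- Holds the single row returned by context_ngrams: an array of (ngram, estfrequency) structs.
create table ng(new_ar array<struct<ngram:array<string>,estfrequency:double>>);

-- array(null) has one blank slot, so this collects the top 100 single words.
INSERT OVERWRITE TABLE ng
SELECT context_ngrams(sentences(lower(text)),array(null),100) as word
FROM tweets;

create table wordFreq (ngram array<string>,  estfrequency double);

-- Explode the array so that each n-gram becomes its own row.
INSERT OVERWRITE TABLE wordFreq
SELECT X.ngram, X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z as X;

select * from wordFreq;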

How can I modify the code above to get the desired output?

Answer #1 (7rfyedvj)

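The modification below puts the two words in separate columns. You can concatenate them afterwards if you need a single phrase column.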

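-- Requires ng to hold 2-grams, i.e. context_ngrams called with array(null,null);
-- with 1-grams, X.ngram[1] would be NULL.
create table wordFreq (word1 string, word2 string,  estfrequency double);

INSERT OVERWRITE TABLE wordFreq
SELECT X.ngram[0],X.ngram[1], X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z as X;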
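
If a single phrase column is preferred over two word columns, the concatenation mentioned above could be sketched as follows. This is only an illustration: the table name phraseFreq is hypothetical, and it assumes ng has already been populated with 2-grams. Hive's built-in concat_ws accepts an array<string>, so the whole ngram array can be joined in one call.

```sql
-- Hypothetical table name; not part of the original question or answer.
CREATE TABLE phraseFreq (phrase string, estfrequency double);

-- Assumes ng was filled using array(null, null), so each X.ngram has two words.
-- concat_ws joins the elements of the array with a single space.
INSERT OVERWRITE TABLE phraseFreq
SELECT concat_ws(' ', X.ngram) AS phrase, X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z AS X;

SELECT * FROM phraseFreq;
```

Each row would then hold a phrase such as "the room" as plain text, rather than as an array or as two separate columns.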

Answer #2 (wlwcrazw)

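To switch the code from 1-grams to 2-grams, change array(null) to array(null,null). Each null in the context array is a blank slot to be filled, so two nulls make context_ngrams return the most frequent two-word sequences.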
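
Applied to the question's code, the change might look like the sketch below. It assumes the ng and wordFreq tables already exist with the definitions given in the question; the ORDER BY in the final query is an addition for readability, not part of the original.

```sql
-- Two blank slots in the context array, so the top 100 two-word phrases are collected.
INSERT OVERWRITE TABLE ng
SELECT context_ngrams(sentences(lower(text)), array(null, null), 100) AS word
FROM tweets;

-- Unchanged from the question: one row per n-gram with its estimated frequency.
INSERT OVERWRITE TABLE wordFreq
SELECT X.ngram, X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z AS X;

-- Added for readability: most frequent phrases first.
SELECT * FROM wordFreq ORDER BY estfrequency DESC;
```

Because the context here consists only of blanks, Hive's ngrams function, called as ngrams(sentences(lower(text)), 2, 100), should be an equivalent alternative.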
