我有一个包含1000行和3个变量(id、country和字符串变量“var1”)的表。var1是由空格分隔的单词组成的句子。
我要按国家统计所有成对(或三胞胎)的单词。非常重要的是,夫妻(或三胞胎)是所有单词的交叉(不一定是一步一步)。也许,我们可以用一个ngrams hql函数来实现,但是当我使用它时,它是一步一步地计算单词,而不是所有的交叉。
让我们举个例子来告诉你我想要什么:
>**"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"
var1不一定是3个单词的长度。只是为了简化。
我想要一个2-ngrams的结果分为4个步骤:
第一步:最重要的一步:用两个字交叉
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1
第二步:点2克
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1
步骤3:按id、国家、2-ngrams区分
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion"
第四步:按国家统计,2-ngram
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1
非常感谢你
1条答案
按热度按时间csbfibhn1#