pig脚本:count在null字段上返回0

w80xi6nr  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(427)

我有一个pig脚本,它通过json的“company”部分加载一个文件。当我执行计数时,如果文件中缺少域(或空),则计数为0。我怎样才能把它分组为空字符串,并且仍然计算它呢?
文件示例:

{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test1 company"}}
{"company": {"domain": "test1.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test2 company"}}
{"company": {"domain": "test2.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"domain": "test3.com", "name": "test3 company"}}
{"company": {"name": "test4 company"}}
{"company": {"name": "test4 company"}}

预期结果:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
"", "test4 company", 2

实际结果:

"test1.com", "test1 company", 2
"test1.com", "test2 company", 1
"test2.com", "test2 company", 1
"test2.com", "test3 company", 1
"test3.com", "test3 company", 2
, "test4 company", 0

当前Pig脚本:

data = LOAD'myfile' USINGorg.apache.pig.piggybank.storage.JsonLoader('company:   (domain:chararray, name:chararray)');
filtered = FILTER data BY (company is not null);
events = FOREACH filtered GENERATE FLATTEN(company) as (domain, name);
grouped = GROUP events BY (domain, name);
counts = FOREACH grouped GENERATE group as domain, COUNT(events) as count;
ordered = ORDER counts by count DESC;

谢谢你的帮助!

wnavrhmk

wnavrhmk1#

不是数数,而是数数星星,
counts=foreach grouped generate group as domain,count\u star(事件)as count;

相关问题