In Pig, how do I count the number of lines that contain a specific string?

Posted by roejwanj on 2021-06-02 in Hadoop


Suppose I have a set of target words:

a b c d

and an input file:

a d f s g e
12399
c a d i f
a 2

Then I should get back:

a 3
b 0
c 1
d 2

How can I do this in Pig? Thank you!

hadoop apache-pig

Source: https://stackoverflow.com/questions/40006751/in-pig-how-do-i-count-the-number-of-lines-that-contained-a-specific-string

1 Answer


bvuwiixz1#

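First remove the duplicate words from each line, then run a word count. Because each word then appears at most once per line, the word count equals the number of lines that contain the word.
Pig steps: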

REGISTER 'udf-1.0-SNAPSHOT.jar';
define tuple_set com.ts.pig.UniqueRecords();
data = load '<file>' using PigStorage();

Remove the duplicate words within each line, then count:

unique= foreach data generate tuple_set($0) as line;
words= foreach unique generate flatten(TOKENIZE(line,' ')) as word;
grouped = group words BY word;
count= foreach grouped GENERATE group, COUNT(words);
dump count;

Sample code for the Pig UDF:

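package com.ts.pig;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/**
 * This UDF removes duplicate words from a line.
 */
public class UniqueRecords extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0 || tuple.get(0) == null)
            return null;
        String[] splits = tuple.get(0).toString().split(" ");
        // A set keeps one copy of each word; its iteration order is unspecified.
        Set<String> elements = new HashSet<String>(Arrays.asList(splits));
        StringBuilder sb = new StringBuilder();
        for (String element : elements) {
            sb.append(element).append(' ');
        }
        return sb.toString();
    }
}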
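
Note that the Pig steps above count every word in the file and never emit a zero for a target word that is absent (such as b in the question). As a rough illustration of the expected result, the plain-Java sketch below applies the same "de-duplicate per line, then count" idea outside Pig, restricted to the target words. The class name TargetLineCount and the method countLines are hypothetical names chosen for this example; they are not part of the answer or of the Pig API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only (not part of Pig).
public class TargetLineCount {

    // For each target word, counts how many lines contain it at least once.
    static Map<String, Integer> countLines(List<String> targets, List<String> lines) {
        // Seed every target with 0 so that absent words still appear in the result.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : targets) {
            counts.put(target, 0);
        }
        for (String line : lines) {
            // De-duplicate the words of this line, as the UDF does.
            Set<String> unique = new HashSet<>(Arrays.asList(line.trim().split("\\s+")));
            for (String word : unique) {
                // Only target words are counted; everything else is ignored.
                if (counts.containsKey(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> targets = Arrays.asList("a", "b", "c", "d");
        List<String> lines = Arrays.asList("a d f s g e", "12399", "c a d i f", "a 2");
        // Prints one "word count" pair per line, in target order.
        countLines(targets, lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

For the sample input from the question this prints a 3, b 0, c 1, d 2, one pair per line. In Pig itself, the zero rows would need an extra step, for example joining the counts against a relation holding the target words; that step is not shown in the answer.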

2021-06-03
