What exactly am I doing wrong with my wordcount program? (Pig)

ecfsfe2w posted on 2021-06-24 in Pig


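I'm very new to Pig, and I'm trying to count and sort words with the punctuation stripped. I can DUMP D just fine; the problem comes when I try to DUMP E and get this error: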

[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias E

A = load './SherlockHolmes.txt' using PigStorage(' ');
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
C = group B by word;
D = foreach C generate COUNT(B) AS counts, group AS word;
E = ORDER D BY counts DESC;
DUMP E;

What am I doing wrong?

apache-pig

Source: https://stackoverflow.com/questions/17951375/what-exactly-am-i-doing-wrong-with-my-wordcount-program-pig

1 Answer


eyh26e7m1#

For this answer, I'll use the following as sample input:

Hello, my ;name is Holmes.
This is a test, of a question on SO.
Holmes, again.

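When I was first writing scripts, I found it helpful to DESCRIBE and DUMP every step against some sample data, so that I knew exactly what was going on. Doing that with your script shows: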

A = load './SherlockHolmes.txt' using PigStorage(' ');
-- Schema for A unknown.
-- (Hello,,my,;name,is,Holmes.)
-- (This,is,a,test,,of,a,question,on,SO.)
-- (Holmes,,again.)

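So A is a "tuple" (really a schema) with an unknown number of values. In general, if you don't know how many values a tuple will hold, you should use a bag instead.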

B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(LOWER((chararray)$0),'([A-Za-z]+)')) as word;
-- B: {word: bytearray}
-- ()
-- (this)
-- ()

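When you use $0 you are not referring to all the words in the schema, only the first one. So you are only applying LOWER and REGEX_EXTRACT_ALL to the first word of each line. The empty rows appear because REGEX_EXTRACT_ALL only returns a result when the pattern matches the entire string, so "hello," and "holmes," do not match '([A-Za-z]+)'. Also note that FLATTEN is being applied to a tuple, which does not produce the output you want. You want to FLATTEN a bag. C, D, and E should all work as you expect, so it is all about massaging the data into a format they can use.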
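To make the tuple-versus-bag difference concrete, here is a minimal sketch. It assumes the three-line sample input is saved at ./sample.txt (a hypothetical path), and the alias names are made up for illustration. STRSPLIT returns a tuple and TOKENIZE returns a bag, so the two FLATTEN calls behave differently:

```pig
-- Assumption: ./sample.txt holds the three sample lines shown earlier.
lines = LOAD './sample.txt' AS (line:chararray);

-- FLATTEN on a tuple: the fields are spread across columns of the SAME row,
-- so this relation still has one row per input line.
as_tuple = FOREACH lines GENERATE FLATTEN(STRSPLIT(line, ' '));

-- FLATTEN on a bag: every tuple in the bag becomes its OWN row,
-- so this relation has one row per token.
as_bag = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ' ')) AS word;

DESCRIBE as_tuple;
DESCRIBE as_bag;
DUMP as_bag;
```

Only the second form gives one word per row, which is the shape that GROUP ... BY word needs.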
Knowing this, you can do the following instead:

-- Load in the line as a chararray so that TOKENIZE can convert it into a bag
A = load './tests/sh.txt' AS (foo:chararray);

B1 = FOREACH A GENERATE TOKENIZE(foo, ' ') AS tokens: {T:(word: chararray)} ;
-- Output from B1:
-- B1: {tokens: {T: (word: chararray)}}
-- ({(Hello,),(my),(;name),(is),(Holmes.)})
-- ({(This),(is),(a),(test,),(of),(a),(question),(on),(SO.)})
-- ({(Holmes,),(again.)})

-- Now inside a nested FOREACH we apply the appropriate transformations.
B2 = FOREACH B1 {

    -- Inside a nested FOREACH you can go over the contents of a bag
    cleaned = FOREACH tokens GENERATE 
              -- The .*? are needed to capture the leading and trailing punc.
              FLATTEN(REGEX_EXTRACT_ALL(LOWER(word),'.*?([a-z]+).*?')) as word ;

    -- Cleaned is a bag, so when we FLATTEN it we get one word per line
    GENERATE FLATTEN(cleaned) ;
}

So now B2 is:

B2: {cleaned::word: bytearray}
(hello)
(my)
(name)
(is)
(holmes)
(this)
(is)
(a)
(test)
(of)
(a)
(question)
(on)
(so)
(holmes)
(again)

When this is fed into C, D, and E, it gives the desired output.
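For reference, a sketch of the whole script assembled from the pieces above. The only change to the original C and D is that they read from B2 instead of B; the input path is the one used in this answer and should be adjusted to your own file:

```pig
-- Load each line as a single chararray so TOKENIZE can turn it into a bag.
A = load './tests/sh.txt' AS (foo:chararray);

B1 = FOREACH A GENERATE TOKENIZE(foo, ' ') AS tokens: {T:(word: chararray)};

B2 = FOREACH B1 {
    -- Lower-case each token and keep only its first run of letters.
    cleaned = FOREACH tokens GENERATE
              FLATTEN(REGEX_EXTRACT_ALL(LOWER(word), '.*?([a-z]+).*?')) AS word;
    -- One word per row.
    GENERATE FLATTEN(cleaned);
}

-- Same as the original C, D, E, but grouping and counting B2.
C = GROUP B2 BY word;
D = FOREACH C GENERATE COUNT(B2) AS counts, group AS word;
E = ORDER D BY counts DESC;
DUMP E;
```

On the three-line sample, "is", "a", and "holmes" should each come out with a count of 2 and every other word with 1; the order among tied counts is not guaranteed. A token containing no letters at all (none occur in the sample) would not match the pattern and would likely come through as a null word, so on real text you may want to filter those out before grouping.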
Let me know if you need me to clarify anything.
