Pig: passing a DataBag to a UDF constructor

Posted by kkbh8khc on 2021-06-21 in Pig

I have a script that loads some data about venues:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);

Then I want to create a UDF whose constructor accepts the venues data.
So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);

Here is the actual UDF:

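import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {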

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()){
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue: venues)
            {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS))
                {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When executed, the script fails at the DEFINE statement, just before (venues);, with:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60>  mismatched input 'venues' expecting RIGHT_PAREN

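Clearly I am doing something wrong. Could you help me figure out what? Is it that a UDF cannot accept the venues relation as a constructor argument? Or is a relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!
PS: I am using Pig version 0.11.1.1.3.0.0-107.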

k97glaaz #1

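You cannot use a relation as an argument to a UDF constructor. Only strings can be passed as constructor arguments; if a value is really of another type, you have to parse it inside the constructor.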
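To make the "parse it inside the constructor" part concrete, here is a minimal sketch of a constructor that receives one delimited string and turns it into a list of venue names. It is deliberately plain Java so it runs without Pig on the classpath; in a real UDF the class would extend EvalFunc<Tuple> and the lookup would live in exec. The class name VenueMatcher, the '|' delimiter, and the venue names in the comment are assumptions made for illustration, and the substring check is a simplification of the regex used in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Stand-in for a UDF with a String constructor (hypothetical class name).
// A real Pig UDF would extend EvalFunc<Tuple>; that is omitted here so the
// sketch compiles and runs on its own.
final class VenueMatcher {
    private final List<String> venues = new ArrayList<>();

    // DEFINE arguments reach the constructor as plain strings, for example:
    //   DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues('Hydro|Barrowland');
    // The '|' delimiter and the venue names are assumptions, not from the question.
    VenueMatcher(String delimitedVenues) {
        for (String name : delimitedVenues.split(Pattern.quote("|"))) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                venues.add(trimmed);
            }
        }
    }

    // Returns the first venue whose name occurs in the text, or null if none does.
    // Uses a plain substring check instead of the regex from the question.
    String firstMatch(String text) {
        if (text == null) {
            return null;
        }
        for (String venue : venues) {
            if (text.contains(venue)) {
                return venue;
            }
        }
        return null;
    }

    List<String> venues() {
        return venues;
    }
}
```

This only works while the list is short enough to write into the script; for a whole file of venues, passing the file name as the string is the more practical choice.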

jaql4c8m #2

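As @winnienicklaus already said, you can only pass strings to a UDF constructor.
That said, the solution to your problem is to use the distributed cache: override public List<String> getCacheFiles() so that it returns the list of file names to be made available through the distributed cache. You can then read the file as a local file and build your table from it.
The downside is that Pig has no initialization hook, so you have to implement something like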

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // make sure the table is only read once
    }
}

and then call it from exec.
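The lazy-initialization part of this can be sketched as a small helper that the UDF would own. This is not a complete Pig UDF: it is plain Java so that it runs standalone, and the Pig wiring is only described in the comments. The class name VenueTable, the symlink name venues_table, and the assumption that the venue name is the first comma-separated field are all illustrative choices; the line splitting is naive and does not handle quoted CSV fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper holding the venue table; loads it on first use.
final class VenueTable {
    private final String localPath;
    private final List<String> venues = new ArrayList<>();
    private boolean initialized = false;

    // In the UDF, getCacheFiles() would return an entry such as
    //   hdfsPath + "#venues_table"
    // and localPath would then be "./venues_table".
    // The symlink name is an assumption made for this sketch.
    VenueTable(String localPath) {
        this.localPath = localPath;
    }

    // Reads the file once; later calls return immediately.
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumes the venue name is the first comma-separated field.
                String name = line.split(",", 2)[0].trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Intended to be called from exec(): loads the table if needed, then
    // returns the first venue named in the text, or null if none is found.
    String firstMatch(String text) throws IOException {
        init();
        if (text == null) {
            return null;
        }
        for (String venue : venues) {
            if (text.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

Keeping the file reading out of the constructor matters because Pig may also instantiate the UDF on the front end, where the cached file is not present; deferring the read to the first exec call avoids failing there.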
