I have a script that loads some data about venues:
venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);
Then I want to create a UDF whose constructor accepts that venues relation.
So I tried to define the UDF like this:
DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);
Here is the actual UDF:
public class GenerateVenues extends EvalFunc<Tuple> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();
    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;
    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()){
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue: venues)
            {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS))
                {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}
When I run it, the script fails at the DEFINE statement, right before (venues);, with this error:
2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60> mismatched input 'venues' expecting RIGHT_PAREN
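Obviously I am doing something wrong. Can you help me figure out what? Is it that a UDF cannot accept the venues relation as a parameter, or that a relation is not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!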
PS: I am using Pig version 0.11.1.1.3.0.0-107.
2 Answers
Answer #1 (k97glaaz)
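You cannot use a relation as an argument to a UDF constructor. Only strings can be passed as constructor arguments; if a value is really of another type, you have to parse it inside the constructor.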
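To illustrate the "strings only, parse in the constructor" rule, here is a minimal plain-Java sketch. It deliberately does not depend on the Pig jars so it can be compiled on its own; in a real UDF the class would extend EvalFunc<Tuple>. The class name StringArgMatcher, the argument layout, and the venue names are hypothetical and not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Plain-Java stand-in for a UDF whose constructor takes only strings.
 * Assumption: in Pig the arguments would come from a statement such as
 *   DEFINE MatchVenues com.example.StringArgMatcher('Alpha Hall,Beta Arena', 'true');
 * where the package, class name and arguments are hypothetical.
 */
class StringArgMatcher {
    private final List<String> venues = new ArrayList<>();
    private final boolean ignoreCase;

    // Every constructor argument arrives as a String and is parsed here.
    StringArgMatcher(String commaSeparatedVenues, String ignoreCase) {
        for (String name : commaSeparatedVenues.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                venues.add(trimmed);
            }
        }
        this.ignoreCase = Boolean.parseBoolean(ignoreCase);
    }

    // Returns the first configured venue contained in the tweet, or null if none matches.
    String firstMatch(String tweet) {
        if (tweet == null) {
            return null;
        }
        String haystack = ignoreCase ? tweet.toLowerCase(Locale.ROOT) : tweet;
        for (String venue : venues) {
            String needle = ignoreCase ? venue.toLowerCase(Locale.ROOT) : venue;
            if (haystack.contains(needle)) {
                return venue;
            }
        }
        return null;
    }
}
```

Passing the venue list inline like this only scales to a handful of names; for a whole file of venues, a file-based approach is the better fit.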
Answer #2 (jaql4c8m)
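As @winnienicklaus already said, you can only pass strings to a UDF constructor.
That said, the solution to your problem is to use the distributed cache. You need to override
public List<String> getCacheFiles()
so that it returns the list of file names to make available through the distributed cache. The file can then be read as a local file and used to build your table. The downside is that Pig has no initialization hook, so you have to implement the one-time setup yourself, for example in a guarded initialization method, and then call it from exec.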
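The lazy-initialization idea can be sketched as follows. This is a plain-Java approximation with no Pig dependency, so the names (CachedVenueMatcher, init, the file layout) are assumptions rather than the answerer's code. In a real UDF the class would extend EvalFunc<Tuple>, and getCacheFiles() would, as far as I know, return entries of the form "/hdfs/path/venues.csv#venues_local", where the part after # is the local symlink name to open. The sketch also assumes the venue name is the first comma-separated field on each line and contains no quoted commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a UDF that builds its lookup table on first use. */
class CachedVenueMatcher {
    private final String localFileName; // hypothetical: the symlink name of the cached file
    private List<String> venues;        // stays null until init() has run

    // The constructor only receives a string, so it just remembers the file name.
    CachedVenueMatcher(String localFileName) {
        this.localFileName = localFileName;
    }

    // Guarded one-time setup: reads the local file and builds the venue list.
    private void init() throws IOException {
        if (venues != null) {
            return; // already initialized
        }
        List<String> loaded = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localFileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: first field is the venue name, no quoted commas.
                String name = line.split(",", 2)[0].trim();
                if (!name.isEmpty()) {
                    loaded.add(name);
                }
            }
        }
        venues = loaded;
    }

    // Mirrors the role of exec(): initialize lazily, then look for a venue in the tweet.
    String exec(String tweet) throws IOException {
        init();
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

Using String.contains instead of the question's regex-based matches call avoids treating venue names as regular expressions, which is a design choice of this sketch rather than something the answer prescribes.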