Python UDF for Apache Pig failing

Posted by pdsfdshx on 2021-06-21 in Pig


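I'm using Pig for a homework assignment. I have already computed all the values it requires, but I need to output them in a specific format, so I wrote a UDF in Python. It is passed a bag of tuples {(id: int,tfidf: double)} (Pig's documentation doesn't say exactly how a bag appears in Python, but from the examples I'm guessing it is an iterable of tuples) and it returns a chararray. The actual code is: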

@outputSchema('doclist:chararray')
def format_list(docs):
  outs = []
  for docid, tfidf in docs:
    outs.append('{0}:{1}'.format(docid, tfidf))
  return '\t'.join(outs)

It is called from:

tfidf = FOREACH (GROUP tfsWithNDocs BY token) {
    idf = LOG((double)totaldocs.total / (double)ndocs);
    ranked = FOREACH tfsWithNDocs GENERATE id, tf * idf AS tfidf;
    ordered = ORDER ranked BY tfidf DESC;
    relevant = LIMIT ordered 20;
    GENERATE group AS token, funs.format_list(relevant) AS relevant;
};
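
For reference, the UDF can be exercised outside Pig with a hand-built bag. This is only a sketch: the `outputSchema` stand-in and the `sample_bag` values are assumptions made for local testing, and a plain list of tuples only approximates what Pig actually passes in. A local interpreter may also support features that the Jython bundled with Pig does not (for example, `str.format` was added in Python 2.6), so a passing local run does not prove the UDF will work inside Pig.

```python
# Pig normally provides outputSchema when it loads a Jython UDF.
# This fallback is an assumption that lets the file run outside Pig too.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema('doclist:chararray')
def format_list(docs):
    outs = []
    for docid, tfidf in docs:
        outs.append('{0}:{1}'.format(docid, tfidf))
    return '\t'.join(outs)


# Hypothetical sample bag: a list of (id, tfidf) tuples.
sample_bag = [(1, 0.5), (2, 0.25)]
result = format_list(sample_bag)
print(repr(result))
```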

When I run the script, it fails with:

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error executing function
    at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:120)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    ... (several hadoop calls)

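There is no hint about the actual Python exception.
If I don't pass the data to my UDF and instead store it as a bag, everything works fine.
What is wrong with this code?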
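
Since the stack trace hides the underlying Python exception, one way to surface it might be to wrap the UDF body and return the traceback as the output string. This is a debugging sketch, not a fix: `format_list_debug` and the `PYTHON ERROR:` prefix are hypothetical names, the `outputSchema` fallback exists only so the file also runs outside Pig, and Java-side exceptions raised under Jython may not be caught by `except Exception`.

```python
import traceback

# Fallback so the file also runs outside Pig (assumption for local testing).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


def _format_list(docs):
    # Same logic as the UDF in the question.
    outs = []
    for docid, tfidf in docs:
        outs.append('{0}:{1}'.format(docid, tfidf))
    return '\t'.join(outs)


@outputSchema('doclist:chararray')
def format_list_debug(docs):
    # Return the Python traceback as the result instead of letting the
    # exception propagate, so it shows up in the stored output.
    try:
        return _format_list(docs)
    except Exception:
        return 'PYTHON ERROR: ' + traceback.format_exc()


good = format_list_debug([(1, 0.5)])
bad = format_list_debug([(1, 0.5, 'extra')])  # malformed tuple on purpose
```

Calling `funs.format_list_debug(relevant)` in place of `funs.format_list(relevant)` and inspecting the stored `relevant` column would then show the Python-side error text, if one is raised.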

hadoop python user-defined-functions apache-pig jython

Source: https://stackoverflow.com/questions/20545809/python-udf-for-apache-pig-failing