Table schema inside Pig UDF

Posted by r7knjye2 on 2021-06-02 in Hadoop

I have to format the data in a flat file before loading it into a Hive table.

CF32|4711|00010101Z| +34.883|  98562AS1D |N8594ãä| 00   | 2

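The file is pipe-delimited, and I need to apply different cleaning and formatting functions to different columns of the flat file. I have multiple functions to clean text, format dates, format timestamps, format integers, and so on.
My idea is to pass the schema as a constructor argument to my UDF and call the different functions on the flat file from Pig.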

A = LOAD 'call_detail_records'  USING org.apache.hcatalog.pig.HCatLoader();
DESCRIBE A;

REGISTER ZPigUdfs.jar;
DEFINE DFormat com.zna.pig.udf.DataColumnFormatter(A);

B = FOREACH A GENERATE DFormat($0);
DUMP B;

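But how can I pass the schema? DUMP A actually dumps the whole table, but I only need the metadata. My current UDF pseudocode looks like this:
public class DataColumnFormatter extends EvalFunc<String> {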

private Tuple schema;

public DataColumnFormatter(Tuple schema) {
    this.schema = schema;
}

@Override
public String exec(Tuple inputTuple) throws IOException {

    if (inputTuple != null && inputTuple.size() > 0) {
        String inpString = inputTuple.get(0).toString();
        System.out.println(inpString);
        System.out.println(schema);

        /**
         * Logic for splitting the string as pipe and apply functions based
         * on positions of schema if(schema[1] -> date ){
         * 
         * formatDate(input) }else if(schema[1] -> INT ){
         * 
         * formatInt(input); }
         * 
         */

    }

    return null;
}

}
How can I get the schema inside a Pig UDF, or is there another way to achieve this?
Thanks in advance.

hadoop Hive udf apache-pig hcatalog

Source: https://stackoverflow.com/questions/30493143/table-schema-inside-pig-udf

1 Answer

rvpgvaaj1#

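Inside your EvalFunc you can call this.getInputSchema() (at least as of Pig v0.12, possibly earlier). You don't need to do anything special to pass in the schema: because you are loading from HCatalog, A is already decorated with it.
Alternatively, you could consider breaking this out into a separate UDF for each data type, like this: B = FOREACH A GENERATE dateFormat($0), cleanText($1), dateFormat($2);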
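
To make the per-column dispatch concrete, here is a minimal sketch in plain Java with no Pig dependency. It assumes the column kinds are already known as a list. Inside a real UDF that list would be built from the schema returned by getInputSchema() instead of being hard-coded. The class name ColumnFormatSketch, the Kind enum, and the helpers cleanText, formatInt and formatDecimal are hypothetical names made up for illustration, and the cleaning rules (drop non-ASCII characters, trim, normalize numbers) are only a guess at what the asker's formatters do.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnFormatSketch {

    /** Hypothetical column kinds; a real UDF would derive these from Pig's schema types. */
    public enum Kind { TEXT, INT, DECIMAL }

    /** Drops anything outside printable ASCII, then trims surrounding spaces. */
    public static String cleanText(String raw) {
        return raw.replaceAll("[^\\x20-\\x7E]", "").trim();
    }

    /** Parses the trimmed cell as an int, which normalizes values such as " 00 " to "0". */
    public static String formatInt(String raw) {
        return String.valueOf(Integer.parseInt(raw.trim()));
    }

    /** Parses the trimmed cell as a decimal; a leading plus sign is dropped. */
    public static String formatDecimal(String raw) {
        return new BigDecimal(raw.trim()).toPlainString();
    }

    /** Splits a pipe-delimited line and formats each cell according to its column kind. */
    public static List<String> format(String line, List<Kind> kinds) {
        // Limit -1 keeps trailing empty cells so the column count stays stable.
        String[] cells = line.split("\\|", -1);
        if (cells.length != kinds.size()) {
            throw new IllegalArgumentException(
                "expected " + kinds.size() + " columns, got " + cells.length);
        }
        List<String> out = new ArrayList<>(cells.length);
        for (int i = 0; i < cells.length; i++) {
            switch (kinds.get(i)) {
                case INT:
                    out.add(formatInt(cells[i]));
                    break;
                case DECIMAL:
                    out.add(formatDecimal(cells[i]));
                    break;
                default:
                    out.add(cleanText(cells[i]));
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed kinds for the sample row in the question; not taken from a real table.
        List<Kind> kinds = Arrays.asList(
            Kind.TEXT, Kind.INT, Kind.TEXT, Kind.DECIMAL,
            Kind.TEXT, Kind.TEXT, Kind.INT, Kind.INT);
        String line = "CF32|4711|00010101Z| +34.883|  98562AS1D |N8594\u00e3\u00e4| 00   | 2";
        System.out.println(format(line, kinds));
    }
}
```

Run against the sample row from the question, this prints [CF32, 4711, 00010101Z, 34.883, 98562AS1D, N8594, 0, 2]. The same helpers could also be wrapped one per UDF, which is the second option mentioned above. That keeps each function small, at the cost of listing every column explicitly in the FOREACH.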
