How do I create a schema for a file with no delimiters in Pig?

xe55xuns  posted on 2021-05-30  in  Hadoop

I have CDRs (call detail records) of the following type:

068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012
068373748102208100167682477351905149071PLAN1MTCCUST20600000001312:15:0901/01/2012
068373748102208100167682477351905149071PLAN1SMSCUST10613637193012:18:1801/01/2012
068373748102208100167682477351905149071PLAN1SMSCUST10612899062012:21:0701/01/2012

I have to load this file with Pig, using the following schema:

MSIDN:IMSI:IMEI:PLAN:CALL_TYPE:CORRESP_TYPE:CORRESP_ISDN:DURATION:TIME:DATE

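I know the length of each field, but I can't figure out how to load the data in the right format. Here are the required field lengths, starting from the first column: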

13
15
12
5
3
5
11
1
hh:mm:ss
dd/mm/yyyy
mm9b1k5b  1#

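You can look at FixedWidthLoader in piggybank to load position-delimited (fixed-width) files. I have used it to load files similar to the one you mention here.
For example, we can specify the column positions and the column schema as follows:
    a = LOAD 'inputfile.txt' USING org.apache.pig.piggybank.storage.FixedWidthLoader('1-6,7-5', 'WRITE_HEADER', 'col1:chararray,col2:chararray');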
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/fixedwidthloader.html
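
For the CDR layout in the question, the column spec could be generated from the field widths instead of being typed by hand. The sketch below assumes that FixedWidthLoader expects comma-separated, 1-indexed, inclusive `start-end` ranges (as in Unix `cut`); that should be checked against the Javadoc linked above. `WIDTHS` and `build_column_spec` are hypothetical names, and 8 and 10 stand in for the hh:mm:ss and dd/mm/yyyy fields.

```python
# Field widths from the question; 8 and 10 are the lengths of
# "hh:mm:ss" and "dd/mm/yyyy".
WIDTHS = [13, 15, 12, 5, 3, 5, 11, 1, 8, 10]


def build_column_spec(widths):
    """Return 'start-end,start-end,...' using 1-indexed inclusive ranges.

    Assumption: this is the range syntax FixedWidthLoader accepts.
    """
    ranges = []
    start = 1
    for width in widths:
        end = start + width - 1          # inclusive end of this field
        ranges.append("%d-%d" % (start, end))
        start = end + 1                  # next field begins right after
    return ",".join(ranges)


spec = build_column_spec(WIDTHS)
print(spec)
# 1-13,14-28,29-40,41-45,46-48,49-53,54-64,65-65,66-73,74-83
```

The resulting string would take the place of '1-6,7-5' as the first argument to the loader, with a matching ten-field schema string as the last argument.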

fjnneemd  2#

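One possible solution is to load the file with the normal Pig loader and then use a UDF to extract the columns. I'll try to put the code together and post it tonight. As promised: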

    ]$ more cdr.txt
    068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012
    068373748102208100167682477351905149071PLAN1MTCCUST20600000001312:15:0901/01/2012
    068373748102208100167682477351905149071PLAN1SMSCUST10613637193012:18:1801/01/2012
    068373748102208100167682477351905149071PLAN1SMSCUST10612899062012:21:0701/01/2012

    ]$ more cdr.py
    # Jython UDF: return the characters of input from index start up to,
    # but not including, index nc (a plain Python slice).
    def mysubstr(input,start,nc):
        return input[start:nc]

    ]$ more cdr.pig
    REGISTER 'cdr.py' using jython as mysubstr;
    -- The data contains no tabs, so the default loader returns each
    -- whole line as a single chararray field.
    A = LOAD 'cdr.txt' AS (inp:chararray);
    -- Emit the full line followed by three slices of it.
    B = FOREACH A GENERATE
        inp,
        mysubstr.mysubstr(inp,0,13),
        mysubstr.mysubstr(inp,14,29),
        mysubstr.mysubstr(inp,30,42);
    DUMP B;

Output:

    (068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012,0683737481022,810016768247735,905149071PLA)
    (068373748102208100167682477351905149071PLAN1MTCCUST20600000001312:15:0901/01/2012,0683737481022,810016768247735,905149071PLA)
    (068373748102208100167682477351905149071PLAN1SMSCUST10613637193012:18:1801/01/2012,0683737481022,810016768247735,905149071PLA)
    (068373748102208100167682477351905149071PLAN1SMSCUST10612899062012:21:0701/01/2012,0683737481022,810016768247735,905149071PLA)
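
The offsets in cdr.pig were typed by hand, and the characters at index 13 and 29 fall between the slices. One way to avoid that is to derive contiguous slice boundaries from the field widths. This is a minimal sketch in plain Python, not a drop-in replacement for the UDF: `WIDTHS` and `split_fixed` are hypothetical names, and the record in the usage lines is synthetic, built to match the stated widths exactly rather than copied from cdr.txt.

```python
# Field widths from the question (hh:mm:ss = 8, dd/mm/yyyy = 10).
WIDTHS = [13, 15, 12, 5, 3, 5, 11, 1, 8, 10]


def split_fixed(line, widths):
    """Cut line into consecutive fields of the given widths."""
    fields = []
    start = 0
    for width in widths:
        end = start + width              # exclusive end of this field
        fields.append(line[start:end])
        start = end                      # next field starts where this one ended
    return fields


# Synthetic record: one letter per field, repeated to fill the field.
record = "".join(ch * w for ch, w in zip("abcdefghij", WIDTHS))
fields = split_fixed(record, WIDTHS)
print(fields[0], fields[3], fields[-1])
```

Because each field starts exactly where the previous one ended, joining the fields back together reproduces the input line, which is an easy sanity check on real data.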

aelbi1ox  3#

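PigStorage (the default load and store function) does not handle this case. You need to write your own load function. Using PigStorage as a model, it should not be too difficult: instead of looking for a field delimiter, you parse the fields by length and then use standard string functions to trim whitespace.
Read this: http://pig.apache.org/docs/r0.7.0/udf.html#store+functions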
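
The parsing step of such a loader (cut by length, then trim) can be prototyped outside Pig before writing the actual load function. The following is only a Python sketch of that step using the standard `struct` module, not a Pig load function. `WIDTHS`, `NAMES`, `RECORD` and `parse_record` are hypothetical names, the field names are taken from the schema in the question, and the sample values are made up.

```python
import struct

# Widths and names from the question (hh:mm:ss = 8, dd/mm/yyyy = 10).
WIDTHS = [13, 15, 12, 5, 3, 5, 11, 1, 8, 10]
NAMES = ["MSIDN", "IMSI", "IMEI", "PLAN", "CALL_TYPE", "CORRESP_TYPE",
         "CORRESP_ISDN", "DURATION", "TIME", "DATE"]

# "=13s15s12s...": one fixed-length byte string per field, no alignment padding.
RECORD = struct.Struct("=" + "".join("%ds" % w for w in WIDTHS))


def parse_record(line):
    """Split one fixed-width line by length and trim padding from each field."""
    raw = line.rstrip("\r\n").encode("ascii")
    if len(raw) != RECORD.size:
        raise ValueError("expected %d characters, got %d" % (RECORD.size, len(raw)))
    values = [part.decode("ascii").strip() for part in RECORD.unpack(raw)]
    return dict(zip(NAMES, values))


# Made-up values, space-padded on the right to the full field width.
sample = "".join(v.ljust(w) for v, w in zip(
    ["0600000000", "208100000000000", "350000000000", "PLAN1", "MOC",
     "CUST1", "0611111111", "5", "12:07:12", "01/01/2012"], WIDTHS))
row = parse_record(sample + "\n")
print(row["MSIDN"], row["TIME"], row["DATE"])
```

Rejecting lines whose length differs from the expected record size is a design choice in this sketch; it surfaces a mismatch between the assumed widths and the real data early instead of silently shifting every later field.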
