Pig: incorrect data handling when using piggybank jars

6yt4nkrj  posted on 2021-06-02  in  Hadoop

I have a file with the structure described below:
ID, Name, Address

1,"Amrit,kumar",India   
2,"Vaibhav,arora",USA   
3,"Deepika,kumar",Germany

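Obviously, if I use PigStorage(','), these three fields are split into four and the data spills over into the wrong columns. What I have tried:
I tried piggybank, but the problem persists and the data still spills over. Please find the script below:
A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() as (column:type)
I also tried the REPLACE function. I have 35k rows, but the change was not applied to all of them, so the data still spills over and column values are shifted into the next column. Please see the reference link below.
How to ignore " (double quotes) while loading a file in Pig?
I also tried CSVExcelStorage and CSVLoader.
Please suggest what I can do here. I want the name value in a single column.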
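
The splitting problem described in the question can be reproduced with a short sketch. The field names f1 to f4 are placeholders chosen for illustration, not names from the original script.

```pig
-- Naive load: PigStorage splits on every comma, including the one inside the quotes
raw = LOAD 'File.csv.gz' USING PigStorage(',')
      AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);

-- For the first sample row, f2 holds "Amrit (with the opening quote),
-- f3 holds kumar" (with the closing quote), and f4 holds the country
DESCRIBE raw;
DUMP raw;
```

Inspecting f2 and f3 separately shows that the quoted name occupies two fields, which is why the country ends up in a fourth column.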

#1 pxyaymoc

Tested this script with your data:

-- load as four fields
a = LOAD 'data.txt' using PigStorage(',');

-- removes double quotes from second and third fields; the address is the fourth field ($3)
b = foreach a generate $0 as id, REPLACE($1, '"', '') as firstname, REPLACE($2, '"', '') as lastname, $3 as address;

-- combines second and third field with a ',' in between
c = foreach b generate id,  CONCAT(firstname, ',', lastname) as name, address;

Now, the test result:

test = foreach c generate name;
dump test;
(Amrit,kumar)
(Vaibhav,arora)
(Deepika,kumar)

#2 l0oc07j2

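Load it into 4 fields, replace the quotes, add a space after the 2nd field, and finally concatenate the 2nd and 3rd fields to get the full name in one field/column. No external jar is needed.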

A = LOAD 'File.csv.gz' USING PigStorage(',') AS (f1:int,f2:chararray,f3:chararray,f4:chararray);
B = FOREACH A GENERATE 
            f1,
            CONCAT(REPLACE(f2,'\\"',''),' ') as f2, -- replace beginning quote and add space at end
            REPLACE(f3,'\\"','') as f3,             -- replace ending quote
            f4;
C = FOREACH B GENERATE 
            f1 as id,
            CONCAT(f2,f3) as name,
            f4 as country;
DUMP C;
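
The approach above relies on every quoted name containing exactly one comma. As a hedged alternative that is not part of the original answer, the sketch below reads each line whole with the built-in TextLoader and splits it with REGEX_EXTRACT_ALL, so the text between the double quotes stays in one field regardless of how many commas it contains. The relation names and the regular expression are assumptions based on the three sample rows; lines that do not follow the id,"name",country pattern will not match it.

```pig
-- Read each line as a single chararray instead of splitting on commas
raw = LOAD 'File.csv.gz' USING TextLoader() AS (line:chararray);

-- Group 1: numeric id, group 2: text between the double quotes, group 3: rest of the line
parsed = FOREACH raw GENERATE
             FLATTEN(REGEX_EXTRACT_ALL(line, '([0-9]+),"([^"]*)",(.*)'))
             AS (id:chararray, name:chararray, country:chararray);

-- Cast the id to int and trim the trailing whitespace seen in the sample data
D = FOREACH parsed GENERATE (int)id AS id, name, TRIM(country) AS country;
DUMP D;
```

Unlike the script above, this keeps the original comma inside the name instead of replacing it with a space.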
