清管器csvexcelstorage双引号逗号

jm2pwxwz  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(437)

我正在将csv格式的文件(字段是逗号分隔的和双引号)接收到hdfs中,并开发了一个pig脚本,在使用hql脚本将数据插入配置单元之前,该脚本将删除头行并去掉双引号。
这个过程一直运作良好;但是,今天我发现其中一个表存在数据问题。特别是这个表的文件有一个字符串字段,它可以在双引号中包含多个逗号。这会导致某些记录的数据被错误地加载到配置单元中的错误列中。
我无法更改源文件的格式。
目前我正在使用PiggybankCsvExcelStorage来处理csv格式,如下所示。是否可以对此进行修改以产生正确的结果?我还有别的选择吗?我注意到现在也有一个csvloader,但是还没有找到任何例子来说明如何使用/实现它。清管器装载机

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')

编辑以添加其他样本数据和测试结果:
输入文件数据示例:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

使用上面提供格式的csvexcelloader:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

使用csvloader作为csvloader():注意-没有看到要提供给构造函数的参数的任何选项

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

我看到的唯一真正的区别是csvloader没有删除头行,因为我看到没有选择这个选项,而是删除了一些头名称。
我做错什么了吗?如能找到有效的解决方案,我们将不胜感激。

brgchamk

brgchamk1#

为了解决字段中逗号的问题,您可以尝试这项工作。
将数据作为行加载。
将“,”视为分隔符,并将其替换为管道字符,即“|”。
用空字符串替换开头和结尾的引号。
将行加载到以“|”作为分隔符的配置单元中。

A = LOAD 'test1.csv' AS (lines:chararray);
ranked = rank A;
B = FILTER ranked BY (rank_A > 1);
C = FOREACH B GENERATE REPLACE($1,'","','|');
D = FOREACH C GENERATE REPLACE($0,'"','');
DUMP D;

a=加载“test1.csv”为(lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked=等级a;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

b=按(秩a>1)排序的过滤器;

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

c=foreach b generate replace($1,“,”,“,”);

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

d=foreach c generate replace($0,“,”,“”);

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q)

现在可以将此数据加载到以“|”作为分隔符的配置单元中。

相关问题