清管器csvexcelstorage双引号逗号

jm2pwxwz 于 2021-06-02 发布在 Hadoop

关注(0)|答案(1)|浏览(436)

我正在将csv格式的文件（字段是逗号分隔的和双引号）接收到hdfs中，并开发了一个pig脚本，在使用hql脚本将数据插入配置单元之前，该脚本将删除头行并去掉双引号。
这个过程一直运作良好；但是，今天我发现其中一个表存在数据问题。特别是这个表的文件有一个字符串字段，它可以在双引号中包含多个逗号。这会导致某些记录的数据被错误地加载到配置单元中的错误列中。
我无法更改源文件的格式。
目前我正在使用PiggybankCsvExcelStorage来处理csv格式，如下所示。是否可以对此进行修改以产生正确的结果？我还有别的选择吗？我注意到现在也有一个csvloader，但是还没有找到任何例子来说明如何使用/实现它。清管器装载机

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')

编辑以添加其他样本数据和测试结果：
输入文件数据示例：

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

使用上面提供格式的csvexcelloader：

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

使用csvloader作为csvloader（）：注意-没有看到要提供给构造函数的参数的任何选项

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

我看到的唯一真正的区别是csvloader没有删除头行，因为我看到没有选择这个选项，而是删除了一些头名称。
我做错什么了吗？如能找到有效的解决方案，我们将不胜感激。

hadoop csv apache-pig delimiter

来源：https://stackoverflow.com/questions/38315557/pig-csvexcelstorage-doublequoted-commas

1条答案

按热度按时间

brgchamk1#

为了解决字段中逗号的问题，您可以尝试这项工作。
将数据作为行加载。
将“，”视为分隔符，并将其替换为管道字符，即“|”。
用空字符串替换开头和结尾的引号。
将行加载到以“|”作为分隔符的配置单元中。

A = LOAD 'test1.csv' AS (lines:chararray);
ranked = rank A;
B = FILTER ranked BY (rank_A > 1);
C = FOREACH B GENERATE REPLACE($1,'","','|');
D = FOREACH C GENERATE REPLACE($0,'"','');
DUMP D;

a=加载“test1.csv”为(lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked=等级a；

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

b=按（秩a>1）排序的过滤器；

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This S
ample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name
","3235","c_name2","Q")

c=foreach b generate replace（$1，“，”，“，”）；

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q")

d=foreach c generate replace（$0，“，”，“”）；

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This S
ample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name
|3235|c_name2|Q)

现在可以将此数据加载到以“|”作为分隔符的配置单元中。

赞(0）回复(0）举报 2021-06-02

我来回答

清管器csvexcelstorage双引号逗号

1条答案

相关问题

热门标签

最新问答