XPath: parsing XML data in Hive when some elements are present in one tag but missing in another

6vl6ewon  posted on 2021-06-24  in Hive

Here is the XML snippet:

<employment source="file">
<employer>
    <unparsed>EMPLOYER-2</unparsed>
</employer>
<fileDate>2020-07-21</fileDate>
<effDate>2020-07-21</effDate>
</employment>
<employment source="file">
<employer>
    <unparsed>EMPLOYER-1</unparsed>
</employer>
<occupation>NURSE</occupation>
<hiredDate>2006-09-01</hiredDate>
<fileDate>2015-08-07</fileDate>
<effDate>2015-08-07</effDate>
</employment>

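The XML above contains employment information for two employers, but the child elements under each employment tag differ. For example, EMPLOYER-2 has no hiredDate or occupation element, while EMPLOYER-1 has both. I need logic to extract the data from the XML file in the following format:

I tried explode and posexplode, but neither worked.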

mlmc2os5  #1

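The following steps may help.
1. Get hivexmlserde from Maven: https://mvnrepository.com/artifact/com.ibm.spss.hive.serde2.xml/hivexmlserde
2. Copy the input XML file to the HDFS location that the Hive external table points to:
   hadoop fs -copyFromLocal emp.xml /stackoverflow/data/hive/dwh/employee
3. Add the XML SerDe JAR in the Hive terminal and create the DDL as shown below.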

add jars file:///home/sathya/Downloads/hivexmlserde-1.0.5.3.jar;
Added [file:///home/sathya/Downloads/hivexmlserde-1.0.5.3.jar] to class path
Added resources: [file:///home/sathya/Downloads/hivexmlserde-1.0.5.3.jar]

CREATE EXTERNAL TABLE employee (
`employer_name` string,
`occupation` string,
`hiredDate` string,
`fileDate` string,
`effDate` string
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.employer_name"="/employment/employer/unparsed/text()",
"column.xpath.occupation"="/employment/occupation/text()",
"column.xpath.hiredDate"="/employment/hiredDate/text()",
"column.xpath.fileDate"="/employment/fileDate/text()",
"column.xpath.effDate"="/employment/effDate/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs://localhost:9000/stackoverflow/data/hive/dwh/employee'
    TBLPROPERTIES (
    "xmlinput.start"="<employment","xmlinput.end"="</employment>"
);

select * from employee;

EMPLOYER-2  NULL    NULL    2020-07-21  2020-07-21
EMPLOYER-1  NURSE   2006-09-01  2015-08-07  2015-08-07
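
To sanity-check the paths locally before creating the table, something like the following Python sketch may be useful. It is not how the SerDe is implemented. It only imitates the two ideas the DDL relies on: cutting the input into records between a start and an end marker, and reading one path per column so that a missing element yields an empty value instead of an error. The names split_records, extract_row and COLUMN_PATHS are hypothetical, and the paths are written relative to each employment element, pointing directly at the element whose text is wanted.

```python
import re
import xml.etree.ElementTree as ET

# Two records shaped like the sample in the question.
SAMPLE = """
<employment source="file">
<employer>
    <unparsed>EMPLOYER-2</unparsed>
</employer>
<fileDate>2020-07-21</fileDate>
<effDate>2020-07-21</effDate>
</employment>
<employment source="file">
<employer>
    <unparsed>EMPLOYER-1</unparsed>
</employer>
<occupation>NURSE</occupation>
<hiredDate>2006-09-01</hiredDate>
<fileDate>2015-08-07</fileDate>
<effDate>2015-08-07</effDate>
</employment>
"""

# Hypothetical column-to-path mapping (an assumption for this sketch),
# relative to each <employment> element.
COLUMN_PATHS = {
    "employer_name": "employer/unparsed",
    "occupation": "occupation",
    "hiredDate": "hiredDate",
    "fileDate": "fileDate",
    "effDate": "effDate",
}


def split_records(text, start="<employment", end="</employment>"):
    """Cut the text into records delimited by the start and end markers."""
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


def extract_row(record_xml):
    """Build one row per record; a missing element becomes None."""
    root = ET.fromstring(record_xml)
    return {col: root.findtext(path) for col, path in COLUMN_PATHS.items()}


rows = [extract_row(record) for record in split_records(SAMPLE)]
for row in rows:
    print(row)
```

Each record is parsed on its own, so an element that is absent from one employment block does not shift or break the columns of the other. No explode or posexplode step is involved, because every column holds at most one value per record.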
