我正在尝试使用配置单元xml serde解析嵌套的xml文件,但是在解析连续的嵌套结构时遇到了问题。我的示例xml文件如下所示:
"<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Task SYSTEM 'createTask.dtd'>
<TaskInitiateDs>
<RowId>1689</RowId>
<SourceSystem>EWO</SourceSystem>
<TaskIdentifier>NR5Z001</TaskIdentifier>
<TaskTypeDs>
<TaskCategory>N</TaskCategory>
<TaskTypeId>FND-BCD</TaskTypeId>
<SkillDurationDs>
<SkillCode>ACFC</SkillCode>
<Duration>00020</Duration>
<NumberOfPeople>1</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<SkillCode>TET</SkillCode>
<Duration>00005</Duration>
<NumberOfPeople>7</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<SkillCode>TEL</SkillCode>
<NumberOfPeople>8</NumberOfPeople>
<NumberOfPeople>10</NumberOfPeople>
<NumberOfPeople>30</NumberOfPeople>
</SkillDurationDs>
</TaskTypeDs>
<TaskTypeDs>
<TaskCategory>Y</TaskCategory>
<TaskTypeId>FND-BCO</TaskTypeId>
<SkillDurationDs>
<SkillCode>KDN</SkillCode>
<Duration>00050</Duration>
<NumberOfPeople>3</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<Duration>00052</Duration>
<Duration>00072</Duration>
<NumberOfPeople>16</NumberOfPeople>
</SkillDurationDs>
</TaskTypeDs>
<BusinessAssetIdCode>CLOV</BusinessAssetIdCode>
<PostCode>jasif</PostCode>
</TaskInitiateDs>"
我希望输出如下与数组分解。
Row_ID TaskCategory TaskType Skillcode Duration Numberofpeople
1689 N FND-BCD ACFC 20 1
1689 N FND-BCD TET 5 7
1689 N FND-BCD TEL NULL 8
1689 N FND-BCD TEL NULL 10
1689 N FND-BCD TEL NULL 30
1689 Y FND-BCO KDN 50 3
1689 Y FND-BCO NULL 52 16
1689 Y FND-BCO NULL 72 16
我尝试了下面的代码,但没有得到最后3列所需的输出。
Create external table TRIAL_TABLE (
ROW_ID string,
AnalogueDataLineTestSystemDs array<struct<TaskTypeDs:struct<TaskCategory:string,TaskTypeId:string,SkillDurationDs:struct<SkillCode:array<string>,Duration:array<string>,NumberOfPeople:array<string>>>>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.ROW_ID"="/TaskInitiateDs/RowId/text()",
"column.xpath.AnalogueDataLineTestSystemDs"="/TaskInitiateDs/TaskTypeDs"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/Hadoopdir/import_target_dir/test3'
TBLPROPERTIES ("xmlinput.start"="<TaskInitiateDs","xmlinput.end"= "</TaskInitiateDs>");
select t4.row_id,t4.taskcategory,t4.tasktypeid,L4.*,L5.*,L6.* from
(select t3.row_id,t3.taskcategory,t3.tasktypeid,L3.skillcode,L3.duration,L3.numberofpeople from
(select t2.row_id,L2.taskcategory,L2.tasktypeid,L2.skilldurationds from
(select row_id,tasktypeds_explo_elem from
(select row_ID,tasktypeds_explo from TRIAL_TABLE lateral view explode(AnalogueDataLineTestSystemDs) tasktypeds_explo as tasktypeds_explo) t1
lateral view inline(array(t1.tasktypeds_explo)) tasktypeds_explo_elem as tasktypeds_explo_elem) t2
lateral view inline(array(t2.tasktypeds_explo_elem)) L2) t3
lateral view inline(array(t3.skilldurationds)) L3) t4
lateral view explode(t4.skillcode) L4
lateral view explode(t4.duration) L5
lateral view explode(t4.numberofpeople) L6;
我找不到任何以前的相关问题来解析密钥数不断变化的嵌套xml。
另外,我假设如果有多个横向视图,那么在针对实际大数据使用时会对性能产生很大影响。
暂无答案!
目前还没有任何答案,快来回答吧!