如何使用配置单元xml serde解析嵌套的xml数据,并更改键的数量

qxsslcnc  于 2021-05-31  发布在  Hadoop
关注(0)|答案(0)|浏览(233)

我正在尝试使用配置单元xml serde解析嵌套的xml文件,但是在解析连续的嵌套结构时遇到了问题。我的示例xml文件如下所示:

"<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Task SYSTEM 'createTask.dtd'>
<TaskInitiateDs>
<RowId>1689</RowId>
<SourceSystem>EWO</SourceSystem>
<TaskIdentifier>NR5Z001</TaskIdentifier>
<TaskTypeDs>
<TaskCategory>N</TaskCategory>
<TaskTypeId>FND-BCD</TaskTypeId>
<SkillDurationDs>
<SkillCode>ACFC</SkillCode>
<Duration>00020</Duration>
<NumberOfPeople>1</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<SkillCode>TET</SkillCode>
<Duration>00005</Duration>
<NumberOfPeople>7</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<SkillCode>TEL</SkillCode>
<NumberOfPeople>8</NumberOfPeople>
<NumberOfPeople>10</NumberOfPeople>
<NumberOfPeople>30</NumberOfPeople>
</SkillDurationDs>
</TaskTypeDs>
<TaskTypeDs>
<TaskCategory>Y</TaskCategory>
<TaskTypeId>FND-BCO</TaskTypeId>
<SkillDurationDs>
<SkillCode>KDN</SkillCode>
<Duration>00050</Duration>
<NumberOfPeople>3</NumberOfPeople>
</SkillDurationDs>
<SkillDurationDs>
<Duration>00052</Duration>
<Duration>00072</Duration>
<NumberOfPeople>16</NumberOfPeople>
</SkillDurationDs>
</TaskTypeDs>
<BusinessAssetIdCode>CLOV</BusinessAssetIdCode>
<PostCode>jasif</PostCode>
</TaskInitiateDs>"

我希望输出如下与数组分解。

Row_ID  TaskCategory    TaskType    Skillcode   Duration    Numberofpeople
1689    N               FND-BCD     ACFC        20          1
1689    N               FND-BCD     TET         5           7
1689    N               FND-BCD     TEL         NULL        8
1689    N               FND-BCD     TEL         NULL        10
1689    N               FND-BCD     TEL         NULL        30
1689    Y               FND-BCO     KDN         50          3
1689    Y               FND-BCO     NULL        52          16
1689    Y               FND-BCO     NULL        72          16

我尝试了下面的代码,但没有得到最后3列所需的输出。

Create external table TRIAL_TABLE (
ROW_ID string,
AnalogueDataLineTestSystemDs array<struct<TaskTypeDs:struct<TaskCategory:string,TaskTypeId:string,SkillDurationDs:struct<SkillCode:array<string>,Duration:array<string>,NumberOfPeople:array<string>>>>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.ROW_ID"="/TaskInitiateDs/RowId/text()",
"column.xpath.AnalogueDataLineTestSystemDs"="/TaskInitiateDs/TaskTypeDs"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/Hadoopdir/import_target_dir/test3'
TBLPROPERTIES ("xmlinput.start"="<TaskInitiateDs","xmlinput.end"= "</TaskInitiateDs>");

select t4.row_id,t4.taskcategory,t4.tasktypeid,L4.*,L5.*,L6.* from 
(select t3.row_id,t3.taskcategory,t3.tasktypeid,L3.skillcode,L3.duration,L3.numberofpeople from 
(select t2.row_id,L2.taskcategory,L2.tasktypeid,L2.skilldurationds from 
(select row_id,tasktypeds_explo_elem from 
(select row_ID,tasktypeds_explo from TRIAL_TABLE lateral view explode(AnalogueDataLineTestSystemDs) tasktypeds_explo as tasktypeds_explo) t1 
lateral view inline(array(t1.tasktypeds_explo)) tasktypeds_explo_elem as tasktypeds_explo_elem) t2 
lateral view inline(array(t2.tasktypeds_explo_elem)) L2) t3 
lateral view inline(array(t3.skilldurationds)) L3) t4 
lateral view explode(t4.skillcode) L4 
lateral view explode(t4.duration) L5 
lateral view explode(t4.numberofpeople) L6;

我找不到任何以前的相关问题来解析密钥数不断变化的嵌套xml。
另外,我假设如果有多个横向视图,那么在针对实际大数据使用时会对性能产生很大影响。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题