配置单元:将json数据嵌套到配置单元表中的挑战

fzsnzjdm  于 2021-06-28  发布在  Hive
关注(0)|答案(0)|浏览(295)

我正在尝试将深度嵌套的json数据加载到配置单元表中。让我告诉你们我到目前为止都做了些什么。
1-我有json文件,它们嵌套得很深,就像结构数组一样,又有结构字段。
2-我成功地将这个json数据加载到spark数据框中,并且能够看到模式。此外,我使用sparkshell的以下命令成功地将这个Dataframe存储为一个配置单元表。

new org.apache.spark.sql.hive.HiveContext(sc).read.json(
"/user/alpha/test.json").saveAsTable("mywarehouse.patent_data_2001");

但当我想问你 select * from patent_data_2001 limit 1 它给了我以下的错误

FAILED: IllegalArgumentException Error: type expected at the position 4339 of 'array<struct<id:string>>:struct<claim:array<struc
.
.
.
.
.
,wi:string>,nb_file:string>>:struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>>' but 'stru
c' is found.

3-我尝试使用hiveserde而不是sparksql splliedjars,使用sparkshell的以下命令

hc.sql("SET spark.sql.hive.convertMetastoreParquet=false")

还是一样的错误。
它在hivewarehouse中创建表并加载数据,但是当我试图查询表甚至描述表时,它会给我错误。
4-假设此问题可能与spark sql和配置单元兼容性有关。我想修复这个模式,因为表是创建的,只有表模式是错误的。我试着手动修复模式,是的,这是一个很长很长时间的过程,最终修复了模式。我再次使用手动创建了配置单元表 Create table statement 如下所示。

CREATE TABLE `patent_data_2001`(
  `abstract` array<struct<id:string>> COMMENT '', 
  `claims` struct<claim:array<struct<id:string,num:string>>,id:string> COMMENT '', 
  `country` string COMMENT '', 
  `date_produced` string COMMENT '', 
  `date_publ` string COMMENT '', 
  `description` string COMMENT '', 
  `drawings`      struct<figure:array<struct<id:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,orientation:string,wi:string>,num:string>>,id:string> COMMENT '', 
  `dtd_version` string COMMENT '', 
  `file` string COMMENT '', 
  `id` string COMMENT '', 
  `lang` string COMMENT '', 
  `status` string COMMENT '', 
  `table_external_doc` array<string> COMMENT '', 
  `us_bibliographic_data_grant` struct<application_reference:struct<appl_type:string,document_id:struct<country:string,`date`:string,doc_number:string>>,assignees:struct<assignee:array<struct<addressbook:struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string,role:string>,first_name:string,last_name:string,orgname:string,role:string>>>,classification_locarno:struct<edition:string,main_classification:string>,classification_national:array<struct<country:string,main_classification:string>>,classifications_cpc:struct<further_cpc:struct<classification_cpc:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>,combination_set:array<struct<combination_rank:array<struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>,rank_number:string>>,group_number:string>>>,main_cpc:struct<classification_cpc:struct<action_date:struct<`date`:string>,classification_data_source:string,classification_status:string,classification_value:string,cpc_version_indicator:struct<`date`:string>,generating_office:struct<country:string>,main_group:string,scheme_origination_code:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,classifications_ipcr:struct<classification_ipcr:array<struct<action_date:struct<`date`:string>,classification_data_source:string,classification_level:string,classification_status:string,classification_value:string,generating_office:struct<country:string>,ipc_version_indicator:struct<`date`:string>,main_group:string,section:string,subclass:string,subgroup:string,symbol_position:string>>>,examiners:struct<assistant_examiner:struct<first_name:string,last_name:string>,primary_examiner:struct<department:string,first_name:string,last_name:string>>,invention_title:string,number_of_claims:string,pct_or_regional_filing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>,us_371c124_date:struct<`date`:string>,us_371c12_date:struct<`date`:string>>,pct_or_regional_publishing_data:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,priority_claims:struct<priority_claim:array<struct<country:string,`date`:string,doc_number:string,kind:string,sequence:string>>>,publication_reference:struct<document_id:struct<country:string,`date`:string,doc_number:string,kind:string>>,rule_47_flag:string,us_application_series_code:string,us_botanic:struct<latin_name:string,variety:string>,us_field_of_classification_search:struct<classification_national:array<struct<country:string,main_classification:string>>>,us_parties:struct<agents:struct<agent:array<struct<addressbook:array<struct<address:struct<country:string>,first_name:string,last_name:string,orgname:string>>,rep_type:string,sequence:string>>>,inventors:struct<inventor:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string>>,designation:string,sequence:string>>>,us_applicants:struct<us_applicant:array<struct<addressbook:array<struct<address:struct<city:string,country:string,state:string>,first_name:string,last_name:string,orgname:string>>,app_type:string,applicant_authority_category:string,designation:string,sequence:string>>>>,us_references_cited:struct<us_citation:array<struct<classification_national:array<struct<country:string,main_classification:string>>>>>,us_related_documents: struct < continuation: array< struct< relation: struct< child_doc: struct< document_id: struct < country: string,   `date`: string, doc_number: string > >, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string>>>>, continuation_in_part: array< struct < relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string,doc_number: string>, parent_grant_document: struct< document_id: struct< country: string,`date`:string,doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, division: array< struct< relation: struct< child_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct< country: string,`date`:string, doc_number: string >, parent_grant_document: struct< document_id: struct< country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct< country: string,`date`:string, doc_number: string>>, parent_status: string >>>>, reissue: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string>>, parent_doc: struct< document_id: struct < country: string,`date`:string,doc_number: string > ,parent_grant_document: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_pct_document: struct< document_id: struct < country: string,`date`:string, doc_number: string >>, parent_status: string >>>>, related_publication: array< struct < document_id: array < struct< country: string,`date`:string,doc_number: string, kind: string>>>>, substitution: array< struct< relation: struct < child_doc: struct < document_id: struct < country: string,`date`:string, doc_number: string >>, parent_doc: struct< document_id: struct < country: string,`date`:string, doc_number: string >, parent_status: string >>>>, us_provisional_application: array< struct < document_id: struct < country: string,`date`:string, doc_number: string >>>, us_term_of_grant: struct < disclaimer: array < struct< text: string>>>>>, 
  `us_chemistry` array<struct<cdx_file:string,idref:string,mol_file:string>> COMMENT '', 
  `us_claim_statement` string COMMENT '', 
  `us_math` array<struct<idrefs:string,img:struct<alt:string,file:string,he:string,id:string,img_content:string,img_format:string,wi:string>,nb_file:string>> COMMENT '', 
  `us_sequence_list_doc` struct<id:string,sequence_list:struct<carriers:string,file:string,seq_file_type:string>> COMMENT '')
  ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'path'='hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://cluster-A-XYZ:8020/user/hive/warehouse/mywarehouse/patent_data_2001';

但我还是犯了同样的错误。
5-我试着使用下面链接中列出的serde

https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/proofpoint/hive-serde
http://www.congiu.net/hive-json-serde/1.3.6/

但一点运气都没有。
6-在研究过程中,我发现这可能是这些jira中提到的一个问题


**ArrayIndexOutOfBounds exception for deeply nested structs**

https://issues.apache.org/jira/browse/HIVE-3253

**Support nested structs over 24 levels.**

https://issues.apache.org/jira/browse/HIVE-9500

我用below serde创建了表

ROW FORMAT SERDE   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true' )

但我得到了同样的错误,没有其他细节。
我不明白为什么会发生这些错误,以及为什么hive在允许我成功创建表时不允许我查询表。
任何帮助或建议都会很有帮助。请帮忙。
非常感谢。

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题