我有一个csv(或tsv),其中一列('nw\u day'在下面的示例中)具有序列化的数组对象,另一列('res\u m'在下面的示例中)具有序列化的json对象。它还具有string、timestamp和float数据类型的列。
对于看起来有点像的tsv(显示第一行)
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | {'Fri','Mon','Sat','Sun','Thurs','Tue','Wed'} | {"some_str":"str_one","some_n":1,"some_t":2019-04-24 06:00:03.700+0000}|
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
我试过下面的说法,但没有给我完美的结果。
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
com_id STRING,
w_start_time TIMESTAMP,
cap FLOAT,
nw_day array <STRING>,
res_m STRUCT <
some_str: STRING,
some_n: BIGINT,
some_t: TIMESTAMP
>)
COMMENT 's_e_s'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/location/to/folder/containing/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
所以,我想我反序列化这些对象到配置单元复杂的数据类型与数组和结构。但这并不是我跑步时得到的
select * from table_name limit 1;
这让我
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | ["{'Fri'"," 'Mon'"," 'Sat'"," 'Sun'"," 'Thurs'"," 'Tue'"," 'Wed'}"] | {"some_str":"{\"some_str\":\"str_one\",\"some_n\":1,\"some_t\":2019-04-24 06:00:03.700+0000}\","some_n":null,"some_t":null}|
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
因此,它将整个对象看作一个字符串,并用分隔符拆分字符串。
我需要一些帮助,了解如何将数据从csv/tsv加载到配置单元中的复杂数据类型。
我发现了一个类似的问题,但是需求有点不同,并且没有涉及复杂的数据类型。
任何帮助都将不胜感激。如果不能做到这一点,并且必须在加载之前包含预处理步骤,那么hive中复杂数据类型加载的输入数据示例将对我有所帮助。提前谢谢!
暂无答案!
目前还没有任何答案,快来回答吧!