我得到了一个表,其中有1列,这是一个字符串类型,但内部看起来像一个JSON类型。
值如下所示
- 删除“old_id”**列,其中包含值
[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
{"name":"Function","id":"10594","value":"People & Culture"},
{"name":"Unit","id":"10595","value":"Organization Development"},
{"name":"Tribe","id":"10602","value":"Shared Service"}
]
字符串
- 删除“new_id”**列的值
[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
{"name":"Function","id":"10594","value":"Finance"},
{"name":"Unit","id":"10595","value":"Finance Operations"},
{"name":"Tribe","id":"10602","value":"Commercial"}
]
型
我需要SQL Athena查询,使列old_name,old_id,old_value,new_name,new_id,new_value从这些JSON列
我试过用
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
型
但它只生成1行,即使在列中,它显示4个'数组'
查询应该生成4行,看起来像下面的x1c 0d1x
| ID|旧名|新名称|旧ID| new_id|旧值|新值|
| --|--|--|--|--|--|--|
| 一|恩蒂塔斯·佩努加桑|恩蒂塔斯·佩努加桑| 6415 | 6415 |Hijra| AFS|
| 一|功能|功能| 10594 | 10594 |人与文化|金融|
| 一|单元|单元| 10595 | 10595 |组织发展|融资业务|
| 一|部落|部落| 10602 | 10602 |共享服务|商业|
有没有办法在SQL Athena中做到这一点?
编辑:我做了一个小的进展与查询下面
with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
from my_table
where
-- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
id in (A)
),
splitted_data as (
SELECT id, user_id,
split(old_custom_fields, '},{') AS old_custom_field_id,
split(new_custom_fields, '},{') AS new_custom_field_id
FROM my_table
),
old_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
*
from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
select id, old_custom_field_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
select id, new_custom_field_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
old_id,new_id,
old_value,new_value
from old_custom_field_cleaned oc
join new_custom_field_cleaned nc on oc.id = nc.id
型
但这会导致重复的行,现在由于连接,我有16行,仍然需要帮助删除不需要的行
1条答案
按热度按时间s4n0splo1#
你的数据不仅仅看起来像JSON,样本数据也是JSON,因此将其作为一个处理。根据所使用的Presto/特里诺版本,实际处理可能会有所不同,但共同的部分始终是-解析JSON并将其转换为某种类型的
array
,然后取消嵌套。例如,您可以使用ROW(name varchar, id varchar, value varchar)
(根据具体情况,可以选择JSON
或MAP(varchar, varchar)
或MAP(varchar, JSON)
):字符串
输出量:
| 旧名|旧ID|旧值|新名称|new_id|新值|
| --|--|--|--|--|--|
| 恩蒂塔斯·佩努加桑| 6415 |Hijra|恩蒂塔斯·佩努加桑| 6415 |Hijra|
| 功能| 10594 |人与文化|功能| 10594 |人与文化|
| 单元| 10595 |组织发展|单元| 10595 |组织发展|
| 部落| 10602 |共享服务|部落| 10602 |共享服务|
上面的代码来自于假设数组中有“相同”的数据以正确的顺序。就个人而言,我会考虑基于id加入:
型
输出量:
| ID|旧名|旧值|新名称|新值|
| --|--|--|--|--|
| 6415 |恩蒂塔斯·佩努加桑|Hijra|恩蒂塔斯·佩努加桑|AFS|
| 10594 |功能|人与文化|功能|金融|
| 10595 |单元|组织发展|单元|融资业务|
| 10602 |部落|共享服务|部落|商业|