从类似JSON的字符串列创建新列

j2datikz  于 2023-11-20  发布在  其他
关注(0)|答案(1)|浏览(87)

我得到了一个表,其中有1列,这是一个字符串类型,但内部看起来像一个JSON类型。
值如下所示

  • 删除“old_id”**列,其中包含值
[{"name":"Entitas Penugasan","id":"6415","value":"HIJRA"},
 {"name":"Function","id":"10594","value":"People & Culture"},
 {"name":"Unit","id":"10595","value":"Organization Development"},
 {"name":"Tribe","id":"10602","value":"Shared Service"}
]

字符串

  • 删除“new_id”**列的值
[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
 {"name":"Function","id":"10594","value":"Finance"},
 {"name":"Unit","id":"10595","value":"Finance Operations"},
 {"name":"Tribe","id":"10602","value":"Commercial"}
]


我需要SQL Athena查询,使列old_name,old_id,old_value,new_name,new_id,new_value从这些JSON列
我试过用

REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
REGEXP_EXTRACT(old_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value,
REGEXP_EXTRACT(new_id, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value


但它只生成1行,即使在列中,它显示4个'数组'
查询应该生成4行,看起来像下面的x1c 0d1x
| ID|旧名|新名称|旧ID| new_id|旧值|新值|
| --|--|--|--|--|--|--|
| 一|恩蒂塔斯·佩努加桑|恩蒂塔斯·佩努加桑| 6415 | 6415 |Hijra| AFS|
| 一|功能|功能| 10594 | 10594 |人与文化|金融|
| 一|单元|单元| 10595 | 10595 |组织发展|融资业务|
| 一|部落|部落| 10602 | 10602 |共享服务|商业|
有没有办法在SQL Athena中做到这一点?
编辑:我做了一个小的进展与查询下面

with raw_data as(
select id, user_id, old_custom_fields, new_custom_fields
    from my_table
    where 
    -- new_custom_fields <> '' and new_custom_fields<> 'None' and new_custom_fields is not null and
    id in (A)
),
splitted_data as (
    SELECT id, user_id,
    split(old_custom_fields, '},{') AS old_custom_field_id,
    split(new_custom_fields, '},{') AS new_custom_field_id
  FROM my_table
),
old_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(old_custom_field_id) AS t (_old_custom_fields)
),
new_custom_field_id_unnest as (
SELECT
  *
 from splitted_data
CROSS JOIN UNNEST(new_custom_field_id) AS t (_new_custom_fields)
),
old_custom_field_cleaned as (
    select id, old_custom_field_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS old_name,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS old_id,
    REGEXP_EXTRACT(_old_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS old_value
    from old_custom_field_id_unnest
),
new_custom_field_cleaned as (
    select id, new_custom_field_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 1) AS new_name,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 2) AS new_id,
    REGEXP_EXTRACT(_new_custom_fields, '"name":"(.*?)","id":"(.*?)","value":"(.*?)"', 3) AS new_value
    from new_custom_field_id_unnest
)
select oc.id, old_name, new_name,
    old_id,new_id,
    old_value,new_value
    from old_custom_field_cleaned oc 
    join new_custom_field_cleaned nc on oc.id = nc.id


但这会导致重复的行,现在由于连接,我有16行,仍然需要帮助删除不需要的行

s4n0splo

s4n0splo1#

你的数据不仅仅看起来像JSON,样本数据也是JSON,因此将其作为一个处理。根据所使用的Presto/特里诺版本,实际处理可能会有所不同,但共同的部分始终是-解析JSON并将其转换为某种类型的array,然后取消嵌套。例如,您可以使用ROW(name varchar, id varchar, value varchar)(根据具体情况,可以选择JSONMAP(varchar, varchar)MAP(varchar, JSON)):

-- sample data
WITH dataset(old_id, new_id) AS (
    VALUES
        ('[{"id":"6415","value":"HIJRA", "name":"Entitas Penugasan"},
             {"name":"Function","id":"10594","value":"People & Culture"},
             {"name":"Unit","id":"10595","value":"Organization Development"},
             {"name":"Tribe","id":"10602","value":"Shared Service"}
          ]',
          '[{"name":"Entitas Penugasan","id":"6415","value":"AFS"},
             {"name":"Function","id":"10594","value":"Finance"},
             {"name":"Unit","id":"10595","value":"Finance Operations"},
             {"name":"Tribe","id":"10602","value":"Commercial"}
            ]'
         )
)

-- query
SELECT t.*
FROM dataset,
unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar))),
    cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(old_name, old_id, old_value, new_name, new_id, new_value); -- maybe as t(old, new) depending on engine and select t.old.name as old_name, ...

字符串
输出量:
| 旧名|旧ID|旧值|新名称|new_id|新值|
| --|--|--|--|--|--|
| 恩蒂塔斯·佩努加桑| 6415 |Hijra|恩蒂塔斯·佩努加桑| 6415 |Hijra|
| 功能| 10594 |人与文化|功能| 10594 |人与文化|
| 单元| 10595 |组织发展|单元| 10595 |组织发展|
| 部落| 10602 |共享服务|部落| 10602 |共享服务|
上面的代码来自于假设数组中有“相同”的数据以正确的顺序。就个人而言,我会考虑基于id加入:

-- sample data
WITH dataset(old_id, new_id) AS (
   -- ..
),

-- query 
old_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(old_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
),
new_values as (
    SELECT t.*
    FROM dataset,
    unnest(cast(json_parse(new_id) as array(row(name varchar, id varchar, value varchar)))) as t(name, id, value)
)

SELECT o.id,
       o.name old_name,
       o.value old_value,
       n.name new_name,
       n.value new_value
FROM old_values as o
full outer join new_values as n on o.id = n.id;


输出量:
| ID|旧名|旧值|新名称|新值|
| --|--|--|--|--|
| 6415 |恩蒂塔斯·佩努加桑|Hijra|恩蒂塔斯·佩努加桑|AFS|
| 10594 |功能|人与文化|功能|金融|
| 10595 |单元|组织发展|单元|融资业务|
| 10602 |部落|共享服务|部落|商业|

相关问题