我在snowflake的一个变量列表中存储了数百万个json。它们的格式如下,尽管每个json的行数不同。
有人能给我一些关于如何将数据提取到一个平面表格中的指导吗?我对处理json文件还比较陌生,而且行数不一致,而且缺少定义对象名的指示符,这让我很困惑。
下面是一个json示例:
{
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
"Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
"DeviceId": "streamingdevice",
"EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
"EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
"IoTHub": {
"ConnectionDeviceGenerationId": "637199801617320690",
"ConnectionDeviceId": "streamingdevice",
"CorrelationId": null,
"EnqueuedTime": "2020-05-04T22:12:21.0000000",
"MessageId": null,
"StreamId": null
},
"PartitionId": 1,
"Timestamp": "2019-10-30 13:48:05.000000"
}
“edge 93 belgium 43-23-19 1932”是对象名称;每个json只针对一个对象。
“time_1_avg.ab2 weight on bit”是读取类型,基本上由tag1.tag2组成。
行的最后一部分是读取值。
json底部的时间戳是读取时间。
本节不是必需的:
"DeviceId": "streamingdevice",
"EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
"EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
"IoTHub": {
"ConnectionDeviceGenerationId": "637199801617320690",
"ConnectionDeviceId": "streamingdevice",
"CorrelationId": null,
"EnqueuedTime": "2020-05-04T22:12:21.0000000",
"MessageId": null,
"StreamId": null
},
"PartitionId": 1,
此数据的理想输出为:
但只要得到这样的东西就很有帮助了:
谢谢你的帮助!
1条答案
按热度按时间6l7fqoea1#
假设所需的键始终有3个周期分隔的组件,则以下是解决方案的一种形式:
使用
FLATTEN
表函数取任意VARIANT
从表中键入列(示例中为1行常量),并将其分解为多行依赖于生成的
THIS
列(从FLATTEN
表)发出行常量值(Timestamp
)对于每个分解的行使用
NOT IN
筛选以排除不需要的键名使用
SPLIT
具有索引的函数将提取的键划分为多列这将产生一个类似于第一个屏幕截图的结果: