在PowerShell中解析大型JSON文件

nnvyjq4y  于 2023-08-08  发布在  Shell
关注(0)|答案(1)|浏览(157)

Context

在这篇文章中:
ConvertFrom-Json with large file
我询问了关于反序列化1.2GB JSON文件的问题。
这个答案贴在那里:
https://stackoverflow.com/a/76791900/268581
确实有效,但是非常慢

示例数据

所以你不必使用1.2GB的文件,这里有一个小的数据例子来回答这个问题。这只是原始大JSON文件中的前几项。
example.json

[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]

字符串

代码

下面是一些可以工作的(慢)代码。在1.2GB的文件上运行需要几个小时。

$path = ".\example.json"

$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)

$i = 0
$stream.ReadByte() # read '['
$i++

$json = ''

$data = @()

while ($i -lt $stream.Length)
{
    $byte = $stream.ReadByte(); $i++

    $char = [Convert]::ToChar($byte)
            
    if ($char -eq '}')
    {
        $json = $json + [Convert]::ToChar($byte)
        
        $data = $data + ($json | ConvertFrom-Json)

        $json = ''

        $stream.ReadByte() | Out-Null # read comma;
        $i++

        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
    else
    {
        $json = $json + [Convert]::ToChar($byte)
    }
}

$stream.Close()


运行它之后,你应该在$data中有记录:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *

py/object                                   event_type symbol                exchange id tape price size conditions     timestamp sequence_number trf_id trf_timestamp
---------                                   ---------- ------                -------- -- ---- ----- ---- ----------     --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T          O:AMD230728C00115000       304          0.38    1 {227}      1690471217275      1477738810
polygon.websocket.models.models.EquityTrade T          O:AFRM230728C00019500      302          0.07   10 {209}      1690471217278      1477739110
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      325           4.8    7 {219}      1690471217282       341519150
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519166
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519167
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      319           4.8    5 {219}      1690471217282       341519170
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8   19 {209}      1690471217284       341519682
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8    2 {219}      1690471217290       341519926
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8   15 {219}      1690471217290       341519927
polygon.websocket.models.models.EquityTrade T          O:META230728C00315000      302          4.76    1 {227}      1690471217323      1290750877

提问

有什么好办法能让这个过程更有效率?

注意事项

这个答案:
https://stackoverflow.com/a/43747641/268581
确实说明了使用Newtonsoft Json.NET的C#方法。
代码如下:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}


一种方法是下载Newtonsoft Json.NET DLL,并将其转换为PowerShell。一个挑战是这条线:

o = serializer.Deserialize<MyObject>(reader);


正如您所看到的,它正在进行一个泛型方法调用。我不清楚这将如何转换为Windows PowerShell 5.1。
最好是只依赖于原生JSON反序列化库的解决方案,但如果需要,也可以使用Newtonsoft方法。

nfg76nw0

nfg76nw01#

这是一个适用于PowerShell 7.3.6的版本。
它只需要几分钟,这对我的情况是可以接受的。

Add-Type -Path 'C:\Users\dharm\OneDrive\Documents\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\Newtonsoft.Json.dll'

class Row {
    [string]$symbol
    [Int64]$timestamp
}

$serializer = [Newtonsoft.Json.JsonSerializer]::new()

# $stream = [System.IO.File]::Open("..\polygon-io.py\data-2023-07-27-13-33-26.json", [System.IO.FileMode]::Open)

$stream = [System.IO.File]::Open("C:\Users\dharm\Dropbox\Documents\polygon-io.py\data-2023-07-27-13-33-26.json", [System.IO.FileMode]::Open)

$reader = [System.IO.StreamReader]::new($stream)

$json_reader = [Newtonsoft.Json.JsonTextReader]::new($reader)

$ls = New-Object System.Collections.Generic.List[Row]

$i = 0

while ($json_reader.Read())
{
    if ($i % 1000 -eq 0) { Write-Host $i }
    
    if ($json_reader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject)
    {        
        $obj = $serializer.Deserialize[Row]($json_reader)
        
        $ls.Add($obj)
    }

    $i++
}

$stream.Close()
$reader.Close()
$json_reader.Close()

# ----------------------------------------------------------------------
# examine result
# ----------------------------------------------------------------------

$ls.Count

$ls | Select-Object -First 10 | ft *

字符串

相关问题