Context
在这篇文章中:
ConvertFrom-Json with large file
我询问了关于反序列化1.2GB JSON文件的问题。
这个答案贴在那里:
https://stackoverflow.com/a/76791900/268581的
确实有效,但是非常慢
示例数据
所以你不必使用1.2GB的文件,这里有一个小的数据例子来回答这个问题。这只是原始大JSON文件中的前几项。example.json
:
[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]
字符串
代码
下面是一些可以工作的(慢)代码。在1.2GB的文件上运行需要几个小时。
$path = ".\example.json"
$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)
$i = 0
$stream.ReadByte() # read '['
$i++
$json = ''
$data = @()
while ($i -lt $stream.Length)
{
$byte = $stream.ReadByte(); $i++
$char = [Convert]::ToChar($byte)
if ($char -eq '}')
{
$json = $json + [Convert]::ToChar($byte)
$data = $data + ($json | ConvertFrom-Json)
$json = ''
$stream.ReadByte() | Out-Null # read comma;
$i++
if ($data.Count % 100 -eq 0)
{
Write-Host $data.Count
}
}
else
{
$json = $json + [Convert]::ToChar($byte)
}
}
$stream.Close()
型
运行它之后,你应该在$data
中有记录:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *
py/object event_type symbol exchange id tape price size conditions timestamp sequence_number trf_id trf_timestamp
--------- ---------- ------ -------- -- ---- ----- ---- ---------- --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T O:AMD230728C00115000 304 0.38 1 {227} 1690471217275 1477738810
polygon.websocket.models.models.EquityTrade T O:AFRM230728C00019500 302 0.07 10 {209} 1690471217278 1477739110
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 325 4.8 7 {219} 1690471217282 341519150
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 312 4.8 1 {209} 1690471217282 341519166
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 312 4.8 1 {209} 1690471217282 341519167
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 319 4.8 5 {219} 1690471217282 341519170
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 312 4.8 19 {209} 1690471217284 341519682
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 301 4.8 2 {219} 1690471217290 341519926
polygon.websocket.models.models.EquityTrade T O:TSLA230804C00270000 301 4.8 15 {219} 1690471217290 341519927
polygon.websocket.models.models.EquityTrade T O:META230728C00315000 302 4.76 1 {227} 1690471217323 1290750877
型
提问
有什么好办法能让这个过程更有效率?
注意事项
这个答案:
https://stackoverflow.com/a/43747641/268581的
确实说明了使用Newtonsoft Json.NET的C#方法。
代码如下:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
while (reader.Read())
{
// deserialize only when there's "{" character in the stream
if (reader.TokenType == JsonToken.StartObject)
{
o = serializer.Deserialize<MyObject>(reader);
}
}
}
型
一种方法是下载Newtonsoft Json.NET DLL,并将其转换为PowerShell。一个挑战是这条线:
o = serializer.Deserialize<MyObject>(reader);
型
正如您所看到的,它正在进行一个泛型方法调用。我不清楚这将如何转换为Windows PowerShell 5.1。
最好是只依赖于原生JSON反序列化库的解决方案,但如果需要,也可以使用Newtonsoft方法。
1条答案
按热度按时间nfg76nw01#
这是一个适用于PowerShell 7.3.6的版本。
它只需要几分钟,这对我的情况是可以接受的。
字符串