How to use jq to economically extract a small JSON fragment from near the beginning of a very large single JSON document?

Posted by czfnxgou on 2023-01-14 in Other

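The JSON file in question is quite large (~1.5 GB), but it has some metadata at a known location (.meta.view.approvals) near the beginning.
How can jq or gojq be used to extract the object at that location without loading the whole file into memory, and without waiting for the rest of the file to be processed once the item of interest has been extracted?
A generic method is sought, but the specific file I have in mind is rows.json at https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json. My copy was retrieved on January 12, 2023; the file size is 1459382170 bytes, and the value of .meta.view.createdAt in the file is 1403103517.
Command-line alternatives to jq, gojq, and jm are also of interest, so long as they are economical in both memory and CPU usage.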

JSON

Source: https://stackoverflow.com/questions/75116554/how-to-use-jq-to-economically-extract-a-small-json-fragment-from-near-the-beginn

1 Answer

Answer #1 by 2g32fytz

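Use the streaming parser of jq (or gojq) together with the filter "first_run", as shown below. The streaming parser reads the file incrementally, and first_run stops reading as soon as the events for the target path have passed.
For example, compared with the non-streaming parser, this reduces CPU time from about 50 seconds (39.90 user + 11.82 sys) to a value that /usr/bin/time reports as 0.00, and memory (maximum resident set size) from 4,112 MB to roughly 2 to 4 MB.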
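To see why a test on the event path is enough to locate the fragment, it helps to look at what the streaming parser emits. The following is a minimal sketch using a made-up toy document (not the real rows.json); it assumes jq 1.6 or later is on the PATH.

```shell
# Toy document (hypothetical), shaped like the start of rows.json.
doc='{"meta":{"view":{"approvals":[{"state":"approved"}],"name":"n"}},"data":[1,2]}'

# --stream turns the document into [path, leaf] events,
# plus a closing [path] event whenever an array or object ends.
events=$(printf '%s' "$doc" | jq -c --stream .)
printf '%s\n' "$events"
# Output:
# [["meta","view","approvals",0,"state"],"approved"]
# [["meta","view","approvals",0,"state"]]
# [["meta","view","approvals",0]]
# [["meta","view","name"],"n"]
# [["meta","view","name"]]
# [["meta","view"]]
# [["data",0],1]
# [["data",1],2]
# [["data",1]]
# [["data"]]
```

Only the first three events have a path starting with ["meta","view","approvals"], and they arrive as one contiguous run. That is why the reader can stop at the fourth event instead of scanning the remaining data.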
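Notes:

jq and gojq do not produce identical results, because gojq does not preserve the ordering of keys within objects.
The performance statistics shown below are for the file described in the question.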

Below is an excerpt showing the command invocations and key performance statistics on a 3 GHz machine.

/usr/bin/time -lp gojq -n --stream 'include "first_run" {search:"."};
  fromstream(3|truncate_stream(first_run(inputs;
    .[0][0:3] == ["meta","view", "approvals"]) ))' rows.json
    
user 0.00
sys 0.00
             3604480  maximum resident set size
             1409024  peak memory footprint

/usr/bin/time -lp jq -n --stream 'include "first_run" {search:"."};
  fromstream(3|truncate_stream(first_run(inputs;
    .[0][0:3] == ["meta","view", "approvals"]) ))' rows.json
user 0.00
sys 0.00
             2052096  maximum resident set size
             1175552  peak memory footprint

/usr/bin/time -lp jq .meta.view.approvals rows.json
user 39.90
sys 11.82
          4112465920  maximum resident set size
          6080188416  peak memory footprint

For comparison, using the streaming parser with select instead of first_run does not stop early and so scans the entire file:

/usr/bin/time -lp gojq -n --stream '
  fromstream(3|truncate_stream(inputs | select(.[0][0:3] == ["meta","view", "approvals"]) ))' rows.json
user 495.30
sys 273.72
          7858896896  maximum resident set size
         38385831936  peak memory footprint

The following jm command produces essentially the same result:

/usr/bin/time -lp jm --pointer /meta/view/approvals rows.json
user 0.05
sys 0.07
            13594624  maximum resident set size
             7548928  peak memory footprint

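Here is the def of first_run. Save it as first_run.jq in the current directory so that include "first_run" {search:"."}; can find it: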

# Emit the first run of the items in the stream for which the condition is truthy
def first_run(stream; condition):
  label $out
  | foreach stream as $x (null;
      ($x|condition) as $y
      | if $y
        then [$x]
        elif . then break $out
        else .
        end;
      if . then .[0] else empty end);
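As a quick check of the definition above, the sketch below inlines the def in a shell variable (instead of using a module file) and runs it twice: once on a plain stream of numbers, and once end to end on a small made-up document. The variable names and the toy document are illustrative assumptions; jq 1.6 or later is assumed to be on the PATH.

```shell
# The def from above, inlined so the example needs no first_run.jq file.
FIRST_RUN='def first_run(stream; condition):
  label $out
  | foreach stream as $x (null;
      ($x|condition) as $y
      | if $y then [$x] elif . then break $out else . end;
      if . then .[0] else empty end);'

# 1. Only the first run of matching items is emitted; the later 12 is never reached.
run_demo=$(jq -nc "$FIRST_RUN"' [first_run(1,2,3,10,11,1,12; . > 5)]')
echo "$run_demo"
# Output: [10,11]

# 2. End to end on a toy document (hypothetical, not the real rows.json).
doc='{"meta":{"view":{"approvals":[{"state":"approved"}],"name":"n"}},"data":[1,2]}'
approvals=$(printf '%s' "$doc" | jq -nc --stream "$FIRST_RUN"'
  fromstream(3|truncate_stream(first_run(inputs;
    .[0][0:3] == ["meta","view","approvals"])))')
echo "$approvals"
# Output: [{"state":"approved"}]
```

The break fires on the first non-matching event after the run has started, so with -n and inputs the rest of the input is never read. Here 3|truncate_stream strips the three leading path components, which lets fromstream rebuild the fragment as a top-level value.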
