Using jq, how can I split a very large JSON file into multiple files, each containing a specific number of objects?

Posted by smdnsysy on 2022-11-19 in Other

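I have a large JSON file containing, I'm guessing, 4 million objects. Each top-level object has a few levels of nesting. I want to split it into multiple files of 10,000 top-level objects each (retaining the structure inside each object). jq should be able to do that, right? I'm not sure how.
So the data looks like this: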

[{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}, {
  "id": 2,
  "user": {
    "name": "Isacco Scrancher",
    "email": "iscrancher1@aol.com",
    "address": {
      "city": "Likwatang Timur",
      "state": "Biharamulo"
    }
  },
  "product": {
    "name": "Beer - Original Organic Lager",
    "code": "47993-200"
  }
}, {
  "id": 3,
  "user": {
    "name": "Elga Sikora",
    "email": "esikora2@statcounter.com",
    "address": {
      "city": "Wenheng",
      "state": "Piedra del Águila"
    }
  },
  "product": {
    "name": "Parsley - Dried",
    "code": "36987-1632"
  }
}, {
  "id": 4,
  "user": {
    "name": "Andria Keatch",
    "email": "akeatch3@salon.com",
    "address": {
      "city": "Arras",
      "state": "Iracemápolis"
    }
  },
  "product": {
    "name": "Wine - Segura Viudas Aria Brut",
    "code": "51079-385"
  }
}, {
  "id": 5,
  "user": {
    "name": "Dara Sprowle",
    "email": "dsprowle4@slate.com",
    "address": {
      "city": "Huatai",
      "state": "Kaduna"
    }
  },
  "product": {
    "name": "Pork - Hock And Feet Attached",
    "code": "0054-8648"
  }
}]

Where this is one complete object:

{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}

Each file would contain a specified number of objects like that.

Source: https://stackoverflow.com/questions/49808581/using-jq-how-can-i-split-a-very-large-json-file-into-multiple-files-each-a-spec

2 Answers

ogq8wdun1#

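[EDIT: This answer has been revised in accordance with the revision to the question.]
The key to solving this with jq is the -c command-line option, which produces output in JSON Lines format (here, one object per line). A tool such as awk or split can then distribute those lines among several files.
If the file is not too big, the simplest approach is to start the pipeline with: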

jq -c '.[]' INPUTFILE

If the file is too big to fit comfortably in memory, you can use jq's streaming parser, like so:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))'

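Alternatively, you could use a command-line tool such as jstream or jm, which would be faster but would of course have to be installed.
For further discussion of jq's streaming parser, see the relevant section of the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser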

Partitioning

For different approaches to partitioning the output produced in the first step, see e.g. "How can I split a large text file into smaller files with an equal number of lines?"
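As a minimal sketch of the split-based route: the small lines.jsonl file below stands in for the output of the jq command from the first step, and the directory split_demo, the prefix chunk_, and the chunk size of 2 are illustrative choices, not part of the original answer. For the data in the question one would presumably use -l 10000 instead.

```shell
#!/bin/sh
# Illustrative sketch; split_demo, lines.jsonl and chunk_ are made-up names.
mkdir -p split_demo

# Stand-in for the first step, which would normally be something like:
#   jq -c '.[]' INPUTFILE > split_demo/lines.jsonl
printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' '{"id":4}' '{"id":5}' \
  > split_demo/lines.jsonl

# POSIX split: -l sets the number of lines per output file.
# Output files are named chunk_aa, chunk_ab, chunk_ac, ...
split -l 2 split_demo/lines.jsonl split_demo/chunk_
```

Each resulting chunk is itself in JSON Lines format. If an array per file is wanted instead, running something like jq -s . on a chunk should wrap its lines in a single array.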
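If each output file is required to be an array of objects, then I would probably use awk to perform both the partitioning and the reconstitution in one step, but there are many other reasonable approaches.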
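One possible way to do the awk step is sketched below. It is an assumption about what such a script could look like, not the answerer's own code. The variables size and prefix, the directory awk_demo, and the part_N.json naming are hypothetical, and lines.jsonl again stands in for the JSON Lines output of the first step.

```shell
#!/bin/sh
# Illustrative sketch; awk_demo, lines.jsonl, size and prefix are made-up names.
mkdir -p awk_demo

# Stand-in for the first step, which would normally be something like:
#   jq -c '.[]' INPUTFILE > awk_demo/lines.jsonl
printf '%s\n' '{"id":1}' '{"id":2}' '{"id":3}' '{"id":4}' '{"id":5}' \
  > awk_demo/lines.jsonl

awk -v size=2 -v prefix=awk_demo/part_ '
  (NR - 1) % size == 0 {              # first line of a new chunk
    if (out != "") {                  # finish the previous file, if any
      print "]" > out
      close(out)
    }
    n++
    out = prefix n ".json"
    print "[" > out                   # open the JSON array
    sep = ""
  }
  {
    printf "%s%s\n", sep, $0 > out    # one object per line, comma-separated
    sep = ","
  }
  END {
    if (out != "") {                  # close the last array
      print "]" > out
      close(out)
    }
  }
' awk_demo/lines.jsonl
```

With these five input lines and size=2, this writes part_1.json and part_2.json with two objects each and part_3.json with the remaining one. Closing each file as soon as it is complete keeps the number of open file descriptors constant regardless of how many chunks are produced.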

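If the input is a sequence of JSON objects

For reference, if the original file consists of a stream or sequence of JSON objects, the appropriate invocation would be: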

jq -n -c inputs INPUTFILE

Using `inputs` in this way allows arbitrarily many objects to be processed efficiently.

6jygbczu2#

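It is possible to slice a JSON file or stream with jq; see the script below. The sliceSize parameter sets the size of each slice and determines how many inputs are held in memory at the same time, which makes it possible to control memory usage.

Input to be sliced

The input does not have to be formatted.
The input can be:

an array of JSON inputs
a stream of JSON inputs

Sliced output

The files can be created with formatted or compact JSON.
The sliced output files can contain:

an array of JSON inputs of size $sliceSize
a stream of JSON inputs with $sliceSize items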

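Performance

A quick benchmark shows the time and memory consumption during slicing (measured on my laptop).

File with 100,000 JSON objects, 46 MB

sliceSize = 5,000: time = 35 s
sliceSize = 10,000: time = 40 s
sliceSize = 25,000: time = 1 min
sliceSize = 50,000: time = 1 min 52 s

File with 1,000,000 JSON objects, 450 MB

sliceSize = 5,000: time = 5 min 45 s
sliceSize = 10,000: time = 6 min 51 s
sliceSize = 25,000: time = 10 min 5 s
sliceSize = 50,000: time = 18 min 46 s, max memory consumption: about 150 MB
sliceSize = 100,000: time = 46 min 25 s, max memory consumption: about 300 MB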

#!/bin/bash

SLICE_SIZE=2

JQ_SLICE_INPUTS='
   2376123525 as $EOF |            # random number that does not occur in the input stream to mark the end of the stream
   foreach (inputs, $EOF) as $input
   (
      # init state
      [[], []];                    # .[0]: array to collect inputs
                                   # .[1]: array that has collected $sliceSize inputs and is ready to be extracted
      # update state
      if .[0] | length == $sliceSize   # enough inputs collected
         or $input == $EOF             # or end of stream reached
      then [[$input], .[0]]        # create new array to collect next inputs. Save array .[0] with $sliceSize inputs for extraction
      else [.[0] + [$input], []]   # collect input, nothing to extract after this state update
      end;

      # extract from state
      if .[1] | length != 0
      then .[1]                    # extract array that has collected $sliceSize inputs
      else empty                   # nothing to extract right now (because still collecting inputs into .[0])
      end
   )
'

write_files() {
  local FILE_NAME_PREFIX=$1
  local FILE_COUNTER=0
  local FILE_NAME line
  # IFS= and -r keep each line intact (no trimming, no backslash processing),
  # so JSON escapes such as \" and \n survive
  while IFS= read -r line; do
    FILE_COUNTER=$((FILE_COUNTER + 1))
    FILE_NAME="${FILE_NAME_PREFIX}_$FILE_COUNTER.json"
    echo "writing $FILE_NAME"
    jq '.'      > "$FILE_NAME" <<< "$line"   # array of formatted json inputs
#   jq -c '.'   > "$FILE_NAME" <<< "$line"   # compact array of json inputs
#   jq '.[]'    > "$FILE_NAME" <<< "$line"   # stream of formatted json inputs
#   jq -c '.[]' > "$FILE_NAME" <<< "$line"   # stream of compact json inputs
  done
}

echo "how to slice a stream of json inputs"
jq -n '{id: (range(5) + 1), a:[1,2]}' |   # create a stream of json inputs
jq -n -c --argjson sliceSize $SLICE_SIZE "$JQ_SLICE_INPUTS" |
write_files "stream_of_json_inputs_sliced"

echo -e "\nhow to slice an array of json inputs"
jq -n '[{id: (range(5) + 1), a:[1,2]}]' |                  # create an array of json inputs
jq -n --stream 'fromstream(1|truncate_stream(inputs))' |   # remove outer array to create stream of json inputs
jq -n -c --argjson sliceSize $SLICE_SIZE "$JQ_SLICE_INPUTS" |
write_files "array_of_json_inputs_sliced"

Script output

how to slice a stream of json inputs
writing stream_of_json_inputs_sliced_1.json
writing stream_of_json_inputs_sliced_2.json
writing stream_of_json_inputs_sliced_3.json

how to slice an array of json inputs
writing array_of_json_inputs_sliced_1.json
writing array_of_json_inputs_sliced_2.json
writing array_of_json_inputs_sliced_3.json

Generated files

`array_of_json_inputs_sliced_1.json`

[
  {
    "id": 1,
    "a": [1,2]
  },
  {
    "id": 2,
    "a": [1,2]
  }
]

`array_of_json_inputs_sliced_2.json`

[
  {
    "id": 3,
    "a": [1,2]
  },
  {
    "id": 4,
    "a": [1,2]
  }
]

`array_of_json_inputs_sliced_3.json`

[
  {
    "id": 5,
    "a": [1,2]
  }
]