BERT: converting a large context (5 MB of text) takes a very long time when running run_squad.py

Posted by 332nm8kg 5 months ago in Other


I ran the following command:

python run_squad.py \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/model.ckpt \
  --do_train=False \
  --train_file=$SQUAD_DIR/train-v2.0.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://some_bucket/squad_large/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME \
  --version_2_with_negative=True
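
For context on the two flags `--max_seq_length=384` and `--doc_stride=128`: a context longer than one sequence is split into overlapping windows, and each window becomes a separate model input. The sketch below mirrors that sliding-window logic as I understand it from run_squad.py; it is an illustration, not the script's code. The function name `doc_spans` and the token counts are made up for the example, so check the details against your copy of the script.

```python
def doc_spans(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) windows covering a context of num_doc_tokens tokens.

    Assumption: three positions are reserved for [CLS] and two [SEP] tokens,
    and the question tokens share the sequence with the context window.
    """
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0 or doc_stride <= 0:
        raise ValueError("no room left for context tokens, or non-positive stride")

    spans = []
    start = 0
    while start < num_doc_tokens:
        # Each window holds at most max_tokens_for_doc context tokens.
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the context
        # Advance by the stride, so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


if __name__ == "__main__":
    # Hypothetical sizes: a 1000-token context and a 10-token question.
    for start, length in doc_spans(1000, 10):
        print(start, length)
```

The number of windows grows roughly as the context length divided by `doc_stride`, and every window is a separate forward pass for every question, which is one reason a very large context is slow.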

dev-v2.0.json has the following JSON structure:

{
  "data": [
    {
      "paragraphs": [
        {
          "qas": [
            {
              "question": "question",
              "id": "65432sd54654dadaad"
            }
          ],
          "context": "paragraph"
        }
      ]
    }
  ]
}
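
A minimal sketch of producing a file with this structure from Python, putting several questions under one context so that a single run covers all of them. The helper name `build_predict_file`, the generated ids (`q0`, `q1`, ...) and the output filename are hypothetical. Note that this only avoids relaunching the script per question; as far as I can tell, the script still tokenizes the context once per question.

```python
import json


def build_predict_file(context, questions):
    """Build a dict in the structure shown above: one paragraph, many questions."""
    qas = [
        {"question": question, "id": "q{}".format(index)}  # ids are made up but unique
        for index, question in enumerate(questions)
    ]
    return {"data": [{"paragraphs": [{"qas": qas, "context": context}]}]}


if __name__ == "__main__":
    doc = build_predict_file("paragraph", ["first question", "second question"])
    # Hypothetical filename; pass whatever path you use for --predict_file.
    with open("my-predict.json", "w", encoding="utf-8") as handle:
        json.dump(doc, handle, ensure_ascii=False, indent=2)
```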

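When I ask a question against the same context, the context is converted into "1 0" again each time. If the context is a single paragraph, it takes about a minute per question to get predictions.json in the output directory.
If the context is too large, it takes hours per question. Where does the conversion happen? I would like to store the converted context and reuse it for prediction instead of converting it every time.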
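
As far as I can tell, the conversion in run_squad.py happens in `read_squad_examples` and `convert_examples_to_features`, which write the features to `eval.tf_record` in the output directory. The sketch below shows one generic way to cache the tokenized context on disk so it is computed only once. It is not part of BERT: `cached_tokenize`, the `tokenize` argument (a stand-in for whatever tokenizer function you use) and `cache_dir` are all hypothetical, and wiring it into the script would need changes there.

```python
import hashlib
import os
import pickle
import tempfile


def cached_tokenize(context, tokenize, cache_dir):
    """Tokenize `context` once and reuse the pickled result on later calls.

    `tokenize` is any callable mapping a string to a list of tokens
    (hypothetical stand-in for a real tokenizer).
    """
    # Key the cache on the context's content, so a changed context is re-tokenized.
    key = hashlib.sha256(context.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")

    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)

    tokens = tokenize(context)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(tokens, handle)
    return tokens


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        first = cached_tokenize("some long context", str.split, cache_dir)
        second = cached_tokenize("some long context", str.split, cache_dir)  # read from disk
        print(first == second)
```

Only the context tokenization can be reused this way. The final features also contain the question, and the model still has to run on every window for every question, so caching alone may not remove most of the running time.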


Source: https://github.com/google-research/bert/issues/751