PyTorch: relationship between gradient clipping and gradient accumulation

68bkxrlz · asked on 2023-10-20

I am trying to do gradient accumulation for an RNN model (following this), so I also need to clip the gradients. Say I decide to use a gradient norm threshold of 1.
Should I increase max_norm (threshold = 1) in proportion to the number of iterations I accumulate over (4)?

import torch
import torch.nn as nn
from torch import autocast
from torch.cuda.amp import GradScaler
from tqdm import tqdm

# model, optimizer, train_generator, and device are defined elsewhere.
loss_function = nn.BCEWithLogitsLoss()
num_batches = 1
iters_to_accumulate = 4
scaler = GradScaler()

model.train()
for batch in tqdm(train_generator, desc='Training'):
    with autocast(device_type=device.type, dtype=torch.float16):
        output = torch.flatten(model(batch['features'], batch['category'])).to(device)
        batch_loss = loss_function(output, batch['label'].float())
        # Divide the loss so the gradients accumulated over
        # iters_to_accumulate batches match one large batch.
        batch_loss = batch_loss / iters_to_accumulate

    scaler.scale(batch_loss).backward()

    if num_batches % iters_to_accumulate == 0:
        # Unscale first so clipping operates on the true gradient values.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()

    num_batches += 1

efzxgjgh #1

You do not need to change the gradient clipping value to account for gradient accumulation. With gradient accumulation the gradients are additive, but this is corrected by dividing the loss (batch_loss = batch_loss / iters_to_accumulate). That division ensures the combined gradient accumulated over n batches has the same magnitude as the gradient of a single large batch, so the same max_norm threshold applies in both cases.
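A quick numerical check illustrates this. The sketch below is a minimal toy example (a hypothetical linear model with MSE loss, not the asker's RNN/AMP setup): it computes the gradient of one batch of 8 samples, then the gradient of the same 8 samples accumulated over 4 mini-batches with the loss divided by 4, and confirms the two are equal.

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(8, 3)
y = torch.randn(8)

def mse(pred, target):
    return ((pred - target) ** 2).mean()

# Gradient of one large batch of 8 samples.
loss_full = mse(x @ w, y)
loss_full.backward()
grad_full = w.grad.clone()

# Same 8 samples accumulated over 4 mini-batches of 2,
# with the loss divided by the accumulation count.
w.grad = None
iters_to_accumulate = 4
for chunk_x, chunk_y in zip(x.chunk(iters_to_accumulate), y.chunk(iters_to_accumulate)):
    loss = mse(chunk_x @ w, chunk_y) / iters_to_accumulate
    loss.backward()  # gradients add up in w.grad across calls

print(torch.allclose(grad_full, w.grad, atol=1e-6))  # True

Because the accumulated gradient already has the same scale as a single-large-batch gradient, clipping it with max_norm=1 behaves exactly as it would without accumulation.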
