I'm trying to do gradient accumulation for an RNN model (adapted from this), and I also need to clip the gradients. Say I pick a gradient norm threshold of 1.
Should I increase max_norm (threshold = 1) in proportion to the number of iterations I accumulate over (4)?
import torch
import torch.nn as nn
from torch import autocast
from torch.amp import GradScaler
from tqdm import tqdm

# model, optimizer, train_generator, and device are defined elsewhere
loss_function = nn.BCEWithLogitsLoss()
num_batches = 1
running_loss = 0.0
iters_to_accumulate = 4
scaler = GradScaler()
model.train()

for batch in tqdm(train_generator, desc='Training'):
    with autocast(device_type=device.type, dtype=torch.float16):
        output = torch.flatten(model(batch['features'], batch['category'])).to(device)
        batch_loss = loss_function(output, batch['label'].float())
        # Scale the loss so the accumulated gradient matches a single large batch
        batch_loss = batch_loss / iters_to_accumulate
    scaler.scale(batch_loss).backward()
    if num_batches % iters_to_accumulate == 0:
        # Unscale before clipping so max_norm applies to the true gradients
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
    num_batches += 1
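A side note on the ordering in this loop: with GradScaler, the gradients produced by backward() are multiplied by the current loss scale, so scaler.unscale_(optimizer) has to run before clip_grad_norm_; otherwise max_norm=1 would be compared against the scaled gradients and the clip would trigger on almost every step. scaler.step(optimizer) then skips the update if the unscaled gradients contain infs or NaNs.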
1 Answer
You don't need to change the gradient-clipping value to account for gradient accumulation. With accumulation, gradients add up across micro-batches, but this is corrected by scaling the loss (batch_loss = batch_loss / iters_to_accumulate). Dividing the loss this way ensures that the combined gradient accumulated over n batches has the same magnitude as the gradient of a single batch n times as large, so max_norm = 1 keeps its meaning.
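To see this concretely, here is a minimal sketch (the toy linear model, batch sizes, and seed are made up for illustration, not from the post) that checks the accumulated gradient against a single large-batch gradient; the equality is exact for a reduction='mean' loss and equal-sized micro-batches:

import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = nn.Linear(8, 1)  # sees one big batch
model_b = nn.Linear(8, 1)  # sees four accumulated micro-batches
model_b.load_state_dict(model_a.state_dict())  # identical weights

loss_fn = nn.BCEWithLogitsLoss()
x = torch.randn(16, 8)
y = torch.randint(0, 2, (16, 1)).float()
iters_to_accumulate = 4

# (a) one batch of 16
loss_fn(model_a(x), y).backward()

# (b) four micro-batches of 4, each loss divided by iters_to_accumulate
for xb, yb in zip(x.chunk(iters_to_accumulate), y.chunk(iters_to_accumulate)):
    (loss_fn(model_b(xb), yb) / iters_to_accumulate).backward()

print(torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6))  # True

Since the two gradients are identical, clip_grad_norm_ with max_norm=1 behaves the same in both regimes; multiplying max_norm by iters_to_accumulate would just loosen the clip fourfold.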