I'm trying to do gradient accumulation for an RNN model (adapted from this), and I also need to clip the gradients. Say I pick a gradient norm threshold of 1.
Should I increase max_norm (threshold = 1) in proportion to the number of iterations I accumulate over (4)?
import torch
import torch.nn as nn
from torch import autocast
from torch.amp import GradScaler
from tqdm import tqdm

# model, optimizer, train_generator, and device are defined elsewhere
loss_function = nn.BCEWithLogitsLoss()
num_batches = 1
running_loss = 0.0
iters_to_accumulate = 4
scaler = GradScaler()
model.train()

for batch in tqdm(train_generator, desc='Training'):
    with autocast(device_type=device.type, dtype=torch.float16):
        output = torch.flatten(model(batch['features'], batch['category'])).to(device)
        batch_loss = loss_function(output, batch['label'].float())
        # Scale the loss so the accumulated gradient matches a single large batch
        batch_loss = batch_loss / iters_to_accumulate
    scaler.scale(batch_loss).backward()
    if num_batches % iters_to_accumulate == 0:
        # Unscale before clipping so max_norm applies to the true gradients
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
    num_batches += 1
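A side note on the ordering in this loop: with GradScaler, the gradients produced by backward() are multiplied by the current loss scale, so scaler.unscale_(optimizer) has to run before clip_grad_norm_; otherwise max_norm=1 would be compared against the scaled gradients and the clip would trigger on almost every step. scaler.step(optimizer) then skips the update if the unscaled gradients contain infs or NaNs.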
1 Answer
You don't need to change the gradient-clipping value to account for gradient accumulation. With accumulation, gradients add up across micro-batches, but this is corrected by scaling the loss (batch_loss = batch_loss / iters_to_accumulate). Dividing the loss this way ensures that the combined gradient accumulated over n batches has the same magnitude as the gradient of a single batch n times as large, so max_norm = 1 keeps its meaning.
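To see this concretely, here is a minimal sketch (the toy linear model, batch sizes, and seed are made up for illustration, not from the post) that checks the accumulated gradient against a single large-batch gradient; the equality is exact for a reduction='mean' loss and equal-sized micro-batches:

import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = nn.Linear(8, 1)  # sees one big batch
model_b = nn.Linear(8, 1)  # sees four accumulated micro-batches
model_b.load_state_dict(model_a.state_dict())  # identical weights

loss_fn = nn.BCEWithLogitsLoss()
x = torch.randn(16, 8)
y = torch.randint(0, 2, (16, 1)).float()
iters_to_accumulate = 4

# (a) one batch of 16
loss_fn(model_a(x), y).backward()

# (b) four micro-batches of 4, each loss divided by iters_to_accumulate
for xb, yb in zip(x.chunk(iters_to_accumulate), y.chunk(iters_to_accumulate)):
    (loss_fn(model_b(xb), yb) / iters_to_accumulate).backward()

print(torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6))  # True

Since the two gradients are identical, clip_grad_norm_ with max_norm=1 behaves the same in both regimes; multiplying max_norm by iters_to_accumulate would just loosen the clip fourfold.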