PyTorch Lightning cannot load weights for testing when several models are trained at the same time

ymdaylpp  posted on 2023-08-05 in Other

I want to run hyperparameter tuning for several different models at the same time, but the last line of the following code block fails:

model = model(**params)
trainer = pl.Trainer(
    accelerator='gpu',
    devices=1,
    precision=32,
    log_every_n_steps=1,
    max_epochs=1500,
    callbacks=[
        pl.callbacks.ModelCheckpoint(filename="best", monitor="Validation Loss MSE", save_top_k=3),
        pl.callbacks.ModelCheckpoint(save_last=True),
        pl.callbacks.early_stopping.EarlyStopping(monitor="Validation Loss MSE", patience=50),
    ],
)
tuner = pl.tuner.Tuner(trainer)
tuner.lr_find(model, datamodule)
trainer.fit(model, datamodule)
trainer.fit(model, datamodule)
trainer.test(ckpt_path=str(os.path.abspath(__file__).rsplit('/', 1)[0]) + "/lightning_logs/version_" + str(trainer.logger.version) + "/checkpoints/best.ckpt", datamodule=datamodule)

I get the following error message:

RuntimeError: Error(s) in loading state_dict for CNN_LSTM4:
        Missing key(s) in state_dict: "init_h", "init_c", "lstm.weight_ih_l0", "lstm.weight_hh_l0", "lstm.bias_ih_l0", "lstm.bias_hh_l0", "lstm.weight_ih_l1", "lstm.weight_hh_l1", "lstm.bias_ih_l1", "lstm.bias_hh_l1", "linear_layers.3.weight", "linear_layers.3.bias", "linear_layers.3.running_mean", "linear_layers.3.running_var", "linear_layers.9.weight", "linear_layers.9.bias", "linear_layers.9.running_mean", "linear_layers.9.running_var", "linear_layers.15.weight", "linear_layers.15.bias", "linear_layers.15.running_mean", "linear_layers.15.running_var". 
        Unexpected key(s) in state_dict: "linear_layers.20.weight", "linear_layers.20.bias", "linear_layers.20.running_mean", "linear_layers.20.running_var", "linear_layers.20.num_batches_tracked", "linear_layers.24.weight", "linear_layers.24.bias", "linear_layers.2.weight", "linear_layers.2.bias", "linear_layers.2.running_mean", "linear_layers.2.running_var", "linear_layers.2.num_batches_tracked", "linear_layers.8.weight", "linear_layers.8.bias", "linear_layers.8.running_mean", "linear_layers.8.running_var", "linear_layers.8.num_batches_tracked", "linear_layers.14.weight", "linear_layers.14.bias", "linear_layers.14.running_mean", "linear_layers.14.running_var", "linear_layers.14.num_batches_tracked". 
        size mismatch for cnn_layers.0.0.weight: copying a param with shape torch.Size([1003, 1, 19, 24]) from checkpoint, the shape in current model is torch.Size([1023, 1, 21, 24]).
        size mismatch for cnn_layers.0.0.bias: copying a param with shape torch.Size([1003]) from checkpoint, the shape in current model is torch.Size([1023]).
        size mismatch for cnn_layers.0.2.weight: copying a param with shape torch.Size([1003]) from checkpoint, the shape in current model is torch.Size([1023]).
        size mismatch for cnn_layers.0.2.bias: copying a param with shape torch.Size([1003]) from checkpoint, the shape in current model is torch.Size([1023]).


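This error only occurs when I run the same script for different models at the same time.
I want to load each model's weights correctly so that I can run several trainings simultaneously.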

tzdcorbm  #1

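I suggest that the author of the question refactor the code, because it is unclear what datamodule is and why Trainer.fit is called twice. In addition, you should use only one ModelCheckpoint callback.
I also recommend giving each run a different name, to avoid possible conflicts when checkpoints are written into the lightning_logs folder.
A possible solution could be: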

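import os
import pytorch_lightning as pl  # assumed alias; the question does not show its imports

model = model(**params)
trainer = pl.Trainer(
    # Trainer has no output_dir argument; default_root_dir sets where logs and checkpoints are written.
    default_root_dir=os.path.join('lightning_logs', 'some name for this experiment'),
    accelerator='gpu',
    devices=1,
    precision=32,
    log_every_n_steps=1,
    max_epochs=1500,
    callbacks=[
        # A single checkpoint callback keeps the top 3 checkpoints and the last one.
        pl.callbacks.ModelCheckpoint(
            monitor="Validation Loss MSE",
            save_top_k=3,
            save_last=True,
        ),
        pl.callbacks.early_stopping.EarlyStopping(
            monitor="Validation Loss MSE",
            patience=50,
        ),
    ],
)

tuner = pl.tuner.Tuner(trainer)
tuner.lr_find(model, datamodule)

trainer.fit(model, datamodule)

# "best" resolves to the best checkpoint tracked by this trainer's ModelCheckpoint,
# so the path does not have to be assembled by hand.
trainer.test(ckpt_path="best", datamodule=datamodule)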
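
The "some name for this experiment" placeholder above still has to differ for every process. One plausible cause of the error, which the traceback does not confirm, is that two concurrent runs end up in the same lightning_logs/version_N folder, so one run later loads a best.ckpt written by a different model. The sketch below is one way to build a per-run directory using only the standard library. make_run_dir is a hypothetical helper, not part of Lightning, and the naming scheme (model name, timestamp, process id, random suffix) is an assumption.

```python
import os
import uuid
from datetime import datetime
from pathlib import Path


def make_run_dir(base_dir: str, model_name: str) -> Path:
    """Create and return a directory that is unique to this run.

    Hypothetical helper, not part of Lightning. The name combines the model
    name, a timestamp, the process id, and a short random suffix, so two
    processes started at the same moment still get different directories.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    run_dir = Path(base_dir) / f"{model_name}-{stamp}-pid{os.getpid()}-{suffix}"
    # exist_ok=False makes a name collision fail loudly instead of letting
    # two runs silently share one folder.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# Assumed usage with the Trainer from the answer above:
#   run_dir = make_run_dir("lightning_logs", "CNN_LSTM4")
#   trainer = pl.Trainer(default_root_dir=str(run_dir), ...)
```

Because every process then writes its checkpoints under its own folder, trainer.test(ckpt_path="best", ...) should only ever see checkpoints produced by the same run.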

