I just benchmarked TensorFlow 2.0 against PyTorch 2.0 (I have only just started learning PyTorch), and as far as I can tell, with exactly the same model architecture, the same batch size, the same optimizer, and both running on CUDA, PyTorch takes roughly three times as long as TF (~1 minute for TF vs. ~3 minutes for PyTorch) and ends up with worse accuracy as well (83% validation accuracy for TF vs. 78% for PyTorch).
Here is the TensorFlow code:
import tensorflow as tf

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(optimizer=tf.keras.optimizers.SGD(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=20, batch_size=64,
          validation_data=(test_images, test_labels))
Here is the PyTorch code:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

device = torch.device("cuda")

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
model = NeuralNetwork()
model.to(device)

learning_rate = 1e-3
batch_size = 64

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        pred = model(X)
        loss = loss_fn(pred, y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
def test_loop(dataloader, model, loss_fn):
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")
What I've noticed is that TensorFlow uses about 60% of my CUDA utilization and all of my dedicated GPU memory, while PyTorch flickers between 0% and 30% and barely uses any GPU memory at all.
This post exists, but that case was caused by CUDA_LAUNCH_BLOCKING, which appears nowhere in my code.
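One way to pin down whether the bottleneck is the input pipeline rather than the GPU itself is to time DataLoader waits and GPU work separately. Below is a minimal sketch of that idea (timed_epoch is a hypothetical helper, not part of the code above; torch.cuda.synchronize() is needed because CUDA kernels run asynchronously):

import time

def timed_epoch(dataloader, model, loss_fn, optimizer):
    # Hypothetical helper: split one epoch's wall time into time spent
    # waiting on the DataLoader vs. everything else (H2D copy + forward/backward).
    data_time = other_time = 0.0
    model.train()
    t0 = time.perf_counter()
    for X, y in dataloader:
        t1 = time.perf_counter()
        data_time += t1 - t0              # time spent waiting for the next batch
        X, y = X.to(device), y.to(device)
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()          # CUDA is async; flush queued kernels before reading the clock
        t0 = time.perf_counter()
        other_time += t0 - t1
    print(f"data wait: {data_time:.1f}s, compute: {other_time:.1f}s")

If the data-wait term dominates, DataLoader settings (workers, pinned memory, persistent workers) are the right lever; if compute dominates, the model or batch size is the issue.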
Specs:
- RTX 3060
- Intel i7-10700, overclocked to ~4.2 GHz
- 64 GB RAM
Edit: I have tried raising the number of workers and enabling pinned memory for the DataLoaders, but that only saved about 15 seconds across the 20 epochs, and torch.backends.cudnn.benchmark = True didn't help at all.
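For reference, the settings described in this edit presumably amounted to something like the following (the num_workers value is a guess; the right number depends on the machine):

torch.backends.cudnn.benchmark = True   # autotunes convolution algorithms; this model has
                                        # no conv layers, which would explain the lack of effect

train_dataloader = DataLoader(
    training_data,
    batch_size=64,
    num_workers=4,     # assumed value; tune per machine
    pin_memory=True,   # page-locked host memory makes .to(device) copies faster
)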
2 Answers

#1 (a14dhokn)
I have converted your code from TensorFlow to PyTorch and it runs faster, so check this version of the code and use it on your device.
#2 (gcxthw6b)
When I ran a test to optimize num_workers as outlined in this medium article, I found that my DataLoader was what was slowing me down. While going through the docs afterwards I came across persistent_workers=True, and that solved it for me: the run went from ~3 minutes down to ~25 seconds.
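For anyone landing here with the same problem, the change is presumably just in how the DataLoaders are constructed, along these lines (the worker count is an assumption to tune):

train_dataloader = DataLoader(
    training_data,
    batch_size=64,
    num_workers=4,            # assumed count; persistent_workers requires num_workers > 0
    pin_memory=True,
    persistent_workers=True,  # keep worker processes alive between epochs instead of
                              # re-spawning them at the start of every epoch
)

Without persistent_workers, the worker processes are torn down and re-created at every epoch boundary, which adds up quickly over 20 short epochs on a small dataset like Fashion-MNIST.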