TensorFlow 2 model prediction is slow

s4n0splo posted on 2023-08-06 in Other

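I originally described this problem as the GPU not being used, but that works now. The training phase is faster than before: it now takes 4 s, versus 17 s when I wrote the question. Depending on batch and sample size, GPU utilization is around 10-20%. I used TF 1 a few years ago, and it should still work now as well.
Prediction is slower than training. I only use Dense layers, so I'm not sure whether the GPU is used during prediction. I plan to add more layers, so I'd really like to know whether this is simply how TF 2 works or whether there is a way to speed things up.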
GPU:GTX 1070
Python 3.9

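  • I have installed the libraries compatible with GPU support on Win10, i.e. TF < 2.11
  • Tried TF 2.9.x and 2.10.x; no difference.
  • Installed cudatoolkit and cudnn with conda inside a conda environment, using compatible versions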

Environment setup

conda create -n tf6 python==3.9.16
conda activate tf6
conda install cudatoolkit=11.2.2 -c conda-forge
conda install cudnn=8.1.0 -c conda-forge
pip install tensorflow-gpu==2.9.3


GPU detection

Whenever I run this test, my GPU usage spikes up to 100%.

import tensorflow as tf

print("CPU LIST:", tf.config.list_physical_devices("CPU"))
print("GPU LIST:", tf.config.list_physical_devices("GPU"))
print("Deprecated AVAILABLE:", tf.test.is_gpu_available())  # Deprecated
print("Deprecated AVAILABLE (noCuda):", tf.test.is_gpu_available(cuda_only=False))  # Deprecated
print("Deprecated AVAILABLE (Cuda):", tf.test.is_gpu_available(cuda_only=True))  # Deprecated
print("BUILD WITH CUDA:", tf.test.is_built_with_cuda())  # False would mean a CPU-only TensorFlow package is installed

from tensorflow.python.client import device_lib

print("=== " * 6)
print("LOCAL DEVICES:")
print(device_lib.list_local_devices())

Output:
CPU LIST: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
GPU LIST: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
WARNING:tensorflow:From P:\LocalPrograms\stock\friendly_solution_23-07\modules\Check_Env_GPU.py:20: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-07-06 01:43:49.051838: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Deprecated AVAILABLE: True
Deprecated AVAILABLE (noCuda): True
Deprecated AVAILABLE (Cuda): True
BUILD WITH CUDA: True
=== === === === === === 
LOCAL DEVICES:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12081208059895952385
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 6957301760
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16955984514207249187
physical_device_desc: "device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
xla_global_id: 416903419
]
2023-07-06 01:43:49.612956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
2023-07-06 01:43:49.614541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
2023-07-06 01:43:49.615236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
2023-07-06 01:43:49.615928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
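To check whether the GPU is actually used for a given op, rather than only whether TensorFlow can see it, TensorFlow can log the device each op is placed on. The sketch below assumes TF 2.x; `gpu_device_names` is a hypothetical helper name, and it returns None when TensorFlow isn't installed so it runs anywhere. It also uses `tf.config.list_physical_devices("GPU")`, the replacement the deprecation warning above recommends.

```python
import importlib.util
from typing import List, Optional


def gpu_device_names() -> Optional[List[str]]:
    """Return the GPU devices TensorFlow can see, or None if TF isn't installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    # Log the device each op runs on (look for ".../device:GPU:0" in the log).
    tf.debugging.set_log_device_placement(True)
    a = tf.random.uniform((256, 256))
    tf.matmul(a, a)  # the MatMul placement is written to the log
    tf.debugging.set_log_device_placement(False)

    # Non-deprecated replacement for tf.test.is_gpu_available().
    return [d.name for d in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = gpu_device_names()
    print("TensorFlow not installed" if names is None else f"GPUs: {names}")
```

The same `tf.debugging.set_log_device_placement(True)` switch can also be turned on around a `model.predict` call to inspect placement during prediction; expect a verbose log.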


Small benchmark

The GPU is barely utilized; the usage percentage is very low.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, Flatten
from tensorflow.keras.layers import ConvLSTM2D

import numpy as np
import time

# import keras

import tensorflow as tf

size = int(3e4)
N = 500
W_nodes = 600

X = np.random.random((size, N))
Y = np.random.random(size)

print("\n" * 3)
print("Testing model (no session)")
model = Sequential()
model.add(Dense(W_nodes, input_shape=(N,)))
model.add(Dense(W_nodes))
model.add(Dense(W_nodes))
model.add(Dense(W_nodes))
model.add(Dense(W_nodes))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

print(f"Fitting X:{X.shape}, Y:{Y.shape}")
t0 = time.time()
model.fit(X, Y, verbose=True, epochs=1, batch_size=10)
model.predict(X)
tend = time.time()
print(f"Time fit i predict: {tend - t0}")

####################

# This was faster for me in another environment, where the no-session run took 17 s and this one took around 4 s.
print("Checking Compat V1 Session")
import tensorflow as tf

with tf.compat.v1.Session():
    model = Sequential()
    model.add(Dense(W_nodes, input_shape=(N,)))
    model.add(Dense(W_nodes))
    model.add(Dense(W_nodes))
    model.add(Dense(W_nodes))
    model.add(Dense(W_nodes))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

    print(f"Fitting X:{X.shape}, Y:{Y.shape}")
    t0 = time.time()
    model.fit(X, Y, verbose=True, epochs=1, )
    model.predict(X, verbose=True)
    tend = time.time()
    print(f"Time fit and predict: {tend - t0}")

Output:
Testing model (no session)
2023-07-06 02:30:17.633503: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-06 02:30:18.150568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
Fitting X:(100000, 500), Y:(100000,)
1000/1000 [==============================] - 6s 5ms/step - loss: 4.0560 - accuracy: 0.0000e+00
3125/3125 [==============================] - 4s 1ms/step

Time fit i predict: 11.233973741531372
Checking Compat V1 Session
2023-07-06 02:30:29.674981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6635 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1
Fitting X:(100000, 500), Y:(100000,)
Train on 100000 samples
2023-07-06 02:30:30.064769: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
100000/100000 [==============================] - 13s 128us/sample - loss: 1.1066 - accuracy: 0.0000e+00
C:\Users\Greg\anaconda3\envs\tf6\lib\site-packages\keras\engine\training_v1.py:2067: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.
  updates=self.state_updates,
Time fit i predict: 17.37589716911316


qyswt5oh


I think you are measuring training and prediction time together with your time() block:
t0 = time.time()
model.fit(X,Y,verbose=True,epochs=1,batch_size=10)
model.predict(X)
tend = time.time()
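As expected, training in TF 2 (6 s) is faster than in TF 1 (13 s), while prediction takes around 4-5 s in both cases. Adding training and prediction together, TF 2 comes to 11 s (6 s + 5 s) and TF 1 to 17 s (13 s + 4 s).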
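To get separate numbers, time each phase on its own. Below is a minimal, framework-agnostic sketch; `time_phases` is a hypothetical helper, and with Keras you would pass callables such as `lambda: model.fit(X, Y, epochs=1)` and `lambda: model.predict(X)` instead of the `time.sleep` stand-ins.

```python
import time
from typing import Callable, Dict


def time_phases(phases: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each callable once, in order, and return its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for name, fn in phases.items():
        t0 = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings[name] = time.perf_counter() - t0
    timings["total"] = sum(timings.values())  # computed before "total" is added
    return timings


if __name__ == "__main__":
    # Stand-in workloads; replace with model.fit / model.predict calls.
    result = time_phases({
        "fit": lambda: time.sleep(0.05),
        "predict": lambda: time.sleep(0.02),
    })
    for name, seconds in result.items():
        print(f"{name}: {seconds:.3f} s")
```

Using `time.perf_counter()` rather than `time.time()` avoids jumps from system clock adjustments.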
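Also note that your two examples use different batch sizes: the first fit passes batch_size=10, while the second omits it, so Keras falls back to its default of 32. The progress line in the second run,

100000/100000 [=

does not mean the batch size was 1. Inside the compat.v1 session, Keras runs its v1 training loop (training_v1.py in the warning), and that loop counts samples rather than batches, as "Train on 100000 samples" and the 128us/sample rate show (128 us × 100000 ≈ 13 s). So the batch size alone does not explain the 13 s training time.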
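Step counters in the TF 2 progress bars count batches, so they can be cross-checked with steps = ceil(samples / batch_size). For example, predict's 3125/3125 in the first output matches 100000 samples at Keras's default batch size of 32, while batch_size=10 on the same data would give 10000 steps. A small sketch; `steps_per_epoch` is a hypothetical helper name:

```python
def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches per epoch; the last batch may be partial (ceiling division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return (num_samples + batch_size - 1) // batch_size


if __name__ == "__main__":
    n = 100_000
    for bs in (10, 32, 100):
        print(f"batch_size={bs}: {steps_per_epoch(n, bs)} steps")
```

Integer ceiling division is used instead of `math.ceil(n / bs)` to avoid any floating-point rounding on large counts.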
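Both your network and a batch size of 10 are small and shouldn't saturate the GPU. You can try more/larger layers, more data, and a larger batch size.
A final note on timing training: because of initialization overhead and other background tasks, the first epoch of fit usually takes longer than subsequent ones.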
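One way to keep that first-call overhead out of the numbers is to run the workload once without measuring it and then report the median of several timed runs. This is a minimal sketch; `benchmark` is a hypothetical helper, and with Keras you might call `benchmark(lambda: model.predict(X), warmup=1, repeats=3)`.

```python
import statistics
import time
from typing import Callable, List


def benchmark(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> float:
    """Median wall-clock time of `fn` in seconds, after discarding warm-up runs.

    The first call(s) often include one-time setup costs (for example,
    function tracing or device initialization), so they run but are not timed.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


if __name__ == "__main__":
    calls = {"n": 0}

    def workload():
        calls["n"] += 1
        # Simulate an expensive first call followed by cheap steady-state calls.
        time.sleep(0.2 if calls["n"] == 1 else 0.01)

    print(f"median steady-state time: {benchmark(workload):.3f} s")
```

The median is less sensitive than the mean to an occasional slow run caused by background tasks.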
