我刚刚建立了一个新的基于AMD的PC,
- CPU - AMD锐龙7 3700 X,
- GPU - AMD Radeon RX Vega 56,
- 操作系统- Ubuntu 18.04.
为了使用AMD GPU for Tensorflow,我按照这两个安装ROCm。一切看起来都很好,安装没有问题。我想我安装了ROCM 3。我做的和海报上的一模一样。
- https://towardsdatascience.com/train-neural-networks-using-amd-gpus-and-keras-37189c453878
- https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU
- 视频链接:https://www.youtube.com/watch?v=fkSRkAoMS4g
但是当我在终端运行rocm-bandwidth-test
时,作为视频,我得到了如下结果。
(base) nick@nick-nbpc:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 3700X 8-Core Processor
Device: 1, Vega 10 XT [Radeon RX Vega 64], 2f:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.295924
1 8.892247 72.654038
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.103560
1 17.103560 N/A
该视频使用AMD RX 580 GPU,我比较了下面链接的技术规格。https://www.youtube.com/watch?v=shstdFZJJ_o显示RX580的内存带宽为256 Gb/s,Vega 56的内存带宽为409.6 Gb/s。在另一个视频中,上传器在视频的时间11:09处具有195 Gb/s的带宽。但是我的Vega 56只有72.5 Gb/s!这是一个巨大的差异。我不知道怎么了。
然后安装Python 3.6和TensorFlow-ROCm。我克隆了tensorflow/benchmarks
,就像视频一样,在TensorFlow中进行基准测试。
执行代码:
$ python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
给出以下结果:
Done warm up
Step Img/sec total_loss
1 images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40 images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50 images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60 images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70 images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80 images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90 images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------
结果没有我预期的那么好。我期待着100+。但由于我对Ubuntu/AMD/TensorFlow的了解有限,我很可能是错的。如果没有,有人能告诉我为什么我的带宽没有400 Gb/s那么快吗?
clinfo
:
(base) nick@nick-nbpc:~$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3137.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon RX Vega 64]
Device Topology: PCI[ B#47, D#0, F#0 ]
Max compute units: 56
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1590Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fe56aa5fcf0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3137.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
rocminfo
:
(base) nick@nick-nbpc:~$ rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 3700X 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 3700X 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-02151e1bb9ee2144
Marketing Name: Vega 10 XT [Radeon RX Vega 64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1590
BDFID: 12032
Internal Node ID: 1
Compute Unit: 56
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
1条答案
按热度按时间8ftvxx2r1#
我不能回答带宽问题,但我刚刚尝试了相同的基准(根据YouTube视频)
我得到:
和你的一样但是:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
给了我:
客户信息
罗明福
唯一不同的是
执行代码:python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
在Python 2中执行测试。(也许这只是一个错字)
欢迎gspeet