tensorflow Vega 56带宽慢(不仅)TensowFlow -为什么和如何?

xqkwcwgp  于 2023-10-23  发布在  其他
关注(0)|答案(1)|浏览(72)

我刚刚建立了一个新的基于AMD的PC,

  • CPU - AMD锐龙7 3700 X,
  • GPU - AMD Radeon RX Vega 56,
  • 操作系统- Ubuntu 18.04.

为了使用AMD GPU for Tensorflow,我按照这两个安装ROCm。一切看起来都很好,安装没有问题。我想我安装了ROCM 3。我做的和海报上的一模一样。

但是当我在终端运行rocm-bandwidth-test时,作为视频,我得到了如下结果。

(base) nick@nick-nbpc:~$ rocm-bandwidth-test
........
          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)

          Device: 0,  AMD Ryzen 7 3700X 8-Core Processor
          Device: 1,  Vega 10 XT [Radeon RX Vega 64],  2f:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         

          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         

          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         9.295924    

          1         8.892247    72.654038   

          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         17.103560   

          1         17.103560   N/A

该视频使用AMD RX 580 GPU,我比较了下面链接的技术规格。https://www.youtube.com/watch?v=shstdFZJJ_o显示RX580的内存带宽为256 Gb/s,Vega 56的内存带宽为409.6 Gb/s。在另一个视频中,上传器在视频的时间11:09处具有195 Gb/s的带宽。但是我的Vega 56只有72.5 Gb/s!这是一个巨大的差异。我不知道怎么了。
然后安装Python 3.6和TensorFlow-ROCm。我克隆了tensorflow/benchmarks,就像视频一样,在TensorFlow中进行基准测试。
执行代码:

$ python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

给出以下结果:

Done warm up
Step    Img/sec total_loss
1   images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10  images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40  images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50  images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60  images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70  images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80  images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90  images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------

结果没有我预期的那么好。我期待着100+。但由于我对Ubuntu/AMD/TensorFlow的了解有限,我很可能是错的。如果没有,有人能告诉我为什么我的带宽没有400 Gb/s那么快吗?

clinfo

(base) nick@nick-nbpc:~$ clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3137.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               1
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 10 XT [Radeon RX Vega 64]
  Device Topology:               PCI[ B#47, D#0, F#0 ]
  Max compute units:                 56
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1590Mhz
  Address bits:                  64
  Max memory allocation:             7287183769
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26751
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                8573157376
  Constant buffer size:              7287183769
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2992216473
  Max global variable size:          7287183769
  Max global variable preferred total size:  8573157376
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7fe56aa5fcf0
  Name:                      gfx900
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3137.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

rocminfo

(base) nick@nick-nbpc:~$ rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 3700X 8-Core Processor 
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 3700X 8-Core Processor 
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   0                                  
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-02151e1bb9ee2144               
  Marketing Name:          Vega 10 XT [Radeon RX Vega 64]     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26751(0x687f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1590                               
  BDFID:                   12032                              
  Internal Node ID:        1                                  
  Compute Unit:            56                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***
8ftvxx2r

8ftvxx2r1#

我不能回答带宽问题,但我刚刚尝试了相同的基准(根据YouTube视频)
我得到:

(vrocm1) user1@t1000test:~$ rocm-bandwidth-test
........
          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)

          Device: 0,  AMD Ryzen 7 2700X Eight-Core Processor
          Device: 1,  Vega 10 XL/XT [Radeon RX Vega 56/64],  28:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         

          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         

          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         9.542044    

          1         9.028717    72.202459   

          Bdirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         17.144430   

          1         17.144430   N/A

和你的一样但是:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
给了我:

Done warm up
Step    Img/sec total_loss
1   images/sec: 172.0 +/- 0.0 (jitter = 0.0)    7.765
10  images/sec: 172.5 +/- 0.1 (jitter = 0.6)    8.049
20  images/sec: 172.6 +/- 0.1 (jitter = 0.4)    7.808
30  images/sec: 172.5 +/- 0.1 (jitter = 0.6)    7.976
40  images/sec: 172.6 +/- 0.1 (jitter = 0.5)    7.591
50  images/sec: 172.5 +/- 0.1 (jitter = 0.6)    7.549
60  images/sec: 172.6 +/- 0.1 (jitter = 0.5)    7.819
70  images/sec: 172.6 +/- 0.1 (jitter = 0.5)    7.819
80  images/sec: 172.6 +/- 0.1 (jitter = 0.5)    7.848
90  images/sec: 172.6 +/- 0.0 (jitter = 0.5)    8.025
100 images/sec: 172.5 +/- 0.0 (jitter = 0.5)    8.029
----------------------------------------------------------------
total images/sec: 172.39
----------------------------------------------------------------

客户信息

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3182.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               1
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 10 XL/XT [Radeon RX Vega 56/64]
  Device Topology:               PCI[ B#40, D#0, F#0 ]
  Max compute units:                 64
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1630Mhz
  Address bits:                  64
  Max memory allocation:             7287183769
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26751
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                8573157376
  Constant buffer size:              7287183769
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2992216473
  Max global variable size:          7287183769
  Max global variable preferred total size:  8573157376
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7efe04b66cd0
  Name:                      gfx900
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3182.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

罗明福

ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32898020(0x1f5fbe4) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32898020(0x1f5fbe4) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-021508a5025618e4               
  Marketing Name:          Vega 10 XL/XT [Radeon RX Vega 56/64]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26751(0x687f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1630                               
  BDFID:                   10240                              
  Internal Node ID:        1                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

唯一不同的是
执行代码:python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
在Python 2中执行测试。(也许这只是一个错字)
欢迎gspeet

相关问题