Spark not utilizing the GPU: taskResourceAssignments Map(gpu -> [0])

Posted by v1l68za4 on 2021-07-09 in Spark

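I can see that tasks are being assigned a GPU, but GPU utilization is 0%. How can I get the job to actually use the GPU? I am running the master and one worker on a GPU server in standalone mode.
spark-submit command: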

spark-submit                                                                               \
--master spark://<ip>:7077                                                                \
--conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh                \
--conf spark.worker.resource.gpu.discoveryScript=/opt/getGpusResources.sh                  \
--conf spark.task.resource.gpu.amount=1                                                   \
--conf spark.executor.resource.gpu.amount=1                                                \
--conf spark.worker.resource.gpu.amount=1                                                  \
--class com.spark.Class                                                                   \
app.jar
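
For context on the discoveryScript settings above: Spark expects the discovery script to print a JSON object containing a resource name and an array of addresses to standard output. The contents of /opt/getGpusResources.sh are not shown in the question, so the following is only a minimal sketch of what such a script might look like, assuming nvidia-smi is on the PATH. The function name format_gpu_resources is hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for /opt/getGpusResources.sh (the real script is not shown).

# Read newline-separated GPU indices on stdin and print the JSON shape
# Spark expects from a discovery script, e.g.
# {"name": "gpu", "addresses": ["0","1"]}
format_gpu_resources() {
  awk 'BEGIN { printf "{\"name\": \"gpu\", \"addresses\": [" }
       NF    { printf "%s\"%s\"", (n++ ? "," : ""), $1 }
       END   { print "]}" }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index --format=csv,noheader | format_gpu_resources
fi
```

Running the script by hand on the worker and checking that it prints a single JSON line is a quick way to confirm that discovery itself works. It says nothing about whether the application code uses the GPU.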

Logs

21/03/30 23:19:25 INFO DAGScheduler: Submitting 10 missing tasks from ShuffleMapStage 251 (MapPartitionsRDD[306] at collect at ClusteringMetrics.scala:102) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/03/30 23:19:25 INFO TaskSchedulerImpl: Adding task set 251.0 with 10 tasks resource profile 0
21/03/30 23:19:25 INFO TaskSetManager: Starting task 0.0 in stage 251.0 (TID 2178) (<ip>, executor 0, partition 0, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_319_piece0 in memory on <ip>:34559 (size: 17.3 KiB, free: 4.0 GiB)
21/03/30 23:19:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 83 to <ip>:34520
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_316_piece0 in memory on <ip>:34559 (size: 547.0 B, free: 4.0 GiB)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 1.0 in stage 251.0 (TID 2179) (<ip>, executor 0, partition 1, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 0.0 in stage 251.0 (TID 2178) in 225 ms on <ip> (executor 0) (1/10)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 2.0 in stage 251.0 (TID 2180) (<ip>, executor 0, partition 2, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 1.0 in stage 251.0 (TID 2179) in 181 ms on <ip> (executor 0) (2/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 3.0 in stage 251.0 (TID 2181) (<ip>, executor 0, partition 3, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 2.0 in stage 251.0 (TID 2180) in 226 ms on <ip> (executor 0) (3/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 4.0 in stage 251.0 (TID 2182) (<ip>, executor 0, partition 4, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 3.0 in stage 251.0 (TID 2181) in 187 ms on <ip> (executor 0) (4/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 5.0 in stage 251.0 (TID 2183) (<ip>, executor 0, partition 5, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 4.0 in stage 251.0 (TID 2182) in 180 ms on <ip> (executor 0) (5/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 6.0 in stage 251.0 (TID 2184) (<ip>, executor 0, partition 6, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 5.0 in stage 251.0 (TID 2183) in 179 ms on <ip> (executor 0) (6/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 7.0 in stage 251.0 (TID 2185) (<ip>, executor 0, partition 7, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 6.0 in stage 251.0 (TID 2184) in 179 ms on <ip> (executor 0) (7/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 8.0 in stage 251.0 (TID 2186) (<ip>, executor 0, partition 8, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 7.0 in stage 251.0 (TID 2185) in 216 ms on <ip> (executor 0) (8/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 9.0 in stage 251.0 (TID 2187) (<ip>, executor 0, partition 9, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 8.0 in stage 251.0 (TID 2186) in 179 ms on <ip> (executor 0) (9/10)
21/03/30 23:19:27 INFO TaskSetManager: Finished task 9.0 in stage 251.0 (TID 2187) in 179 ms on <ip> (executor 0) (10/10)
21/03/30 23:19:27 INFO TaskSchedulerImpl: Removed TaskSet 251.0, whose tasks have all completed, from pool 
21/03/30 23:19:27 INFO DAGScheduler: ShuffleMapStage 251 (collect at ClusteringMetrics.scala:102) finished in 1.934 s
21/03/30 23:19:27 INFO DAGScheduler: looking for newly runnable stages
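
The taskResourceAssignments entries in the log above record which GPU address the scheduler handed to each task. As far as I understand, they do not show that the task code ran anything on the device; that depends on the application calling a GPU-aware library. To summarize the assignments in a longer driver log, a small sketch like the following could be used. The function name count_gpu_assignments is hypothetical, and the pattern assumes the exact log format shown above.

```shell
# Read a Spark driver log on stdin and print "<addresses> <count>" for each
# distinct GPU address list that appears in a taskResourceAssignments entry.
count_gpu_assignments() {
  sed -n 's/.*taskResourceAssignments Map(gpu -> \[name: gpu, addresses: \([^]]*\)]).*/\1/p' \
    | awk '{ n[$0]++ } END { for (a in n) print a, n[a] }' \
    | sort
}
```

For example, `count_gpu_assignments < driver.log` (driver.log being a placeholder file name) would list address 0 alongside the number of task launches that were assigned it.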

Specs
I am using an AWS EC2 g4dn machine.

GPU: TU104GL [Tesla T4]   
15109MiB  
Driver Version: 460.32.03  
CUDA Version: 11.2

1 worker: 1 core, 7GB of memory.
