I've created a KEDA ScaledObject on a cloud GPU provider. The provider exposes various metrics through a Prometheus instance, for example:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: [name]
  namespace: [namespace]
spec:
  cooldownPeriod: 30
  fallback:
    failureThreshold: 20
    replicas: 0
  maxReplicaCount: 4
  minReplicaCount: 1
  pollingInterval: 15
  scaleTargetRef:
    name: [deployment]
  triggers:
  - metadata:
      metricName: gpu-util
      metricType: Value
      query: |-
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      serverAddress: [address]:9090
      threshold: '80'
    type: prometheus
DCGM_FI_DEV_GPU_UTIL is the NVIDIA metric that measures GPU utilization. This produces a ScaledObject that appears to be working correctly:
$ kubectl describe scaledobject [name] -n [namespace]
Name:         [name]
Namespace:    [namespace]
Labels:       scaledobject.keda.sh/name=[name]
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-04-28T01:27:50Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  36215438066
  UID:               [uid]
Spec:
  Cooldown Period:  30
  Fallback:
    Failure Threshold:  20
    Replicas:           0
  Max Replica Count:  4
  Min Replica Count:  1
  Polling Interval:   15
  Scale Target Ref:
    Name:  hashtop-1
  Triggers:
    Metadata:
      Metric Name:     gpu-util
      Namespace:       [namespace]
      Query:           avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      Server Address:  [url]:9090
      Threshold:       80
    Type:              prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
  External Metric Names:
    s0-prometheus-gpu-util
  Health:
    s0-prometheus-gpu-util:
      Number Of Failures:  0
      Status:              Happy
  Original Replica Count:  1
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type    Reason              Age                From           Message
  ----    ------              ---                ----           -------
  Normal  KEDAScalersStarted  78s                keda-operator  Started scalers watch
  Normal  ScaledObjectReady   63s (x2 over 78s)  keda-operator  ScaledObject is ready for scaling
When I run this query directly against Prometheus, I get the expected result:
# heavy utilization
$ curl '[url]/api/v1/query?query=avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL\[1m\]))' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1682728336,
          "99"
        ]
      }
    ]
  }
}
Here I can see that the result is "99", i.e. 99% utilization. When the GPU is idle, it quickly drops to "0".
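For reference, a quick way to see which labels the underlying series carry (for example a pod label, assuming the DCGM exporter attaches Kubernetes metadata) is a sketch like the following, which drops the outer avg():

# hypothetical check: list the label sets behind the aggregated value
$ curl -G '[url]/api/v1/query' --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m])' | jq '.data.result[].metric'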
However, the HorizontalPodAutoscaler that KEDA creates doesn't seem to match this data. For example, when I have one idle GPU and Prometheus returns "0" for the query above, the HPA looks like this:
# one GPU idle
$ kubectl get hpa
NAME    REFERENCE                 TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
[name]  Deployment/[deployment]   19/80 (avg)   1         4         1          28m
That value hovers around 18-20 but never leaves that range. With four GPUs pegged, the HPA reports an extremely high number:
# Four GPUs 99% utilization
$ kubectl get hpa
NAME    REFERENCE                 TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
[name]  Deployment/[deployment]   36117500m/80 (avg)   1         4         4          2m27s
So the autoscaling behavior I want never happens (36117500m is Kubernetes milli-unit notation for 36117.5, which bears no relation to the 0-100 utilization value Prometheus reports). Since this is on a cloud provider, I don't have direct access to the KEDA operator itself.
What can I change in the ScaledObject definition so that the HPA it creates scales on the Prometheus-based GPU utilization metric?
1 Answer
This is probably obvious to someone with more Prometheus experience, but through trial and error I solved it by restricting the query to a pod regex:
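The exact query isn't reproduced here; a minimal sketch of such a pod-scoped trigger, assuming the DCGM exporter attaches a pod label and that the workload's pods match a [deployment].* name pattern, could look like:

triggers:
- type: prometheus
  metadata:
    metricName: gpu-util
    metricType: Value
    query: |-
      avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"[deployment].*"}[1m]))
    serverAddress: [address]:9090
    threshold: '80'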
I can only guess that the data returned by the Prometheus API was scoped differently from the data the KEDA ScaledObject was receiving, which caused the mismatch.
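One way to check what KEDA is actually handing to the HPA, as a sketch (assuming the standard KEDA metrics adapter and the external metric name from the describe output above), is to query the external metrics API directly and compare it with the raw Prometheus result:

# ask the external metrics API for the value KEDA exposes to the HPA
$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/[namespace]/s0-prometheus-gpu-util?labelSelector=scaledobject.keda.sh%2Fname%3D[name]" | jq '.'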