Kubernetes KEDA ScaledObject metric data differs from Prometheus

iq0todco · published 2023-05-06 in Kubernetes

I created a KEDA ScaledObject with a cloud GPU provider that exposes various metrics through a Prometheus instance, for example:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: [name]
  namespace: [namespace]
spec:
  cooldownPeriod: 30
  fallback:
    failureThreshold: 20
    replicas: 0
  maxReplicaCount: 4
  minReplicaCount: 1
  pollingInterval: 15
  scaleTargetRef:
    name: [deployment]
  triggers:
  - metadata:
      metricName: gpu-util
      metricType: Value
      query: |-
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      serverAddress: [address]:9090
      threshold: '80'
    type: prometheus
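
For context, the HPA that KEDA generates from a `Value`-type trigger scales with the standard Kubernetes control-loop rule, desiredReplicas = ceil(currentReplicas × currentValue / targetValue). A quick sketch of that arithmetic (the formula is standard HPA behavior; the sample numbers are taken from this question):

```python
import math

def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float) -> int:
    """Standard HPA rule for a 'Value'-type metric:
    desired = ceil(currentReplicas * currentValue / targetValue)."""
    return math.ceil(current_replicas * current_value / target_value)

# With the threshold of 80 above and a 99% utilization reading:
print(hpa_desired_replicas(1, 99, 80))  # 2 -> scale up
# With an idle GPU reading of 0:
print(hpa_desired_replicas(1, 0, 80))   # 0, then clamped up to minReplicaCount: 1
```

So as long as the metric the HPA actually receives tracks the Prometheus query, a busy GPU should scale up and an idle one should fall back to the minimum.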

DCGM_FI_DEV_GPU_UTIL is an NVIDIA metric measuring GPU utilization. This creates a ScaledObject that appears to be running normally:

$ kubectl describe scaledobject [name] -n [namespace]
Name:         [name]
Namespace:    [namespace]
Labels:       scaledobject.keda.sh/name=[name]
Annotations:  <none>
API Version:  keda.sh/v1alpha1
Kind:         ScaledObject
Metadata:
  Creation Timestamp:  2023-04-28T01:27:50Z
  Finalizers:
    finalizer.keda.sh
  Generation:        1
  Resource Version:  36215438066
  UID:               [uid]
Spec:
  Cooldown Period:  30
  Fallback:
    Failure Threshold:  20
    Replicas:           0
  Max Replica Count:    4
  Min Replica Count:    1
  Polling Interval:     15
  Scale Target Ref:
    Name:  hashtop-1
  Triggers:
    Metadata:
      Metric Name:     gpu-util
      Namespace:       [namespace]
      Query:           avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[1m]))
      Server Address:  [url]:9090
      Threshold:       80
    Type:              prometheus
Status:
  Conditions:
    Message:  ScaledObject is defined correctly and is ready for scaling
    Reason:   ScaledObjectReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Message:  No fallbacks are active on this scaled object
    Reason:   NoFallbackFound
    Status:   False
    Type:     Fallback
  External Metric Names:
    s0-prometheus-gpu-util
  Health:
    s0-prometheus-gpu-util:
      Number Of Failures:  0
      Status:              Happy
  Original Replica Count:  1
  Scale Target GVKR:
    Group:            apps
    Kind:             Deployment
    Resource:         deployments
    Version:          v1
  Scale Target Kind:  apps/v1.Deployment
Events:
  Type    Reason              Age                From           Message
  ----    ------              ----               ----           -------
  Normal  KEDAScalersStarted  78s                keda-operator  Started scalers watch
  Normal  ScaledObjectReady   63s (x2 over 78s)  keda-operator  ScaledObject is ready for scaling

When I run this query directly against Prometheus, I get the expected result:

# heavy utilization
$ curl '[url]/api/v1/query?query=avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL\[1m\]))' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1682728336,
          "99"
        ]
      }
    ]
  }
}

Here I can see the result is "99" % utilization. When the GPU is idle, it quickly drops to "0".
However, the HorizontalPodAutoscaler that KEDA creates does not seem to match this data. For example, when I have one idle GPU and Prometheus returns "0" for the query above, the HPA looks like this:

# one GPU idle
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   19/80 (avg)   1         4         1          28m

The value hovers around 18-20 and never leaves that range. With four pegged GPUs, the HPA reports a very large number:

# Four GPUs 99% utilization
$ kubectl get hpa
NAME                                 REFERENCE              TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
[name]   Deployment/[deployment]   36117500m/80 (avg)   1         4         4          2m27s
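
As an aside, the `36117500m` in the TARGETS column is Kubernetes milli-quantity notation, which the HPA uses for fractional metric values. A small Python sketch to decode it (this only decodes the notation; it does not explain why the reported value is so large):

```python
def parse_k8s_quantity(q: str) -> float:
    """Decode a Kubernetes decimal quantity: a trailing 'm' means milli-units,
    so '36117500m' is 36117500 / 1000 = 36117.5."""
    if q.endswith("m"):
        return int(q[:-1]) / 1000.0
    return float(q)

print(parse_k8s_quantity("36117500m"))  # 36117.5 -- vastly above the 80 target
print(parse_k8s_quantity("80"))         # 80.0
```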

As a result, the autoscaling behavior I want never happens. Since this is a cloud provider, I don't have direct access to the KEDA operator itself.
What can I change in the ScaledObject definition to create an HPA that scales based on the Prometheus GPU-utilization metric?

Answer 1 (ghhkc1vu):

This is probably obvious to someone with more Prometheus experience, but through trial and error I solved this by restricting the query to a pod regex:

query: |-
  sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"gpu-.*"}[1m]))

I can only guess that the data returned from the Prometheus API had a different scope than the data the KEDA ScaledObject was receiving, which caused the mismatch.
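
That guess can be made concrete: if the provider's Prometheus also scrapes other tenants' GPUs, the unfiltered `avg()` covers every `DCGM_FI_DEV_GPU_UTIL` series in the cluster, not just this deployment's. A minimal Python sketch, where the other tenants' utilization numbers are invented purely for illustration:

```python
# Hypothetical series values: one idle GPU of mine plus four GPUs belonging
# to other tenants. The other tenants' numbers are assumptions chosen only
# to show how an unfiltered avg() can sit near 19 while my GPU reads 0.
my_gpus = {"gpu-0": 0.0}
other_tenants = {"other-a": 30.0, "other-b": 25.0,
                 "other-c": 22.0, "other-d": 20.0}

all_series = {**my_gpus, **other_tenants}

# What the original, unfiltered query computes:
cluster_avg = sum(all_series.values()) / len(all_series)

# What the corrected, pod-filtered query computes (pod=~"gpu-.*"):
mine_only = sum(v for pod, v in all_series.items() if pod.startswith("gpu-"))

print(cluster_avg)  # 19.4 -- inside the 18-20 band the idle HPA reported
print(mine_only)    # 0.0  -- what an idle GPU should contribute
```

Under this reading, the pod-regex filter works because it pins the query to this deployment's own series regardless of what else the shared Prometheus is scraping.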
