DeepSeek/Qwen 在 Kubernetes 上的生产级部署指南

把大模型跑起来很容易,但在 K8s 上稳定运行是另一回事。本文基于实际生产经验,介绍如何在 Kubernetes 上部署 DeepSeek R1 或 Qwen3,覆盖从 GPU 调度到自动扩缩容的完整链路。

架构概览

用户请求

Ingress / Gateway API

LLM Gateway(限流、鉴权、路由)

vLLM Service(多副本)

GPU Node Pool(A10/A100/H100)

模型权重(PVC / 对象存储)

前置条件:GPU 节点准备

# 安装 NVIDIA GPU Operator(自动管理驱动和 device plugin)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

# 验证 GPU 节点
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia.com/gpu

给 GPU 节点打标签,便于调度:

kubectl label node <gpu-node-1> node-role=gpu-inference
kubectl label node <gpu-node-1> gpu-type=a10
kubectl taint node <gpu-node-1> nvidia.com/gpu=present:NoSchedule

模型权重存储

模型文件较大(Qwen3-14B 约 28GB),需要提前下载并挂载:

# 使用 PVC 存储模型权重
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc
  namespace: llm-serving
spec:
  accessModes:
  - ReadWriteMany    # 多副本共享读
  storageClassName: nfs-storage   # 需要支持 RWX 的存储类
  resources:
    requests:
      storage: 100Gi
# 用 Job 下载模型到 PVC
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: download-qwen3-14b
  namespace: llm-serving
spec:
  template:
    spec:
      containers:
      - name: downloader
        image: python:3.11-slim
        command:
        - sh
        - -c
        - |
          pip install huggingface_hub -q
          huggingface-cli download Qwen/Qwen3-14B \
            --local-dir /models/qwen3-14b \
            --local-dir-use-symlinks False
        volumeMounts:
        - name: models
          mountPath: /models
        env:
        - name: HF_ENDPOINT
          value: "https://hf-mirror.com"   # 国内镜像
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-weights-pvc
      restartPolicy: Never
EOF

vLLM 部署

vLLM 是目前性能最好的开源 LLM 推理引擎,支持 PagedAttention、连续批处理:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-14b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen3-14b
  template:
    metadata:
      labels:
        app: vllm-qwen3-14b
    spec:
      nodeSelector:
        node-role: gpu-inference
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=/models/qwen3-14b
        - --served-model-name=qwen3-14b
        - --tensor-parallel-size=1    # 单卡,多卡改为 2/4
        - --max-model-len=32768
        - --max-num-seqs=256          # 最大并发请求数
        - --gpu-memory-utilization=0.90
        - --enable-prefix-caching     # 开启 KV Cache 复用
        - --host=0.0.0.0
        - --port=8000
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "40Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"
        volumeMounts:
        - name: models
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-weights-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi    # vLLM 需要共享内存

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen3-14b
  namespace: llm-serving
spec:
  selector:
    app: vllm-qwen3-14b
  ports:
  - port: 8000
    targetPort: 8000

自动扩缩容(基于 GPU 利用率)

标准 HPA 不支持 GPU 指标,需要通过 DCGM Exporter + Prometheus Adapter:

# 安装 DCGM Exporter(GPU 指标采集)
helm install dcgm-exporter nvidia/dcgm-exporter \
  --namespace gpu-operator \
  --set serviceMonitor.enabled=true
# 基于自定义指标的 HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen3-14b
  minReplicas: 1
  maxReplicas: 4
  metrics:
  # 基于请求队列长度扩容(更实用)
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "10"   # 等待队列超过 10 个请求时扩容
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120   # 每 2 分钟最多扩 1 个(GPU 节点启动慢)
    scaleDown:
      stabilizationWindowSeconds: 300  # 缩容前等待 5 分钟

LLM Gateway:限流与路由

在 vLLM 前面加一层网关,处理鉴权、限流、多模型路由:

# 使用 LiteLLM Proxy 作为 LLM Gateway
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-latest
        args: ["--config", "/config/config.yaml", "--port", "4000"]
        ports:
        - containerPort: 4000
        volumeMounts:
        - name: config
          mountPath: /config
      volumes:
      - name: config
        configMap:
          name: litellm-config

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: llm-serving
data:
  config.yaml: |
    model_list:
      - model_name: qwen3-14b
        litellm_params:
          model: openai/qwen3-14b
          api_base: http://vllm-qwen3-14b:8000/v1
          api_key: "none"

      - model_name: deepseek-r1
        litellm_params:
          model: openai/deepseek-r1
          api_base: http://vllm-deepseek-r1:8000/v1
          api_key: "none"

    router_settings:
      routing_strategy: least-busy   # 路由到最空闲的实例

    general_settings:
      master_key: "sk-your-master-key"

    litellm_settings:
      request_timeout: 600
      num_retries: 2

监控:关键指标

# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: llm-serving
spec:
  selector:
    matchLabels:
      app: vllm-qwen3-14b
  endpoints:
  - port: "8000"
    path: /metrics
    interval: 15s

vLLM 暴露的关键指标:

# 查看 vLLM 指标
curl http://vllm-service:8000/metrics | grep -E "vllm_(e2e_request|num_requests|gpu_cache)"

# 关键指标:
# vllm:e2e_request_latency_seconds    - 端到端延迟
# vllm:num_requests_running           - 正在处理的请求数
# vllm:num_requests_waiting           - 等待队列长度
# vllm:gpu_cache_usage_perc           - KV Cache 使用率(>80% 需扩容)
# vllm:time_to_first_token_seconds    - TTFT(首 token 延迟)

Grafana 告警规则:

groups:
- name: llm-serving
  rules:
  - alert: LLMHighWaitQueue
    expr: vllm:num_requests_waiting > 20
    for: 2m
    annotations:
      summary: "LLM 请求队列积压,考虑扩容"

  - alert: LLMHighTTFT
    expr: histogram_quantile(0.95, vllm:time_to_first_token_seconds_bucket) > 5
    for: 5m
    annotations:
      summary: "P95 首 token 延迟超过 5 秒"

  - alert: LLMGPUCacheNearFull
    expr: vllm:gpu_cache_usage_perc > 0.85
    for: 5m
    annotations:
      summary: "GPU KV Cache 使用率过高,可能影响吞吐"

生产经验总结

模型加载优化:

  • 使用 --load-format safetensors 加快加载速度
  • 多副本共享 PVC(ReadWriteMany),避免重复下载
  • 考虑使用 init container 预热模型

稳定性:

  • 设置合理的 --max-model-len,避免 OOM
  • readinessProbeinitialDelaySeconds 要足够长(模型加载需要 1-3 分钟)
  • 使用 PodDisruptionBudget 保证滚动更新时至少有一个副本可用

成本控制:

  • 非高峰期用 KEDA 缩容到 0(冷启动约 3 分钟,适合非实时场景)
  • 混合部署:高频请求用 GPU,低频用 CPU(Qwen3-7B 量化版可跑在 CPU 上)
# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm-serving
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm-qwen3-14b

在 K8s 上运行大模型的核心挑战是 GPU 资源的调度和模型加载时间。一旦解决这两个问题,后续的扩缩容、监控、灰度发布都可以复用标准的 K8s 运维体系。

← 返回文章列表