DeepSeek/Qwen 在 Kubernetes 上的生产级部署指南
把大模型跑起来很容易,但在 K8s 上稳定运行是另一回事。本文基于实际生产经验,介绍如何在 Kubernetes 上部署 DeepSeek R1 或 Qwen3,覆盖从 GPU 调度到自动扩缩容的完整链路。
架构概览
用户请求
↓
Ingress / Gateway API
↓
LLM Gateway(限流、鉴权、路由)
↓
vLLM Service(多副本)
↓
GPU Node Pool(A10/A100/H100)
↓
模型权重(PVC / 对象存储)
前置条件:GPU 节点准备
# 安装 NVIDIA GPU Operator(自动管理驱动和 device plugin)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true
# 验证 GPU 节点
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia.com/gpu
给 GPU 节点打标签,便于调度:
kubectl label node <gpu-node-1> node-role=gpu-inference
kubectl label node <gpu-node-1> gpu-type=a10
kubectl taint node <gpu-node-1> nvidia.com/gpu=present:NoSchedule
模型权重存储
模型文件较大(Qwen3-14B 约 28GB),需要提前下载并挂载:
# 使用 PVC 存储模型权重
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-weights-pvc
namespace: llm-serving
spec:
accessModes:
- ReadWriteMany # 多副本共享读
storageClassName: nfs-storage # 需要支持 RWX 的存储类
resources:
requests:
storage: 100Gi
# 用 Job 下载模型到 PVC
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
name: download-qwen3-14b
namespace: llm-serving
spec:
template:
spec:
containers:
- name: downloader
image: python:3.11-slim
command:
- sh
- -c
- |
pip install huggingface_hub -q
huggingface-cli download Qwen/Qwen3-14B \
--local-dir /models/qwen3-14b \
--local-dir-use-symlinks False
volumeMounts:
- name: models
mountPath: /models
env:
- name: HF_ENDPOINT
value: "https://hf-mirror.com" # 国内镜像
volumes:
- name: models
persistentVolumeClaim:
claimName: model-weights-pvc
restartPolicy: Never
EOF
vLLM 部署
vLLM 是目前性能最好的开源 LLM 推理引擎,支持 PagedAttention、连续批处理:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen3-14b
namespace: llm-serving
spec:
replicas: 2
selector:
matchLabels:
app: vllm-qwen3-14b
template:
metadata:
labels:
app: vllm-qwen3-14b
spec:
nodeSelector:
node-role: gpu-inference
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model=/models/qwen3-14b
- --served-model-name=qwen3-14b
- --tensor-parallel-size=1 # 单卡,多卡改为 2/4
- --max-model-len=32768
- --max-num-seqs=256 # 最大并发请求数
- --gpu-memory-utilization=0.90
- --enable-prefix-caching # 开启 KV Cache 复用
- --host=0.0.0.0
- --port=8000
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
memory: "40Gi"
requests:
nvidia.com/gpu: "1"
memory: "32Gi"
volumeMounts:
- name: models
mountPath: /models
readOnly: true
- name: shm
mountPath: /dev/shm
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 30
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
volumes:
- name: models
persistentVolumeClaim:
claimName: model-weights-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi # vLLM 需要共享内存
---
apiVersion: v1
kind: Service
metadata:
name: vllm-qwen3-14b
namespace: llm-serving
spec:
selector:
app: vllm-qwen3-14b
ports:
- port: 8000
targetPort: 8000
自动扩缩容(基于 GPU 利用率)
标准 HPA 不支持 GPU 指标,需要通过 DCGM Exporter + Prometheus Adapter:
# 安装 DCGM Exporter(GPU 指标采集)
helm install dcgm-exporter nvidia/dcgm-exporter \
--namespace gpu-operator \
--set serviceMonitor.enabled=true
# 基于自定义指标的 HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: llm-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-qwen3-14b
minReplicas: 1
maxReplicas: 4
metrics:
# 基于请求队列长度扩容(更实用)
- type: Pods
pods:
metric:
name: vllm_num_requests_waiting
target:
type: AverageValue
averageValue: "10" # 等待队列超过 10 个请求时扩容
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 1
periodSeconds: 120 # 每 2 分钟最多扩 1 个(GPU 节点启动慢)
scaleDown:
stabilizationWindowSeconds: 300 # 缩容前等待 5 分钟
LLM Gateway:限流与路由
在 vLLM 前面加一层网关,处理鉴权、限流、多模型路由:
# 使用 LiteLLM Proxy 作为 LLM Gateway
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm-proxy
namespace: llm-serving
spec:
replicas: 2
selector:
matchLabels:
app: litellm-proxy
template:
metadata:
labels:
app: litellm-proxy
spec:
containers:
- name: litellm
image: ghcr.io/berriai/litellm:main-latest
args: ["--config", "/config/config.yaml", "--port", "4000"]
ports:
- containerPort: 4000
volumeMounts:
- name: config
mountPath: /config
volumes:
- name: config
configMap:
name: litellm-config
---
apiVersion: v1
kind: ConfigMap
metadata:
name: litellm-config
namespace: llm-serving
data:
config.yaml: |
model_list:
- model_name: qwen3-14b
litellm_params:
model: openai/qwen3-14b
api_base: http://vllm-qwen3-14b:8000/v1
api_key: "none"
- model_name: deepseek-r1
litellm_params:
model: openai/deepseek-r1
api_base: http://vllm-deepseek-r1:8000/v1
api_key: "none"
router_settings:
routing_strategy: least-busy # 路由到最空闲的实例
general_settings:
master_key: "sk-your-master-key"
litellm_settings:
request_timeout: 600
num_retries: 2
监控:关键指标
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: vllm-metrics
namespace: llm-serving
spec:
selector:
matchLabels:
app: vllm-qwen3-14b
endpoints:
- port: "8000"
path: /metrics
interval: 15s
vLLM 暴露的关键指标:
# 查看 vLLM 指标
curl http://vllm-service:8000/metrics | grep -E "vllm_(e2e_request|num_requests|gpu_cache)"
# 关键指标:
# vllm:e2e_request_latency_seconds - 端到端延迟
# vllm:num_requests_running - 正在处理的请求数
# vllm:num_requests_waiting - 等待队列长度
# vllm:gpu_cache_usage_perc - KV Cache 使用率(>80% 需扩容)
# vllm:time_to_first_token_seconds - TTFT(首 token 延迟)
Grafana 告警规则:
groups:
- name: llm-serving
rules:
- alert: LLMHighWaitQueue
expr: vllm:num_requests_waiting > 20
for: 2m
annotations:
summary: "LLM 请求队列积压,考虑扩容"
- alert: LLMHighTTFT
expr: histogram_quantile(0.95, vllm:time_to_first_token_seconds_bucket) > 5
for: 5m
annotations:
summary: "P95 首 token 延迟超过 5 秒"
- alert: LLMGPUCacheNearFull
expr: vllm:gpu_cache_usage_perc > 0.85
for: 5m
annotations:
summary: "GPU KV Cache 使用率过高,可能影响吞吐"
生产经验总结
模型加载优化:
- 使用
--load-format safetensors加快加载速度 - 多副本共享 PVC(ReadWriteMany),避免重复下载
- 考虑使用
init container预热模型
稳定性:
- 设置合理的
--max-model-len,避免 OOM readinessProbe的initialDelaySeconds要足够长(模型加载需要 1-3 分钟)- 使用 PodDisruptionBudget 保证滚动更新时至少有一个副本可用
成本控制:
- 非高峰期用 KEDA 缩容到 0(冷启动约 3 分钟,适合非实时场景)
- 混合部署:高频请求用 GPU,低频用 CPU(Qwen3-7B 量化版可跑在 CPU 上)
# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: vllm-pdb
namespace: llm-serving
spec:
minAvailable: 1
selector:
matchLabels:
app: vllm-qwen3-14b
在 K8s 上运行大模型的核心挑战是 GPU 资源的调度和模型加载时间。一旦解决这两个问题,后续的扩缩容、监控、灰度发布都可以复用标准的 K8s 运维体系。