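Building a Prometheus + Grafana Monitoring Stack from Scratch
Monitoring is core SRE infrastructure. This post walks through building a production-ready monitoring stack with Prometheus and Grafana.
Architecture overview
Application/system → Exporter → Prometheus → Grafana (visualization)
                                           → Alertmanager → DingTalk/Feishu/PagerDuty
Deploying Prometheus
Deploy with Helm (recommended)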
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-password
This chart bundles Prometheus, Grafana, Alertmanager, and the commonly used exporters.
Core configuration
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d  # keep data for 15 days
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
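Whether 50Gi covers 15 days of retention depends on how many samples the server ingests. The Prometheus storage documentation gives a rough capacity formula: needed disk ≈ retention in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. A minimal sketch of that arithmetic follows; the function name and the 10,000 samples/s ingestion rate are assumptions for illustration, not measurements.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB size: retention_seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed ingestion rate. On a running server, read the real value from
    # rate(prometheus_tsdb_head_samples_appended_total[5m]).
    needed = estimate_disk_bytes(retention_days=15, samples_per_second=10_000)
    print(f"{needed / 2**30:.1f} GiB")
```

Under these assumed numbers the estimate comes to about 24 GiB, which leaves roughly half of a 50Gi volume as headroom for the WAL and compaction.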
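Configuring exporters
Node Exporter (host metrics)
kube-prometheus-stack ships with Node Exporter, which collects CPU, memory, disk, network, and other host metrics.
Custom application metrics
The application exposes a /metrics endpoint, and Prometheus discovers it through a ServiceMonitor:
# ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # by default the chart only selects ServiceMonitors carrying the Helm release label
spec:
  selector:  # matches Services in this namespace unless namespaceSelector is set
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics  # name of the Service port, not a port number
      interval: 30s
      path: /metrics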
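On the application side, the /metrics endpoint only has to return plain text in the Prometheus exposition format. The sketch below does this with the Python standard library to make the format visible; in practice the official prometheus_client library would normally be used instead. The counter store `REQUEST_COUNTS`, the `serve` helper, and port 8000 are hypothetical names chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real service would increment it per request.
REQUEST_COUNTS = {("GET", "200"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the metrics server; the port is an assumption and must match
    the Service port named "metrics" in the ServiceMonitor."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `http://localhost:8000/metrics` should return the counter lines that Prometheus scrapes every 30 seconds.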
Alert rule configuration
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # by default the chart only loads rules carrying the Helm release label
spec:
  groups:
    - name: node
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU usage is too high"
            description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}%"
        - alert: DiskSpaceLow
          expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Disk space is running low"
            description: "{{ $labels.instance }} {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"
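The HighCPUUsage expression is easier to reason about with numbers plugged in. The sketch below reproduces its arithmetic in simplified form: the real rate() function also extrapolates to the window edges and handles counter resets, which is ignored here. The function names and sample values are assumptions for illustration.

```python
def idle_rate(first_value, first_ts, last_value, last_ts):
    """Per-second increase of an idle-seconds counter between two samples
    (simplified stand-in for rate(), without reset handling)."""
    return (last_value - first_value) / (last_ts - first_ts)


def cpu_usage_percent(idle_rates):
    """100 - avg(idle rate across cores) * 100, as in the alert expression."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


if __name__ == "__main__":
    # Two cores observed over a 300 s window; each core's idle counter grew
    # by only 37.5 s and 18.75 s, so both cores were mostly busy.
    rates = [
        idle_rate(1000.0, 0, 1037.5, 300),
        idle_rate(2000.0, 0, 2018.75, 300),
    ]
    usage = cpu_usage_percent(rates)
    print(f"usage={usage:.1f}% fires={usage > 85}")
```

With these sample values the usage works out to about 90.6%, above the 85 threshold. The alert still only fires once the condition has held for the full `for: 5m` period.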
Configuring Alertmanager to notify Feishu
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: feishu
receivers:
  - name: feishu
    webhook_configs:
      # Alertmanager posts its own JSON schema; if Feishu rejects it,
      # point this URL at a relay that reformats the payload.
      - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/YOUR-TOKEN'
        send_resolved: true
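Alertmanager's generic webhook sends a JSON document with an `alerts` array, while Feishu custom bots expect a body shaped like `{"msg_type": "text", "content": {"text": "..."}}`, so a small relay is commonly placed between the two. The sketch below shows only the conversion and forwarding steps under that assumption; the function names and the message layout are hypothetical, and the HTTP server that would receive Alertmanager's POST is left out.

```python
import json
import urllib.request


def alertmanager_to_feishu(payload):
    """Convert an Alertmanager webhook payload into a Feishu text message."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")
        detail = annotations.get("description") or annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {detail}")
    return {"msg_type": "text", "content": {"text": "\n".join(lines)}}


def forward(payload, feishu_url):
    """POST the converted message to the Feishu bot webhook URL."""
    body = json.dumps(alertmanager_to_feishu(payload)).encode("utf-8")
    request = urllib.request.Request(
        feishu_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

With a relay like this in place, the `url` in `webhook_configs` would point at the relay's address, and the relay would hold the real Feishu hook URL.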
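Common PromQL queries
# Cluster CPU usage (as a 0-1 ratio)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Memory usage (as a 0-1 ratio)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Pod restarts in the past hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
# HTTP error rate (aggregate both sides; an unaggregated division only pairs identical label sets)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# P99 response time (sum buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))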
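The P99 query returns an estimate, not an exact measurement: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is a simplified reimplementation of that idea for classic histograms, assuming 0 < q <= 1, at least one finite bucket, and a lower bound of 0 for the first bucket. The bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total_count) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Made-up cumulative counts for le = 0.1, 0.5, 1.0, +Inf (seconds).
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.75, sample))
```

Because of the interpolation, the result can never be more precise than the bucket boundaries, so bucket layout matters as much as the query itself.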
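Recommended Grafana dashboards
- Node Exporter Full: ID 1860, host monitoring
- Kubernetes Cluster: ID 7249, cluster overview
- Kubernetes Pods: ID 6417, Pod details
Once the stack is running, the ongoing work is refining the alert rules so that they produce neither alert storms nor missed alerts.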