Building a Prometheus + Grafana Monitoring Stack from Scratch

Monitoring is core SRE infrastructure. This article walks through building a production-ready monitoring stack with Prometheus and Grafana.

Architecture Overview

Application/System → Exporter → Prometheus → Grafana (visualization)
                                           → Alertmanager → DingTalk/Feishu/PagerDuty

Deploying Prometheus

Deploy with Helm (recommended)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-password

This chart bundles Prometheus, Grafana, Alertmanager, and the common exporters (node-exporter and kube-state-metrics).

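Core Configuration

# prometheus-values.yaml
# Apply with: helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --reuse-values -f prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 15d          # keep data for 15 days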
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd   # must match a StorageClass that exists in your cluster
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
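
Whether 50Gi covers 15 days of retention depends on how many samples Prometheus ingests. The Prometheus storage documentation gives a rough rule: needed disk ≈ retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below only does that arithmetic; the function name is hypothetical and the 20,000 samples/s figure is an assumed example, not a measurement.

```python
def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk estimate: retention * ingestion rate * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # 20_000 samples/s is an assumed ingestion rate used purely for illustration.
    needed = estimate_disk_bytes(retention_days=15, samples_per_second=20_000)
    print(f"{needed / 2**30:.1f} GiB")
```

On a running server, the real ingestion rate can be read with rate(prometheus_tsdb_head_samples_appended_total[5m]) and substituted for the assumed figure.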

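Configuring Exporters

Node Exporter (host metrics)

It ships with kube-prometheus-stack and collects CPU, memory, disk, and network metrics.

Custom application metrics

The application exposes a /metrics endpoint, and a ServiceMonitor tells the Prometheus Operator to scrape it:

# ServiceMonitor example
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # by default the chart's Prometheus only selects ServiceMonitors carrying the Helm release label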
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics        # the name of the Service port, not its number
    interval: 30s
    path: /metrics
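
For readers who have not seen what a /metrics endpoint returns, here is a minimal sketch using only the Python standard library. It illustrates the text exposition format and is not a recommended implementation: a real service would normally use an official client library. The counter values, the serve helper, and port 8000 are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters keyed by HTTP status; values here are placeholders.
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocking helper; the port is an assumption and must match the Service port named "metrics"."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The Kubernetes Service in front of the pod then needs the label app: my-app and a port named metrics for the ServiceMonitor selector to match.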

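Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # by default the chart's Prometheus only loads rules carrying the Helm release label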
spec:
  groups:
  - name: node
    rules:
    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node CPU usage is too high"
        description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}%"

    - alert: DiskSpaceLow
      expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 15
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Disk space is running low"
        description: "{{ $labels.instance }} {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"
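
The HighCPUUsage expression can look opaque at first. node_cpu_seconds_total{mode="idle"} is a per-core counter of idle seconds, so its rate is the fraction of time each core spent idle, and 100 minus the average of those fractions times 100 is the busy percentage. A rough sketch of that arithmetic for a single instance follows; the function name and sample numbers are hypothetical, and real rate() also handles counter resets and extrapolation, which this ignores.

```python
def cpu_usage_percent(idle_start: list, idle_end: list, window_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle_counter)) * 100 for one instance."""
    # Per-core idle fraction: idle seconds gained divided by the window length.
    idle_fractions = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Two cores over a 5-minute (300 s) window, each idle for 30 s (made-up values).
    usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2030.0], 300)
    print(f"{usage:.1f}% busy, alert condition met: {usage > 85}")
```

The alert fires only when this condition stays true for the whole for: 5m period, which filters out short spikes.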

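Configuring Feishu Notifications in Alertmanager

# With kube-prometheus-stack, this block goes under alertmanager.config in the Helm values
global: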
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: feishu

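receivers:
- name: feishu
  webhook_configs:
  # Alertmanager POSTs its own JSON schema, which the Feishu bot webhook does not accept as-is.
  # In practice this URL usually points at a small relay that reformats the payload and forwards it to the bot.
  - url: 'https://open.feishu.cn/open-apis/bot/v2/hook/YOUR-TOKEN'
    send_resolved: true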
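
Alertmanager's generic webhook sends a JSON document with an alerts array, while a Feishu custom bot expects a message such as {"msg_type": "text", "content": {"text": "..."}}. A relay between the two has to perform that conversion. The sketch below shows only the conversion step, under the assumption that a plain text message is enough; the function name and the message layout are hypothetical choices, and the HTTP plumbing around it is left out.

```python
import json


def alertmanager_to_feishu(payload: dict) -> dict:
    """Convert an Alertmanager webhook payload into a Feishu bot text message."""
    lines = []
    for alert in payload.get("alerts", []):
        status = alert.get("status", "unknown").upper()
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unnamed")
        severity = labels.get("severity", "none")
        # Prefer the longer description, fall back to the summary.
        detail = annotations.get("description") or annotations.get("summary", "")
        lines.append(f"[{status}] {name} ({severity}): {detail}")
    return {"msg_type": "text", "content": {"text": "\n".join(lines)}}


if __name__ == "__main__":
    # Sample payload trimmed to the fields used above; values are made up.
    sample = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "HighCPUUsage", "severity": "warning"},
                "annotations": {"description": "node-1 CPU usage is 91.2%"},
            }
        ],
    }
    print(json.dumps(alertmanager_to_feishu(sample), ensure_ascii=False))
```

A relay built on this would accept the POST from Alertmanager, run the conversion, and POST the result to the bot webhook URL.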

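Common PromQL Queries

# Cluster CPU utilization (0 to 1)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory utilization per node (0 to 1)
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Pod restarts (past hour)
increase(kube_pod_container_status_restarts_total[1h]) > 0

# HTTP error rate (sum both sides; dividing the raw series would only match 5xx series against themselves)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P99 response time (aggregate buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))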
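
These queries can also be run outside the Prometheus UI through the HTTP API endpoint /api/v1/query. A minimal sketch using only the standard library follows. The helper names are hypothetical, http://localhost:9090 is an assumed address (for example a local port-forward), and the parser assumes the query returns an instant vector.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(response: dict) -> list:
    """Return (labels, value) pairs from an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    # Each result item carries "metric" (labels) and "value" ([timestamp, "string value"]).
    return [
        (item["metric"], float(item["value"][1]))
        for item in response["data"]["result"]
    ]


def run_query(base_url: str, promql: str) -> list:
    """Send the query and parse the result; needs a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, run_query("http://localhost:9090", 'increase(kube_pod_container_status_restarts_total[1h]) > 0') would return one pair per restarting container, assuming the server is reachable at that address.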

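Recommended Grafana Dashboards

  • Node Exporter Full: ID 1860, host monitoring
  • Kubernetes Cluster: ID 7249, cluster overview
  • Kubernetes Pods: ID 6417, pod details

Once the stack is running, the ongoing work is refining the alert rules so that they produce neither alert storms nor missed alerts.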
