Monitoring and Observability Best Practices in Cloud-Native Environments

张开发
2026/4/11 0:44:45 · 15 min read

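Hardcore opener

Fellow engineers, today we're talking about monitoring and observability best practices in cloud-native environments. Skip the theory, let's get straight to the practical stuff. In the cloud-native era, monitoring and observability have become core requirements of system operations. Without them, you may not detect and fix problems in time, which leads to service outages and heavy losses.

Core concepts

What is observability

Observability is the ability to understand a system's internal state from its external outputs: logs, metrics, and traces. In a cloud-native environment it rests on three pillars: logging, metrics, and tracing.

Monitoring vs. observability

Monitoring means collecting and analyzing system metrics to detect abnormal states and raise alerts. Observability is the broader concept: it includes monitoring, and it also uses logs and traces to understand how the system behaves and performs.

Practical guide

1. Metrics monitoring

Prometheus deployment

```bash
# Install Prometheus
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace

# Check Prometheus status
kubectl get pods -n monitoring
```

Metric scrape configuration

```yaml
# ServiceMonitor is a Prometheus Operator resource, so the operator CRDs
# (for example from the kube-prometheus-stack chart) must be installed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
spec:
  # Without namespaceSelector, only Services in the ServiceMonitor's own
  # namespace are selected; myapp runs in "default".
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics      # name of the Service port to scrape
      interval: 15s
```

2. Log management

Loki deployment

```bash
# Install Loki
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki --namespace monitoring

# Install Promtail
helm install promtail grafana/promtail --namespace monitoring
```

Log collection configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Copy the pod's "app" label onto each log stream
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
```

3. Distributed tracing

Jaeger deployment

```bash
# Install Jaeger
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm install jaeger jaegertracing/jaeger --namespace monitoring
```

Tracing configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          env:
            - name: JAEGER_SERVICE_NAME
              value: myapp
            - name: JAEGER_AGENT_HOST
              value: jaeger-agent.monitoring.svc.cluster.local
            - name: JAEGER_AGENT_PORT
              value: "6831"   # env values must be strings, so the port is quoted
```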
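The ServiceMonitor in section 1 scrapes a port named metrics every 15s, which assumes the application serves its metrics in the Prometheus text exposition format on that port. As a rough illustration of what such a response body looks like, here is a minimal standard-library sketch. The helper names `escape_label_value` and `render_counter` are hypothetical, and the help text is assumed to be a single line. A real service would normally use an official Prometheus client library instead of formatting the text by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs; help_text is assumed
    to contain no newlines or backslashes.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is stable between scrapes.
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"handler": "/api", "method": "GET"}, 1027)],
    )
    print(body, end="")
```

Each scrape returns the current cumulative value of the counter. Turning those values into per-second rates happens later, at query time.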
4. Visualization

Grafana deployment

```bash
# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana --namespace monitoring

# Check Grafana status
kubectl get pods -n monitoring

# Expose the Grafana service locally
kubectl port-forward svc/grafana -n monitoring 3000:80
```

Dashboard configuration

```json
{
  "annotations": { "list": [] },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "hiddenSeries": false,
      "id": 2,
      "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": { "alertThreshold": true },
      "percentage": false,
      "pluginVersion": "7.5.1",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        { "expr": "rate(http_requests_total{job=\"myapp\"}[5m])", "interval": "", "legendFormat": "{{handler}}", "refId": "A" }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "HTTP Requests",
      "tooltip": { "shared": true, "sort": 0, "value_type": "individual" },
      "type": "graph",
      "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] },
      "yaxes": [
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true },
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }
      ],
      "yaxis": { "align": false, "alignLevel": null }
    }
  ],
  "refresh": "10s",
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": { "list": [] },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "My App Dashboard",
  "uid": "myapp-dashboard",
  "version": 1
}
```
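The dashboard panel above plots `rate(http_requests_total{job="myapp"}[5m])`. To show what that expression roughly computes, here is a simplified sketch of a per-second rate over the counter samples inside one window. The function name `simple_rate` is hypothetical. The sketch handles counter resets but deliberately leaves out the extrapolation to the window boundaries that Prometheus's real `rate()` performs, so its results will differ slightly from what Grafana shows.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value) tuples, sorted by
    time and all lying inside the query window (for example the last 5m).
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None

    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter dropped, so the process restarted and the counter
            # began again from zero: count the new value as the increase.
            increase += current
        else:
            increase += current - previous

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # One scrape per minute; the counter grows by 60 requests each time.
    print(simple_rate([(0, 100), (60, 160), (120, 220)]))  # 1.0 requests/s
```

This is also why the panel queries `rate(...)` rather than the raw counter: the raw value only ever grows and restarts at zero on every deploy, while the rate stays comparable across restarts.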
5. Alerting configuration

Prometheus alert rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: app
      rules:
        - alert: HighCPUUsage
          expr: avg(cpu_usage{job="myapp"}) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage"
            description: "CPU usage is above 80% for 5 minutes"
        - alert: HighMemoryUsage
          expr: avg(memory_usage{job="myapp"}) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage"
            description: "Memory usage is above 80% for 5 minutes"
        - alert: ServiceDown
          expr: up{job="myapp"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Service down"
            description: "Service {{ $labels.instance }} is down for 5 minutes"
```

Alertmanager configuration

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  # Pick up AlertmanagerConfig objects labeled app=alertmanager
  alertmanagerConfigSelector:
    matchLabels:
      app: alertmanager
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 100m
      memory: 100Mi
---
# AlertmanagerConfig is served under v1alpha1, not v1
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  namespace: monitoring
  labels:
    app: alertmanager
spec:
  receivers:
    - name: email
      emailConfigs:
        - to: admin@example.com
          from: alertmanager@example.com
          smarthost: smtp.example.com:587
          authUsername: alertmanager
          authPassword:
            name: smtp-secret   # Secret holding the SMTP password
            key: password
  route:
    groupBy:
      - alertname
      - cluster
      - service
    groupInterval: 5m
    groupWait: 30s
    receiver: email
    repeatInterval: 1h
```
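Each rule above carries `for: 5m`, meaning the expression has to stay true across consecutive evaluations for five minutes before the alert actually fires. The following is a simplified sketch of that pending-to-firing behavior, not Prometheus's actual implementation. The function name `alert_states` and the one-minute evaluation interval in the usage example are assumptions made for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) tuples in
    time order. for_seconds: the rule's "for" duration in seconds.
    """
    active_since = None  # when the condition first became true
    states = []
    for timestamp, condition_is_true in evaluations:
        if not condition_is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


if __name__ == "__main__":
    # Condition true on six evaluations, one minute apart, with for: 5m.
    evaluations = [(t, True) for t in range(0, 360, 60)]
    print(alert_states(evaluations, for_seconds=300))
    # ['pending', 'pending', 'pending', 'pending', 'pending', 'firing']
```

A short blip that clears before the `for` window elapses never reaches Alertmanager, which is the main reason to set `for` on noisy signals such as CPU usage.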
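6. Observability best practices

Unified observability platform

```bash
# Install Grafana Agent
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana-agent grafana/grafana-agent --namespace monitoring
```

Observability configuration

```yaml
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: monitoring
spec:
  metrics:
    instanceSelector:
      matchLabels:
        app: myapp
    prometheus:
      externalLabels:
        cluster: production
      remoteWrite:
        - url: http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/write
  logs:
    instanceSelector:
      matchLabels:
        app: myapp
    loki:
      clients:
        - url: http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push
  traces:
    instanceSelector:
      matchLabels:
        app: myapp
    otlp:
      protocols:
        grpc:
          endpoint: jaeger-collector.monitoring.svc.cluster.local:4317
```

Best practices

1. Metrics monitoring

- Choose metrics deliberately: pick key metrics tied to the business and avoid monitoring piles of irrelevant ones.
- Set sensible alert thresholds: base thresholds on how the system actually behaves.
- Use labels: classify metrics with labels so they are easy to query and analyze.
- Clean up metrics regularly: remove metrics that are no longer needed to avoid metric explosion.
- Aggregate metrics: aggregate where it makes sense to reduce storage and query load.

2. Log management

- Structured logs: use a structured log format so logs are easy to analyze and query.
- Log levels: set log levels sensibly and avoid floods of low-level logs.
- Log rotation: configure rotation so log files do not grow too large.
- Log compression: compress logs to reduce storage usage.
- Retention policy: set a sensible retention policy that balances storage cost against traceability.

3. Distributed tracing

- End-to-end tracing: make sure every service is instrumented so the whole request path is observable.
- Sampling rate: set the sampling rate according to the system's traffic and performance requirements.
- Context propagation: make sure trace context is passed correctly between services.
- Business tags: add business-related tags to traces to make business flows easier to analyze.
- Trace analysis: review trace data regularly to find performance bottlenecks.

4. Visualization

- Unified dashboards: build unified dashboards that show the system's key metrics in one place.
- Custom dashboards: build custom dashboards for the needs of different roles.
- Real-time monitoring: make sure dashboards refresh in real time and reflect the current system state.
- Alert integration: surface alert information in dashboards so it is easy to review and act on.
- Dashboard sharing: share dashboards with the relevant team members to improve collaboration.

5. Alert management

- Tiered alerts: assign alert levels according to severity.
- Alert aggregation: group related alerts to avoid alert storms.
- Alert inhibition: suppress unnecessary alerts in specific situations.
- Alert routing: route alerts to different receivers based on type and level.
- Alert automation: handle common alerts automatically to reduce manual intervention.

6. Observability platform

- Unified platform: use one observability platform that integrates metrics, logs, and traces.
- Standardization: standardize the format and collection method of observability data.
- Automation: automate the deployment and management of observability configuration.
- Scalability: make sure the platform can scale as the system grows.
- Security: secure the observability platform and protect sensitive data.

Case study

Case: observability at a fintech company

Background: A fintech company needed to build a highly available, scalable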
