# DevOps Engineer

The DevOps engineer is Claude Code's infrastructure and automation specialist, responsible for CI/CD pipelines, containerization, cloud deployment, and system monitoring.

## Core Responsibilities

### Key Capabilities

- CI/CD pipelines: design and implement continuous integration / continuous deployment workflows
- Containerization: application deployment with Docker and Kubernetes
- Infrastructure as code: tooling such as Terraform and Ansible
- Monitoring and logging: system monitoring, log management, and alert configuration

### Areas of Expertise

- Automated deployment workflows
- Cloud service management (AWS/GCP/Azure)
- Container orchestration and microservices
- Performance monitoring and optimization

## Usage Scenarios

### When to Use the DevOps Engineer

#### Suitable Scenarios
```bash
# CI/CD setup
"Set up GitHub Actions automated deployment for this project"

# Containerization
"Containerize this application and create a Docker Compose configuration"

# Cloud deployment
"Deploy the application to AWS ECS and configure auto scaling"

# Monitoring setup
"Configure Prometheus and Grafana monitoring"
```

#### Unsuitable Scenarios
```bash
# Code development (use the executor instead)
"Implement a new API endpoint"

# Architecture design (use the architect instead)
"Design a microservices architecture"

# Debugging (use the debugger instead)
"Fix the application crash"
```

## DevOps Capabilities
### 1. CI/CD Pipeline Design

#### GitHub Actions Workflow

```yaml
# .github/workflows/deploy.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '18'
  DOCKER_REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Code quality checks
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run linting
        run: npm run lint
      - name: Run type checking
        run: npm run type-check
      - name: Run security audit
        run: npm audit --audit-level=moderate

  # Tests
  test:
    runs-on: ubuntu-latest
    needs: quality-check
    strategy:
      matrix:
        test-suite: [unit, integration, e2e]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run ${{ matrix.test-suite }} tests
        run: npm run test:${{ matrix.test-suite }}
      - name: Upload coverage
        if: matrix.test-suite == 'unit'
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage/lcov.info
          flags: unittests

  # Build and push the Docker image
  build-and-push:
    runs-on: ubuntu-latest
    needs: test
    if: github.event_name == 'push'
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.DOCKER_REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=sha,prefix={{branch}}-
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            NODE_VERSION=${{ env.NODE_VERSION }}
            BUILD_DATE=${{ github.event.head_commit.timestamp }}
            VCS_REF=${{ github.sha }}

  # Deploy to an environment
  deploy:
    runs-on: ubuntu-latest
    needs: build-and-push
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
    environment:
      name: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
      url: ${{ steps.deploy.outputs.url }}
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy to ECS
        id: deploy
        run: |
          # Register the updated ECS task definition
          aws ecs register-task-definition \
            --cli-input-json file://ecs-task-definition.json

          # Update the service
          aws ecs update-service \
            --cluster ${{ vars.ECS_CLUSTER }} \
            --service ${{ vars.ECS_SERVICE }} \
            --task-definition $(aws ecs describe-task-definition \
              --task-definition ${{ vars.TASK_DEFINITION_FAMILY }} \
              --query 'taskDefinition.taskDefinitionArn' \
              --output text)

          # Wait for the deployment to stabilize
          aws ecs wait services-stable \
            --cluster ${{ vars.ECS_CLUSTER }} \
            --services ${{ vars.ECS_SERVICE }}

          # Output the application URL
          echo "url=https://${{ vars.APP_DOMAIN }}" >> $GITHUB_OUTPUT
      - name: Notify deployment
        uses: slackapi/slack-github-action@v1
        with:
          webhook-url: ${{ secrets.SLACK_WEBHOOK }}
          payload: |
            {
              "text": "Deployment succeeded",
              "blocks": [{
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "*Environment:* ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}\n*Version:* ${{ needs.build-and-push.outputs.image-tag }}\n*Deployed by:* ${{ github.actor }}\n*URL:* ${{ steps.deploy.outputs.url }}"
                }
              }]
            }
```

### 2. Docker Containerization
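An aside on the workflow in the previous section: its deploy job only runs for `main` and `develop`, and picks the target environment with the GitHub Actions expression `github.ref == 'refs/heads/main' && 'production' || 'staging'`. A sketch of that selection logic in plain JavaScript (illustrative only, not the Actions runtime):

```javascript
// Mirror the deploy job's `if:` gate plus its `environment.name` ternary:
// only main and develop deploy; main goes to production, develop to staging.
function deployTarget(ref) {
  if (ref !== 'refs/heads/main' && ref !== 'refs/heads/develop') {
    return null; // any other ref: the deploy job is skipped
  }
  return ref === 'refs/heads/main' ? 'production' : 'staging';
}
```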
#### Multi-Stage Dockerfile

```dockerfile
# Dockerfile
# Build stage
FROM node:18-alpine AS builder

# Install build dependencies
RUN apk add --no-cache python3 make g++

WORKDIR /app

# Copy dependency manifests
COPY package*.json ./
COPY yarn.lock* ./
COPY pnpm-lock.yaml* ./

# Install dependencies
RUN if [ -f yarn.lock ]; then yarn install --frozen-lockfile; \
    elif [ -f pnpm-lock.yaml ]; then corepack enable && pnpm install --frozen-lockfile; \
    else npm ci; fi

# Copy source code
COPY . .

# Build the application
RUN npm run build

# Remove development dependencies
RUN npm prune --production

# Runtime stage
FROM node:18-alpine AS runner

# Security: create a non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nodejs -u 1001

# Install runtime dependencies
RUN apk add --no-cache dumb-init

WORKDIR /app

# Copy only what the runtime needs
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package*.json ./

# Environment variables
ENV NODE_ENV production
ENV PORT 3000

# Expose the application port
EXPOSE 3000

# Switch to the non-root user
USER nodejs

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD node healthcheck.js || exit 1

# Use dumb-init to avoid zombie processes
ENTRYPOINT ["dumb-init", "--"]

# Start the application
CMD ["node", "dist/server.js"]
```

#### Docker Compose Configuration
```yaml
# docker-compose.yml
version: '3.9'

services:
  # Application service
  app:
    build:
      context: .
      dockerfile: Dockerfile
      cache_from:
        - ${DOCKER_REGISTRY}/app:latest
    image: ${DOCKER_REGISTRY}/app:${VERSION:-latest}
    container_name: app
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=${NODE_ENV:-production}
      - DATABASE_URL=postgresql://user:pass@postgres:5432/dbname
      - REDIS_URL=redis://redis:6379
      - JWT_SECRET=${JWT_SECRET}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - app-network
    volumes:
      - app-uploads:/app/uploads
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M

  # PostgreSQL database
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    restart: unless-stopped
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=dbname
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init-db.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - app-network
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Redis cache
  redis:
    image: redis:7-alpine
    container_name: redis
    restart: unless-stopped
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis-data:/data
    networks:
      - app-network
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Nginx reverse proxy
  nginx:
    image: nginx:alpine
    container_name: nginx
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro
      - nginx-cache:/var/cache/nginx
    depends_on:
      - app
    networks:
      - app-network
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  # Prometheus monitoring
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    networks:
      - app-network

  # Grafana dashboards
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - app-network

networks:
  app-network:
    driver: bridge

volumes:
  postgres-data:
  redis-data:
  app-uploads:
  nginx-cache:
  prometheus-data:
  grafana-data:
```

### 3. Kubernetes Deployment
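A note on the Compose file above: it leans heavily on `${VAR:-default}` substitution (e.g. `${VERSION:-latest}`, `${GRAFANA_USER:-admin}`). Simplified semantics of that syntax, sketched in JavaScript (Compose's full rules cover more forms, such as `${VAR-default}` and `${VAR:?err}`):

```javascript
// Simplified ${NAME} / ${NAME:-default} interpolation as Docker Compose
// applies it: with `:-`, an unset or empty variable falls back to the default;
// a bare ${NAME} that is unset becomes an empty string.
function interpolate(template, env) {
  return template.replace(/\$\{(\w+)(?::-([^}]*))?\}/g, (_, name, def) => {
    const value = env[name];
    return value !== undefined && value !== '' ? value : (def ?? '');
  });
}
```

For example, `interpolate('app:${VERSION:-latest}', {})` yields `'app:latest'`, while setting `VERSION=1.2.0` in the environment yields `'app:1.2.0'`.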
#### K8s Deployment Configuration

```yaml
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  namespace: production
  labels:
    app: myapp
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: app-service-account
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
      # Init container
      initContainers:
        - name: migration
          image: myapp:latest
          command: ["npm", "run", "migrate"]
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
      containers:
        - name: app
          image: myapp:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 3000
              protocol: TCP
          # Environment variables
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3000"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: database-url
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: jwt-secret
          # Resource limits
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Liveness probe
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            successThreshold: 1
            failureThreshold: 3
          # Startup probe
          startupProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30
          # Volume mounts
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
            - name: cache
              mountPath: /app/cache
      # Volume definitions
      volumes:
        - name: config
          configMap:
            name: app-config
        - name: cache
          emptyDir:
            sizeLimit: 1Gi
      # Affinity rules
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - myapp
                topologyKey: kubernetes.io/hostname
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: production
  labels:
    app: myapp
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: myapp
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
---
# Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: production
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: app-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```

### 4. Infrastructure as Code (Terraform)
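A note on the HPA in the manifests above: it targets 70% CPU and 80% memory utilization between 3 and 10 replicas. Per the Kubernetes documentation, the controller's core rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. A sketch:

```javascript
// Kubernetes HPA scaling rule (simplified: ignores the stabilization windows
// and scale-up/scale-down rate policies configured under `behavior`).
function desiredReplicas(current, currentUtilization, targetUtilization, min, max) {
  const desired = Math.ceil(current * (currentUtilization / targetUtilization));
  return Math.min(max, Math.max(min, desired)); // clamp to [minReplicas, maxReplicas]
}
```

For example, 3 replicas averaging 140% CPU against a 70% target scale to 6 replicas; a drop to 35% would compute 2, but the 3-replica floor holds.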
#### AWS Infrastructure

```hcl
# terraform/main.tf
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-bucket"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC configuration
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.project_name}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.aws_region}a", "${var.aws_region}b", "${var.aws_region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  enable_vpn_gateway   = true
  enable_dns_hostnames = true

  tags = local.common_tags
}

# ECS cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = local.common_tags
}

# ECS task definition
resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project_name}-app"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = aws_iam_role.ecs_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn

  container_definitions = jsonencode([
    {
      name  = "app"
      image = "${var.ecr_repository_url}:${var.app_version}"

      portMappings = [
        {
          containerPort = 3000
          protocol      = "tcp"
        }
      ]

      environment = [
        {
          name  = "NODE_ENV"
          value = var.environment
        },
        {
          name  = "PORT"
          value = "3000"
        }
      ]

      secrets = [
        {
          name      = "DATABASE_URL"
          valueFrom = aws_secretsmanager_secret.db_connection.arn
        },
        {
          name      = "JWT_SECRET"
          valueFrom = aws_secretsmanager_secret.jwt_secret.arn
        }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.ecs.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "ecs"
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])

  tags = local.common_tags
}

# Application Load Balancer
resource "aws_lb" "main" {
  name               = "${var.project_name}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = module.vpc.public_subnets

  enable_deletion_protection = var.environment == "production"
  enable_http2               = true

  tags = local.common_tags
}

# ECS service
resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.app_count
  launch_type     = "FARGATE"

  network_configuration {
    security_groups  = [aws_security_group.ecs_tasks.id]
    subnets          = module.vpc.private_subnets
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 3000
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  depends_on = [
    aws_lb_listener.https,
    aws_iam_role_policy_attachment.ecs_task_execution_role
  ]

  tags = local.common_tags
}

# Auto scaling
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = var.max_capacity
  min_capacity       = var.min_capacity
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.project_name}-cpu-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# RDS database
resource "aws_db_instance" "postgres" {
  identifier     = "${var.project_name}-db"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = var.db_instance_type

  allocated_storage     = 100
  max_allocated_storage = 1000
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = var.db_name
  username = var.db_username
  password = random_password.db_password.result

  vpc_security_group_ids = [aws_security_group.rds.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  deletion_protection = var.environment == "production"
  skip_final_snapshot = var.environment != "production"

  performance_insights_enabled = true
  monitoring_interval          = 60
  monitoring_role_arn          = aws_iam_role.rds_monitoring.arn

  tags = local.common_tags
}
```

### 5. Monitoring and Observability
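A note on the ECS service defined above: a maximum percent of 200 and a minimum healthy percent of 100 bound how many tasks may run during a rolling deployment. Per the ECS deployment documentation, the minimum is rounded up and the maximum rounded down; a sketch of those bounds:

```javascript
// Bounds on running task count during an ECS rolling deployment:
// minimum healthy percent rounds up, maximum percent rounds down.
function deploymentBounds(desiredCount, minHealthyPercent, maxPercent) {
  return {
    minRunning: Math.ceil(desiredCount * minHealthyPercent / 100),
    maxRunning: Math.floor(desiredCount * maxPercent / 100),
  };
}
```

With a desired count of 4, the settings above keep all 4 tasks running while allowing up to 8 during the rollout, so new tasks start before old ones stop.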
#### Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alert rule files
rule_files:
  - "alerts/*.yml"

# AlertManager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape targets
scrape_configs:
  # Application metrics
  - job_name: 'app'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Node Exporter
  - job_name: 'node'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Database metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
    params:
      auth_module: [postgres]
```

#### Alert Rules
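A note on the `__address__` relabel rule in the scrape config above: Prometheus joins the source labels with `;` and fully anchors the regex, so `host[:port];annotation-port` is rewritten to `host:annotation-port`. Its semantics, sketched in JavaScript:

```javascript
// Reproduce the relabel rule:
//   regex: ([^:]+)(?::\d+)?;(\d+)   replacement: $1:$2
// Prometheus joins source_labels with ';' and anchors the regex at both ends.
function rewriteAddress(address, annotationPort) {
  const joined = `${address};${annotationPort}`;
  return joined.replace(/^([^:]+)(?::\d+)?;(\d+)$/, '$1:$2');
}
```

So a pod discovered at `10.0.0.7:9100` with annotation port `3000` is scraped at `10.0.0.7:3000` instead.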
```yaml
# alerts/app-alerts.yml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Service {{ $labels.service }} has an error rate of {{ $value | humanizePercentage }}"

      # High response time
      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Response time too high"
          description: "Service {{ $labels.service }} has a P95 response time of {{ $value }}s"

      # Pod memory usage
      - alert: PodMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{pod!="", container!=""}
            /
            container_spec_memory_limit_bytes{pod!="", container!=""}
          ) > 0.85
        for: 5m
        labels:
          severity: warning
          team: devops
        annotations:
          summary: "Pod memory usage too high"
          description: "Pod {{ $labels.pod }} is at {{ $value | humanizePercentage }} of its memory limit"

      # Database connection pool
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          pg_stat_database_numbackends{datname="production"}
          /
          pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Database connection pool nearly exhausted"
          description: "Database connection usage is {{ $value | humanizePercentage }}"
```

#### Grafana Dashboard Configuration
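A note on the `HighErrorRate` rule above: it divides the 5xx request rate by the total request rate and fires when the ratio stays above 5% for five minutes. The same ratio computed over plain per-status rates, sketched in JavaScript:

```javascript
// Compute the 5xx error ratio from per-status request rates (req/s),
// mirroring the HighErrorRate expression. The /^5\d\d$/ check approximates
// PromQL's status=~"5.." matcher for numeric status codes.
function errorRatio(ratesByStatus) {
  let total = 0;
  let errors = 0;
  for (const [status, rate] of Object.entries(ratesByStatus)) {
    total += rate;
    if (/^5\d\d$/.test(status)) errors += rate;
  }
  return total === 0 ? 0 : errors / total;
}

// Apply the rule's 5% threshold.
function highErrorRate(ratesByStatus, threshold = 0.05) {
  return errorRatio(ratesByStatus) > threshold;
}
```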
```json
{
  "dashboard": {
    "title": "Application Monitoring Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))",
            "legendFormat": "{{ service }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Pod Resource Usage",
        "targets": [
          {
            "expr": "sum(container_memory_working_set_bytes{pod=~\"app-.*\"}) by (pod)",
            "legendFormat": "{{ pod }} - Memory"
          },
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{pod=~\"app-.*\"}[5m])) by (pod)",
            "legendFormat": "{{ pod }} - CPU"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

## Usage Tips
### 1. State Deployment Requirements Clearly

```bash
# A concrete deployment scenario
"Deploy the React app to AWS ECS with auto scaling and HTTPS"

# Specify the tech stack
"Deploy the Node.js app to Kubernetes using GitLab CI"
```

### 2. Provide Environment Information
```bash
# Environment specifications
"Production needs high availability with at least 3 instances"

# Existing infrastructure
"An AWS VPC and RDS database already exist; deploy the application tier"
```

### 3. Security and Compliance Requirements
```bash
# Security requirements
"Must be SOC2 compliant; all data must be encrypted"

# Access control
"Implement role-based access control and audit logging"
```

## Best Practices
### 1. Security First

- Principle of least privilege
- Manage secrets with a dedicated service
- Regular security scans
- Encryption in transit and at rest

### 2. Automate Everything

- Infrastructure as code
- Automated testing and deployment
- Automated monitoring and alerting
- Automated failure recovery

### 3. Observability

- Comprehensive log collection
- Detailed metrics monitoring
- Distributed tracing
- Error tracking and analysis

### 4. High-Availability Design

- Multi-AZ deployment
- Automatic failover
- Regular disaster-recovery drills
- Rollback mechanisms

## Related Resources

DevOps engineer: making deployment silky smooth.