architect/_archive/2025-11-cleanup/templates/MONITORING.template.md

📊 MONITORING - Monitoring and Alerts

Project: {PROJECT_NAME}
Last updated: {ДАТА}


📑 CONTENTS

  1. Key metrics
  2. Health checks
  3. Alerts
  4. Logging
  5. Dashboards
  6. Tools

📈 KEY METRICS {#ключевые-метрики}

Application metrics

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| Uptime | 99.9% | < 99% | < 95% | systemctl status {PROJECT_NAME} |
| Response time | < 500ms | 500-1000ms | > 1000ms | /health endpoint |
| Error rate | < 1% | 1-5% | > 5% | Logs: errors/hour |
| Requests/sec | - | - | - | Access logs |
| Active users | - | - | - | Application logs |

System metrics

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| CPU | < 70% | 70-85% | > 85% | top |
| Memory | < 80% | 80-90% | > 90% | free -h |
| Disk | < 80% | 80-90% | > 90% | df -h |
| Load average | < cores | 1x-2x cores | > 2x cores | uptime |
| Disk I/O | < 80% | 80-95% | > 95% | iostat |
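
The thresholds in the system metrics table can be expressed as a small classifier. The following is a minimal sketch, not part of the template itself: the names `THRESHOLDS`, `classify`, `classify_load` and `disk_percent` are hypothetical, and the boundary handling (a value exactly on the critical threshold still counts as a warning) is an assumption read off the table.

```python
import os
import shutil

# (warning, critical) thresholds in percent, copied from the system metrics table.
THRESHOLDS = {
    "cpu": (70.0, 85.0),
    "memory": (80.0, 90.0),
    "disk": (80.0, 90.0),
}


def classify(value: float, warning: float, critical: float) -> str:
    """Map a 'higher is worse' metric onto ok / warning / critical."""
    if value > critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def classify_load(load: float, cores: int) -> str:
    """Load average is judged relative to the number of CPU cores."""
    return classify(load, warning=float(cores), critical=2.0 * cores)


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    load_1min = os.getloadavg()[0]  # available on Unix only
    print("load:", classify_load(load_1min, cores))
    print("disk:", classify(disk_percent(), *THRESHOLDS["disk"]))
```

Keeping the thresholds in one dictionary means the table and the checks can be updated together.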

Database metrics

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| Connections | < 50 | 50-80 | > 80 | pg_stat_activity |
| Query time | < 100ms | 100-500ms | > 500ms | pg_stat_statements |
| DB size | - | Growing fast | > 90% disk | pg_database_size |
| Locks | 0 | 1-5 | > 5 | pg_locks |
| Cache hit ratio | > 99% | 95-99% | < 95% | pg_stat_database |
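
The cache hit ratio is not stored directly in `pg_stat_database`; it is derived from the `blks_hit` and `blks_read` counters. A minimal sketch of that arithmetic follows. The function names are hypothetical, and treating "no reads yet" as 100% is an assumption made here so that an idle database does not raise an alert.

```python
# The counters can be read with, for example:
#   SELECT blks_hit, blks_read FROM pg_stat_database
#   WHERE datname = current_database();


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Share of block requests served from shared buffers, in percent."""
    total = blks_hit + blks_read
    if total == 0:
        return 100.0  # assumption: no block reads yet counts as healthy
    return 100.0 * blks_hit / total


def classify_cache_hit(ratio: float) -> str:
    """Apply the table's thresholds; here a lower value is worse."""
    if ratio < 95.0:
        return "critical"
    if ratio <= 99.0:
        return "warning"
    return "ok"
```

For example, 990 buffer hits and 10 disk reads give a ratio of 99%, which falls in the warning band.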

💚 HEALTH CHECKS {#health-checks}

HTTP Health Endpoint

# app.py or routes/health.py

import shutil

import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from sqlalchemy import create_engine, text

app = FastAPI()

# Assumption: {DATABASE_URL} and {REDIS_URL} are placeholders for the
# project's own connection settings.
engine = create_engine("{DATABASE_URL}")
redis_client = redis.Redis.from_url("{REDIS_URL}")

@app.get("/health")
def health_check():
    """
    Health check endpoint.

    Returns:
    - 200 OK: everything works
    - 503 Service Unavailable: at least one check failed
    """
    health = {
        "status": "ok",
        "version": "1.0.0",
        "checks": {}
    }

    # 1. Database check
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        health["checks"]["database"] = "ok"
    except Exception as e:
        health["checks"]["database"] = f"error: {e}"
        health["status"] = "degraded"

    # 2. Redis check (if used)
    try:
        redis_client.ping()
        health["checks"]["redis"] = "ok"
    except Exception as e:
        health["checks"]["redis"] = f"error: {e}"
        health["status"] = "degraded"

    # 3. Disk check
    disk_usage = shutil.disk_usage("/")
    disk_percent = (disk_usage.used / disk_usage.total) * 100
    health["checks"]["disk"] = f"{disk_percent:.1f}% used"

    if disk_percent > 90:
        health["status"] = "critical"

    # Return 503 with the full report if anything is wrong.
    # JSONResponse keeps the body in the same shape as the 200 response.
    if health["status"] != "ok":
        return JSONResponse(status_code=503, content=health)

    return health

Check:

curl http://localhost:{PORT}/health

# Expected:
{
  "status": "ok",
  "version": "1.0.0",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "disk": "45.2% used"
  }
}

External monitoring (uptime checks)

UptimeRobot / Pingdom / StatusCake:

URL: https://{DOMAIN}/health
Interval: 5 minutes
Alert if: Down for 5 minutes
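
The "down for 5 minutes" rule can be reproduced in a self-hosted probe by tracking when the first failed check occurred. The sketch below is an illustration only and does not describe how UptimeRobot, Pingdom or StatusCake work internally; the class name `DowntimeTracker` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DowntimeTracker:
    """Fires one alert once the service has been down long enough."""

    alert_after_seconds: float = 300.0  # "Down for 5 minutes"
    down_since: Optional[float] = None
    alerted: bool = False

    def record(self, is_up: bool, now: float) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if is_up:
            # Recovery resets the state so a later outage alerts again.
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.alert_after_seconds:
            self.alerted = True
            return True
        return False
```

With a 5-minute probe interval, the first failed probe only starts the clock; the alert fires on the next failed probe, and repeated failures after that stay silent until the service recovers.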

🚨 ALERTS {#алерты}

Alert levels

🔴 CRITICAL (immediate response)

Action: Phone call to the on-call engineer + SMS

🟡 WARNING (within an hour)

Action: Email + Telegram

🔵 INFO (monitor)

Action: Log only
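
The three levels map naturally onto a routing table. A minimal sketch, assuming the channel names (`phone_call`, `sms`, `email`, `telegram`, `log`) are placeholders for whatever senders the project actually has:

```python
from enum import Enum
from typing import Dict, List


class Level(Enum):
    CRITICAL = "critical"  # immediate response
    WARNING = "warning"    # within an hour
    INFO = "info"          # monitor only


# Channel names are hypothetical labels, not real integrations.
ROUTES: Dict[Level, List[str]] = {
    Level.CRITICAL: ["phone_call", "sms"],
    Level.WARNING: ["email", "telegram"],
    Level.INFO: ["log"],
}


def channels_for(level: Level) -> List[str]:
    """Return the notification channels for an alert level."""
    return ROUTES[level]
```

Keeping the routing in data rather than in `if` chains makes it easy to review who gets notified for what.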

Metrics check script

#!/bin/bash
# scripts/check-metrics.sh

# CPU: 100 minus the idle percentage reported by top
CPU=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
if (( $(echo "$CPU > 85" | bc -l) )); then
    echo "🔴 CRITICAL: CPU $CPU%"
    # Send alert
fi

# Memory: used / total, in percent
MEM=$(free | awk '/^Mem:/ {printf "%.1f", ($3/$2) * 100.0}')
if (( $(echo "$MEM > 90" | bc -l) )); then
    echo "🔴 CRITICAL: Memory $MEM%"
fi

# Disk: usage of the root filesystem (-P keeps each entry on one line)
DISK=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK" -gt 90 ]; then
    echo "🔴 CRITICAL: Disk $DISK%"
fi

# Application health (--max-time stops a hung app from blocking the cron job)
HTTP_CODE=$(curl -s --max-time 10 -o /dev/null -w "%{http_code}" http://localhost:{PORT}/health)
if [ "$HTTP_CODE" -ne 200 ]; then
    echo "🔴 CRITICAL: Application health check failed (HTTP $HTTP_CODE)"
fi

Cron:

*/5 * * * * /opt/{PROJECT_NAME}/scripts/check-metrics.sh | logger -t metrics

Telegram alerts

#!/bin/bash
# scripts/telegram-alert.sh

BOT_TOKEN="{YOUR_BOT_TOKEN}"
CHAT_ID="{YOUR_CHAT_ID}"
MESSAGE="$1"

# --data-urlencode keeps characters such as & and + in the message intact
curl -s -X POST "https://api.telegram.org/bot$BOT_TOKEN/sendMessage" \
  -d chat_id="$CHAT_ID" \
  --data-urlencode text="$MESSAGE" \
  -d parse_mode="HTML"

Usage:

./scripts/telegram-alert.sh "🔴 <b>CRITICAL:</b> {PROJECT_NAME} is down!"
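
If alerts are sent from the application rather than from cron, the same `sendMessage` call can be made from Python with the standard library. This is a sketch under assumptions: the function names `build_request` and `send_alert` are hypothetical, and error handling and retries are left out.

```python
import urllib.parse
import urllib.request

API_URL = "https://api.telegram.org/bot{token}/sendMessage"


def build_request(token: str, chat_id: str, message: str) -> urllib.request.Request:
    """Build the POST request; urlencode escapes &, + and non-ASCII text."""
    data = urllib.parse.urlencode(
        {"chat_id": chat_id, "text": message, "parse_mode": "HTML"}
    ).encode("utf-8")
    return urllib.request.Request(API_URL.format(token=token), data=data, method="POST")


def send_alert(token: str, chat_id: str, message: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_request(token, chat_id, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the encoding logic checkable without network access.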

📝 LOGGING {#логирование}

Log levels

import logging

# Configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/{PROJECT_NAME}/app.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

# Usage
logger.debug("Detailed info for debugging")
logger.info("Order #12345 created")
logger.warning("Stock low for SKU-001")
logger.error("Failed to connect to Ozon API")
logger.critical("Database connection lost!")

Log rotation

# /etc/logrotate.d/{PROJECT_NAME}

/var/log/{PROJECT_NAME}/*.log {
    daily
    rotate 30
    compress
    delaycompress
    notifempty
    create 0640 {USER} {GROUP}
    sharedscripts
    postrotate
        systemctl reload {PROJECT_NAME} > /dev/null 2>&1 || true
    endscript
}
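
One caveat with this setup: after logrotate renames `app.log` and creates a fresh file, a plain `logging.FileHandler` keeps writing to the old file unless the service reopens its logs on reload. On Unix, the standard library's `logging.handlers.WatchedFileHandler` detects that the file was replaced and reopens it on the next write. A minimal sketch, where the helper name `build_logger` is hypothetical:

```python
import logging
from logging.handlers import WatchedFileHandler

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def build_logger(name: str, path: str) -> logging.Logger:
    """Create a logger whose file handler survives external log rotation."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # WatchedFileHandler checks the file before each write and reopens
    # the path if the file was moved or deleted.
    handler = WatchedFileHandler(path)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger
```

With this handler the `postrotate` reload becomes a safety net rather than a requirement.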

Centralized logs

Syslog:

from logging.handlers import SysLogHandler

syslog = SysLogHandler(address='/dev/log')
logger.addHandler(syslog)

Graylog / ELK:

from pygelf import GelfUdpHandler

gelf_handler = GelfUdpHandler(host='graylog-server', port=12201)
logger.addHandler(gelf_handler)

📊 DASHBOARDS {#dashboards}

Simple HTML Dashboard

<!-- dashboard.html -->
<!DOCTYPE html>
<html>
<head>
    <title>{PROJECT_NAME} Monitoring</title>
    <meta http-equiv="refresh" content="30">
    <style>
        .metric { display: inline-block; margin: 20px; padding: 15px; border: 1px solid #ccc; }
        .ok { background: #d4edda; }
        .warning { background: #fff3cd; }
        .critical { background: #f8d7da; }
    </style>
</head>
<body>
    <h1>{PROJECT_NAME} Status</h1>

    <div id="status" class="metric">
        <h3>Application</h3>
        <p id="app-status">Loading...</p>
    </div>

    <div id="cpu" class="metric">
        <h3>CPU</h3>
        <p id="cpu-value">Loading...</p>
    </div>

    <script>
        // Fetch /health endpoint
        fetch('/health')
            .then(r => r.json())
            .then(data => {
                document.getElementById('app-status').textContent = data.status;
            });

        // Fetch /metrics endpoint
        fetch('/metrics')
            .then(r => r.json())
            .then(data => {
                document.getElementById('cpu-value').textContent = data.cpu + '%';
            });
    </script>
</body>
</html>
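
The dashboard reads `data.cpu` from a `/metrics` endpoint that this template does not define. A minimal, Linux-only sketch of how that value could be computed from `/proc/stat` is shown below. All function names are hypothetical, and the JSON shape `{"cpu": ...}` is inferred from the dashboard script rather than specified anywhere.

```python
import time
from typing import Dict, Tuple


def read_cpu_times(path: str = "/proc/stat") -> Tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line."""
    with open(path) as f:
        fields = f.readline().split()
    # Fields after the label: user nice system idle iowait irq softirq steal ...
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # idle + iowait
    return idle, sum(values)


def cpu_percent(prev: Tuple[int, int], cur: Tuple[int, int]) -> float:
    """CPU utilisation between two samples, in percent."""
    idle_delta = cur[0] - prev[0]
    total_delta = cur[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return round(100.0 * (total_delta - idle_delta) / total_delta, 1)


def collect_metrics(interval: float = 0.5) -> Dict[str, float]:
    """Sample /proc/stat twice and return the payload the dashboard expects."""
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    return {"cpu": cpu_percent(before, after)}
```

In the FastAPI app this could be exposed as a route that returns `collect_metrics()`. If Prometheus is added later, note that it conventionally scrapes `/metrics` in its own text format, so the JSON endpoint may need a different path.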

Grafana Dashboard (if used)

# grafana-dashboard.json
{
  "dashboard": {
    "title": "{PROJECT_NAME} Metrics",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [{
          "expr": "100 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100"
        }]
      },
      {
        "title": "Memory Usage",
        "targets": [{
          "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "http_request_duration_seconds"
        }]
      }
    ]
  }
}

🛠️ TOOLS {#инструменты}

Minimal stack (free)

  1. Uptime monitoring: UptimeRobot (free for up to 50 monitors)
  2. Logs: systemd journal + logrotate
  3. Alerts: Telegram bot
  4. Metrics: simple bash scripts + cron

Mid-range stack

  1. Metrics: Prometheus + Node Exporter
  2. Visualization: Grafana
  3. Logs: Loki или ELK stack
  4. Alerting: Prometheus Alertmanager
  5. Uptime: StatusCake

Enterprise stack

  1. APM: New Relic / Datadog
  2. Logs: Splunk / Datadog
  3. Tracing: Jaeger
  4. Incident management: PagerDuty / Opsgenie

📋 MONITORING CHECKLIST

Setup (one-time)

Daily

Weekly

Monthly


Last updated: {ДАТА}
Monitoring owner: {ИМЯ}