Project: {PROJECT_NAME}
Last updated: {DATE}

Application metrics:

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| Uptime | ≥ 99.9% | < 99% | < 95% | systemctl status {PROJECT_NAME} |
| Response time | < 500ms | 500-1000ms | > 1000ms | /health endpoint |
| Error rate | < 1% | 1-5% | > 5% | Logs (errors/hour) |
| Requests/sec | - | - | - | Access logs |
| Active users | - | - | - | Application logs |
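
The Normal / Warning / Critical columns can be turned into a small classifier. The sketch below is illustrative only: the function name `classify` and its `higher_is_worse` parameter are assumptions, not part of the project, and the warning boundary itself is treated as a warning.

```python
def classify(value, warning, critical, higher_is_worse=True):
    """Return "ok", "warning" or "critical" for a metric value.

    `warning` is where the warning band starts, `critical` is the value
    beyond which the metric is critical (hypothetical helper).
    """
    if not higher_is_worse:
        # Flip the sign so the same comparisons work when low values are bad.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Response time in ms: < 500 normal, 500-1000 warning, > 1000 critical
    print(classify(450, warning=500, critical=1000))
    # Error rate in %: < 1 normal, 1-5 warning, > 5 critical
    print(classify(7.5, warning=1, critical=5))
```

Passing `higher_is_worse=False` covers metrics where a drop is the problem, for example uptime.
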

Server metrics:

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| CPU | < 70% | 70-85% | > 85% | top |
| Memory | < 80% | 80-90% | > 90% | free -h |
| Disk | < 80% | 80-90% | > 90% | df -h |
| Load average | < 1x cores | 1-2x cores | > 2x cores | uptime |
| Disk I/O | < 80% | 80-95% | > 95% | iostat -x (%util) |
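
The load average thresholds are relative to the number of CPU cores, which is easy to get wrong by eye. A minimal sketch of that rule, assuming a Unix host (`os.getloadavg` is not available on Windows); `load_status` is a hypothetical helper name:

```python
import os


def load_status(load_1min, cores):
    """Classify the 1-minute load average relative to the core count."""
    if load_1min > 2 * cores:
        return "critical"
    if load_1min >= cores:
        return "warning"
    return "ok"


if __name__ == "__main__":
    cores = os.cpu_count() or 1          # cpu_count() may return None
    load_1min = os.getloadavg()[0]       # same first number that `uptime` prints
    print(f"load {load_1min:.2f} on {cores} cores: {load_status(load_1min, cores)}")
```
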

Database metrics (PostgreSQL):

| Metric | Normal | Warning | Critical | How to check |
|---|---|---|---|---|
| Connections | < 50 | 50-80 | > 80 | pg_stat_activity |
| Query time | < 100ms | 100-500ms | > 500ms | pg_stat_statements |
| DB size | - | Growing fast | > 90% disk | pg_database_size |
| Locks | 0 | 1-5 | > 5 | pg_locks |
| Cache hit ratio | > 99% | 95-99% | < 95% | pg_stat_database |
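
The cache hit ratio is not a column in `pg_stat_database`; it has to be derived from the `blks_hit` and `blks_read` counters. The sketch below shows one way to do that. The query string and both function names are illustrative assumptions, and running the query is left to whatever database client the project uses.

```python
# Query whose two result columns feed cache_hit_ratio() (illustrative).
CACHE_HIT_SQL = """
SELECT sum(blks_hit) AS hits, sum(blks_read) AS reads
FROM pg_stat_database
"""


def cache_hit_ratio(blks_hit, blks_read):
    """Return the share of block requests served from cache, in percent."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no traffic yet, ratio is undefined
    return 100.0 * blks_hit / total


def cache_status(ratio):
    """Apply the table thresholds: > 99 normal, 95-99 warning, < 95 critical."""
    if ratio is None:
        return "unknown"
    if ratio < 95:
        return "critical"
    if ratio <= 99:
        return "warning"
    return "ok"
```
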
# app.py or routes/health.py
import shutil

import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from sqlalchemy import create_engine, text

app = FastAPI()

# Assumed setup: replace both placeholders with the project's real connection settings.
engine = create_engine("{DATABASE_URL}")
redis_client = redis.Redis.from_url("{REDIS_URL}")


@app.get("/health")
def health_check():
    """
    Health check endpoint.

    Returns:
    - 200 OK: everything works
    - 503 Service Unavailable: at least one check failed
    """
    health = {
        "status": "ok",
        "version": "1.0.0",
        "checks": {}
    }

    # 1. Database check
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        health["checks"]["database"] = "ok"
    except Exception as e:
        health["checks"]["database"] = f"error: {e}"
        health["status"] = "degraded"

    # 2. Redis check (if used)
    try:
        redis_client.ping()
        health["checks"]["redis"] = "ok"
    except Exception as e:
        health["checks"]["redis"] = f"error: {e}"
        health["status"] = "degraded"

    # 3. Disk check
    disk_usage = shutil.disk_usage("/")
    disk_percent = (disk_usage.used / disk_usage.total) * 100
    health["checks"]["disk"] = f"{disk_percent:.1f}% used"
    if disk_percent > 90:
        health["status"] = "critical"

    # Return 503 with the same JSON body if anything failed
    if health["status"] != "ok":
        return JSONResponse(status_code=503, content=health)
    return health
Check:
curl http://localhost:{PORT}/health
# Expected:
{
"status": "ok",
"version": "1.0.0",
"checks": {
"database": "ok",
"redis": "ok",
"disk": "45.2% used"
}
}
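
A script that consumes this response usually needs two things: whether the service is healthy, and which checks failed. The sketch below parses a response body of the shape shown above. `summarize_health` is a hypothetical helper; it also accepts a payload nested under `detail`, which is what FastAPI produces when the endpoint raises `HTTPException` with a dict.

```python
import json


def summarize_health(status_code, body):
    """Return (healthy, failed_checks) for a /health response."""
    payload = json.loads(body)
    # HTTPException(detail=...) wraps the payload in a "detail" key.
    if isinstance(payload.get("detail"), dict):
        payload = payload["detail"]
    failed = [
        name
        for name, result in payload.get("checks", {}).items()
        if str(result).startswith("error")
    ]
    healthy = status_code == 200 and payload.get("status") == "ok"
    return healthy, failed


if __name__ == "__main__":
    sample = '{"status": "ok", "version": "1.0.0", "checks": {"database": "ok", "redis": "ok", "disk": "45.2% used"}}'
    print(summarize_health(200, sample))
```
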
UptimeRobot / Pingdom / StatusCake:
URL: https://{DOMAIN}/health
Interval: 5 minutes
Alert if: Down for 5 minutes
Alert actions, from most to least severe:
Action: call the on-call engineer + SMS
Action: email + Telegram
Action: log only
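
The "Down for 5 minutes" rule means a single failed probe should not page anyone; only continuous downtime should. External services implement this themselves, but for a self-hosted check the logic might look like the sketch below. The class name and its interface are assumptions made for illustration.

```python
class DowntimeTracker:
    """Signal an alert only after the service has been down continuously."""

    def __init__(self, threshold_seconds=300):
        self.threshold_seconds = threshold_seconds
        self.down_since = None  # timestamp of the first failed probe in a row

    def observe(self, is_up, now):
        """Record one probe result; return True when an alert is due."""
        if is_up:
            self.down_since = None  # any successful probe resets the timer
            return False
        if self.down_since is None:
            self.down_since = now
        return now - self.down_since >= self.threshold_seconds


if __name__ == "__main__":
    tracker = DowntimeTracker()
    for second, up in [(0, True), (60, False), (240, False), (360, False)]:
        print(second, tracker.observe(up, second))
```
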
#!/bin/bash
# scripts/check-metrics.sh
# Thresholds follow the Critical column of the metrics tables.

# CPU: 100 minus the idle percentage reported by top
# (LC_ALL=C keeps "." as the decimal separator)
CPU=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
if (( $(echo "$CPU > 85" | bc -l) )); then
    echo "🔴 CRITICAL: CPU $CPU%"
    # Send an alert
fi

# Memory: used / total
MEM=$(LC_ALL=C free | awk '/^Mem:/ {printf "%.1f", $3 / $2 * 100}')
if (( $(echo "$MEM > 90" | bc -l) )); then
    echo "🔴 CRITICAL: Memory $MEM%"
fi

# Disk (-P keeps each filesystem on a single line)
DISK=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK" -gt 90 ]; then
    echo "🔴 CRITICAL: Disk $DISK%"
fi

# Application health (curl reports 000 when the connection fails or times out)
HTTP_CODE=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" http://localhost:{PORT}/health)
if [ "$HTTP_CODE" -ne 200 ]; then
    echo "🔴 CRITICAL: Application health check failed (HTTP $HTTP_CODE)"
fi
Cron:
*/5 * * * * /opt/{PROJECT_NAME}/scripts/check-metrics.sh | logger -t metrics
#!/bin/bash
# scripts/telegram-alert.sh
BOT_TOKEN="{YOUR_BOT_TOKEN}"
CHAT_ID="{YOUR_CHAT_ID}"
MESSAGE="$1"
# --data-urlencode keeps characters such as "&" and "+" in the message intact
curl -s -X POST "https://api.telegram.org/bot$BOT_TOKEN/sendMessage" \
    -d chat_id="$CHAT_ID" \
    --data-urlencode text="$MESSAGE" \
    -d parse_mode="HTML"
Usage:
./scripts/telegram-alert.sh "🔴 <b>CRITICAL:</b> {PROJECT_NAME} is down!"
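
If alerts are sent from the application rather than from a shell script, the same `sendMessage` call can be made with the Python standard library. This is a sketch under that assumption; the two function names are hypothetical, and building the request is kept separate from sending it so the encoding can be checked without network access.

```python
import urllib.parse
import urllib.request


def build_telegram_request(bot_token, chat_id, message):
    """Build the POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode(
        {"chat_id": chat_id, "text": message, "parse_mode": "HTML"}
    ).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send_telegram_alert(bot_token, chat_id, message, timeout=10):
    """Send the alert; return True on HTTP 200. Raises on network errors."""
    request = build_telegram_request(bot_token, chat_id, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```
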
import logging

# Configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/{PROJECT_NAME}/app.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Usage
logger.debug("Detailed info for debugging")  # not emitted while level=logging.INFO
logger.info("Order #12345 created")
logger.warning("Stock low for SKU-001")
logger.error("Failed to connect to Ozon API")
logger.critical("Database connection lost!")
# /etc/logrotate.d/{PROJECT_NAME}
/var/log/{PROJECT_NAME}/*.log {
    daily
    rotate 30
    compress
    delaycompress
    notifempty
    create 0640 {USER} {GROUP}
    sharedscripts
    postrotate
        systemctl reload {PROJECT_NAME} > /dev/null 2>&1 || true
    endscript
}
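
The `postrotate` reload exists so the application reopens its log file after logrotate moves it. If the application logs through Python's `logging` module on a Unix host, `WatchedFileHandler` is one alternative: it notices that the file was moved and reopens the path on the next write. A minimal sketch, with `make_rotation_safe_logger` as a hypothetical helper name:

```python
import logging
from logging.handlers import WatchedFileHandler


def make_rotation_safe_logger(name, path):
    """Return a logger whose file handler survives external log rotation."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # WatchedFileHandler checks the file's device/inode before each write
    # and reopens the path if logrotate has moved the old file away.
    handler = WatchedFileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

This handler is intended for Unix systems; the Python documentation notes it is not appropriate on Windows.
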
Syslog:
from logging.handlers import SysLogHandler
syslog = SysLogHandler(address='/dev/log')
logger.addHandler(syslog)
Graylog / ELK:
from pygelf import GelfUdpHandler
gelf_handler = GelfUdpHandler(host='graylog-server', port=12201)
logger.addHandler(gelf_handler)
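
Centralized log systems are easier to query when each record arrives as structured fields rather than one formatted string. The sketch below is an illustrative formatter, not something the handlers above require: `JsonFormatter` is a hypothetical class name and the field names are arbitrary choices.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text when the record carries an exception.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    demo = logging.getLogger("json_demo")
    demo.addHandler(handler)
    demo.warning("Stock low for %s", "SKU-001")
```
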
<!-- dashboard.html -->
<!DOCTYPE html>
<html>
<head>
<title>{PROJECT_NAME} Monitoring</title>
<meta http-equiv="refresh" content="30">
<style>
.metric { display: inline-block; margin: 20px; padding: 15px; border: 1px solid #ccc; }
.ok { background: #d4edda; }
.warning { background: #fff3cd; }
.critical { background: #f8d7da; }
</style>
</head>
<body>
<h1>{PROJECT_NAME} Status</h1>
<div id="status" class="metric">
<h3>Application</h3>
<p id="app-status">Loading...</p>
</div>
<div id="cpu" class="metric">
<h3>CPU</h3>
<p id="cpu-value">Loading...</p>
</div>
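<script>
// Fetch the /health endpoint; a 503 response still carries a JSON body
fetch('/health')
    .then(r => r.json())
    .then(data => {
        // Accept both a flat payload and one nested under "detail" (FastAPI HTTPException)
        const health = data.detail || data;
        const css = { ok: 'ok', degraded: 'warning', critical: 'critical' };
        document.getElementById('app-status').textContent = health.status;
        document.getElementById('status').classList.add(css[health.status] || 'critical');
    })
    .catch(() => {
        document.getElementById('app-status').textContent = 'unreachable';
        document.getElementById('status').classList.add('critical');
    });
// Fetch the /metrics endpoint (expected to return JSON with a numeric "cpu" field)
fetch('/metrics')
    .then(r => r.json())
    .then(data => {
        document.getElementById('cpu-value').textContent = data.cpu + '%';
    })
    .catch(() => {
        document.getElementById('cpu-value').textContent = 'unavailable';
    });
</script>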
</body>
</html>
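
The dashboard reads a `cpu` field from `/metrics`, but this document does not define that endpoint. As one possible way to produce the number on a Linux host, the sketch below computes CPU utilisation from two samples of the aggregate `cpu` line in `/proc/stat`. Both function names are hypothetical, and wiring the result into an endpoint is left out.

```python
def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into (idle, total) ticks."""
    # Fields: user nice system idle iowait irq softirq steal (guest is part of user)
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + fields[4]  # idle + iowait
    return idle, sum(fields)


def cpu_percent(prev_line, curr_line):
    """Return CPU utilisation in percent between two /proc/stat samples."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    idle, total = parse_cpu_line(curr_line)
    delta_total = total - prev_total
    if delta_total <= 0:
        return 0.0  # samples taken too close together
    return round(100.0 * (1 - (idle - prev_idle) / delta_total), 1)


if __name__ == "__main__":
    import time

    with open("/proc/stat") as f:
        first = f.readline()
    time.sleep(1)
    with open("/proc/stat") as f:
        second = f.readline()
    print({"cpu": cpu_percent(first, second)})
```
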
# grafana-dashboard.json
{
  "dashboard": {
    "title": "{PROJECT_NAME} Metrics",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [{
          "expr": "node_cpu_usage"
        }]
      },
      {
        "title": "Memory Usage",
        "targets": [{
          "expr": "node_memory_usage"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "http_request_duration_seconds"
        }]
      }
    ]
  }
}
Monitoring owner: {NAME}