infra: logging and monitoring setup

This commit is contained in:
Claude Code
2026-03-16 17:32:28 +01:00
parent c67561e047
commit a7a541aac5
2 changed files with 479 additions and 0 deletions


@@ -0,0 +1,24 @@
services:
  n8n:
    environment:
      - N8N_LOG_LEVEL=debug
      - N8N_LOG_OUTPUT=stdout
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
  postgres:
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
  milvus:
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
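
The rotation settings above also cap log disk usage. A quick sanity check of the worst case, assuming all three services keep their files fully rotated (the `log_cap_mb` helper name is just for illustration):

```shell
#!/bin/sh
# Upper bound on disk used by rotated json-file logs:
# max-size (MB) x max-file x number of services with this config.
log_cap_mb() {
  # $1 = max-size in MB, $2 = max-file, $3 = service count
  echo $(( $1 * $2 * $3 ))
}
# 100 MB x 10 files x 3 services (n8n, postgres, milvus)
echo "worst-case log usage: $(log_cap_mb 100 10 3) MB"
```

So plan for roughly 3 GB of log storage with this configuration.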

docs/MONITORING.md Normal file

@@ -0,0 +1,455 @@
# Monitoring & Logging Setup
## Overview
This document provides comprehensive monitoring and logging guidelines for the n8n AI Support Automation system. It includes key metrics, troubleshooting procedures, and log inspection commands.
## Key Metrics
### 1. Mail Processing Rate (Workflow A)
**Description:** Track the number of conversations processed through the system.
**N8N Logs:**
```bash
docker-compose logs -f n8n | grep "processed"
```
**PostgreSQL Query:**
```sql
SELECT COUNT(*) as total_executions,
COUNT(CASE WHEN status = 'success' THEN 1 END) as successful_executions,
ROUND(100.0 * COUNT(CASE WHEN status = 'success' THEN 1 END) / COUNT(*), 2) as success_rate
FROM workflow_executions
WHERE workflow_name = 'workflow-a';
```
**Expected Behavior:**
- Consistent processing rate (depends on Freescout mail polling interval)
- Success rate > 95%
- Monitor for sudden drops in processing rate
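The 95% success-rate target can be turned into a quick shell check; a minimal sketch, assuming the two counts come from the PostgreSQL query above (the `success_rate` helper name and sample numbers are illustrative):

```shell
#!/bin/sh
# Sketch: compare successful vs total executions against the 95% target.
success_rate() {
  ok=$1; total=$2
  rate=$(( 100 * ok / total ))   # integer percent is enough for alerting
  if [ "$rate" -ge 95 ]; then
    echo "${rate}% OK"
  else
    echo "${rate}% BELOW TARGET"
  fi
}
success_rate 97 100
success_rate 88 100
```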
---
### 2. Approval Rate (Workflow B)
**Description:** Monitor the ratio of approved vs rejected KB updates from the AI suggestions.
**PostgreSQL Query:**
```sql
SELECT status, COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM knowledge_base_updates
GROUP BY status
ORDER BY count DESC;
```
**Alternative Query for detailed breakdown:**
```sql
SELECT
status,
COUNT(*) as count,
AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_approval_time_seconds
FROM knowledge_base_updates
GROUP BY status;
```
**Expected Behavior:**
- Majority of updates should be APPROVED (typically 70-90%)
- REJECTED rate should be < 15%
- PENDING updates should be resolved within 24 hours
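The 24-hour expectation for PENDING updates can also be checked mechanically; a sketch, assuming the age in seconds of the oldest PENDING row is supplied (e.g. via an `EXTRACT(EPOCH FROM (NOW() - created_at))` expression — the `pending_status` helper and sample values are hypothetical):

```shell
#!/bin/sh
# Sketch: flag PENDING KB updates older than the 24-hour expectation.
# $1 = age of the oldest PENDING row in seconds.
pending_status() {
  if [ "$1" -gt $(( 24 * 3600 )) ]; then
    echo "OVERDUE"
  else
    echo "OK"
  fi
}
pending_status 7200
pending_status 100000
```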
---
### 3. KB Growth (Workflow C)
**Description:** Track the growth of the knowledge base as new information is added.
**Milvus Query:**
```bash
# First, connect to Milvus
docker-compose exec milvus python3 -c "
from pymilvus import connections, Collection
connections.connect('default', host='localhost', port=19530)
collection = Collection('knowledge_base')
print(f'Total vectors: {collection.num_entities}')
"
```
**PostgreSQL Query for tracking:**
```sql
SELECT COUNT(*) as total_entries,
COUNT(DISTINCT source) as unique_sources,
MAX(created_at) as latest_entry
FROM knowledge_base
WHERE status = 'approved';
```
**Daily Growth Query:**
```sql
SELECT DATE(created_at) as date, COUNT(*) as entries_added
FROM knowledge_base
WHERE status = 'approved'
GROUP BY DATE(created_at)
ORDER BY date DESC
LIMIT 30;
```
**Expected Behavior:**
- +1 vector per approved ticket (approximately)
- Steady growth correlates with approved KB updates
- Monitor for stalled growth (may indicate Milvus issues)
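A stall check can compare two snapshots of the Milvus entity count; a minimal sketch (the counts are placeholders — in practice capture the `num_entities` output above at two points in time):

```shell
#!/bin/sh
# Sketch: detect stalled KB growth from two entity-count snapshots.
check_growth() {
  previous=$1; current=$2
  if [ "$current" -le "$previous" ]; then
    echo "STALLED"
  else
    echo "GROWING +$(( current - previous ))"
  fi
}
check_growth 1200 1200
check_growth 1200 1207
```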
---
### 4. Error Rate
**Description:** Monitor workflow execution errors across all workflows.
**PostgreSQL Query - Overall Error Rate:**
```sql
SELECT
COUNT(*) as total_executions,
COUNT(CASE WHEN status = 'ERROR' THEN 1 END) as error_count,
ROUND(100.0 * COUNT(CASE WHEN status = 'ERROR' THEN 1 END) / COUNT(*), 2) as error_percentage
FROM workflow_executions;
```
**Detailed Error Analysis:**
```sql
SELECT
workflow_name,
status,
COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY workflow_name), 2) as percentage
FROM workflow_executions
GROUP BY workflow_name, status
ORDER BY workflow_name, count DESC;
```
**Error Details for Investigation:**
```sql
SELECT
workflow_name,
status,
error_message,
COUNT(*) as occurrences,
MAX(executed_at) as latest_error
FROM workflow_executions
WHERE status = 'ERROR'
GROUP BY workflow_name, status, error_message
ORDER BY occurrences DESC;
```
**Expected Behavior:**
- Error rate < 5%
- No recurring errors (recurrence points to a systemic issue)
- Quick recovery from transient errors
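The <5% expectation maps directly onto the error/total counts returned by the overall error-rate query above; a hedged sketch (helper name and inputs are illustrative):

```shell
#!/bin/sh
# Sketch: evaluate the <5% error-rate expectation from raw counts.
error_rate_check() {
  errors=$1; total=$2
  pct=$(( 100 * errors / total ))  # multiply first to keep whole-percent precision
  if [ "$pct" -lt 5 ]; then
    echo "error rate ${pct}%: OK"
  else
    echo "error rate ${pct}%: ALERT"
  fi
}
error_rate_check 3 100
error_rate_check 12 100
```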
---
## Troubleshooting Guide
### Workflow A (Mail Processing) - Not Running
**Symptoms:**
- No new conversations being processed
- N8N logs show no activity
- PostgreSQL query returns unchanged row count
**Troubleshooting Steps:**
1. **Check if workflow trigger is active:**
```bash
docker-compose logs -f n8n | grep "workflow-a"
```
2. **Verify Cron trigger configuration:**
- Log into n8n UI at `https://<SUBDOMAIN>.<DOMAIN>`
- Navigate to workflow-a
- Check cron expression (typically: `0 */5 * * * *` for every 5 minutes)
- Verify "Active" toggle is ON
3. **Test Freescout API credentials:**
```bash
docker-compose exec n8n curl -X GET \
-H "Authorization: Bearer ${FREESCOUT_API_TOKEN}" \
https://<freescout-instance>/api/v1/conversations
```
4. **Check Freescout API reachability:**
```bash
docker-compose exec n8n ping <freescout-instance>
docker-compose exec n8n curl -I https://<freescout-instance>/api/v1/health
```
5. **Review n8n logs for errors:**
```bash
docker-compose logs n8n | grep -i "error\|exception" | tail -20
```
6. **Verify PostgreSQL connection:**
```bash
docker-compose logs n8n | grep -i "database\|postgres"
```
---
### Workflow B (AI Suggestions) - Not Triggering
**Symptoms:**
- No new AI suggestions in Freescout
- workflow_executions table shows no recent B entries
- knowledge_base_updates status stuck in PENDING
**Troubleshooting Steps:**
1. **Check if Freescout custom field is being updated:**
```sql
SELECT * FROM freescout_conversation_custom_fields
WHERE field_name = 'AI_SUGGESTION_STATUS'
ORDER BY updated_at DESC
LIMIT 10;
```
2. **Verify polling interval:**
- Check n8n workflow B settings
- Polling trigger should be running (typically every 1 minute)
- Confirm: `docker-compose logs n8n | grep -i "polling\|workflow-b"`
3. **Check webhook configuration:**
```bash
# If using webhook instead of polling
docker-compose logs -f n8n | grep -i "webhook"
```
4. **Review Freescout API response:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT * FROM api_logs WHERE endpoint LIKE '%conversation%' ORDER BY timestamp DESC LIMIT 5;"
```
5. **Verify OpenAI/AI provider connectivity:**
```bash
docker-compose logs n8n | grep -i "openai\|api\|llm" | tail -20
```
6. **Check if there are unprocessed conversations:**
```sql
SELECT COUNT(*) as pending_conversations
FROM workflow_executions
WHERE workflow_name = 'workflow-a'
AND status = 'success'
AND ai_suggestion_generated = false
AND created_at > NOW() - INTERVAL '1 hour';
```
---
### Workflow C (KB Storage) - Not Saving to Milvus
**Symptoms:**
- knowledge_base table updates but Milvus count doesn't increase
- KB search returns no results
- Milvus health check failures
**Troubleshooting Steps:**
1. **Check Milvus health status:**
```bash
docker-compose exec milvus curl -s http://localhost:9091/healthz | jq .
```
2. **Verify Milvus is running:**
```bash
docker-compose ps milvus
docker-compose logs milvus | tail -30
```
3. **Check if embeddings are being generated:**
```sql
SELECT COUNT(*) as embeddings_generated
FROM knowledge_base
WHERE embedding IS NOT NULL;
```
4. **Verify Milvus connection in n8n logs:**
```bash
docker-compose logs n8n | grep -i "milvus\|embedding" | tail -20
```
5. **Test Milvus directly:**
```bash
docker-compose exec milvus python3 << 'EOF'
from pymilvus import connections, Collection
connections.connect('default', host='localhost', port=19530)
try:
    collection = Collection('knowledge_base')
    print(f'✓ Milvus connected, collection entities: {collection.num_entities}')
except Exception as e:
    print(f'✗ Milvus error: {e}')
EOF
```
6. **Check for rate limiting or connection timeouts:**
```bash
docker-compose logs n8n | grep -i "timeout\|connection\|refused" | tail -20
```
7. **Verify vector dimension matches:**
- Check embedding model (should match Milvus collection definition)
- Default: 1536 dimensions (OpenAI embeddings)
```sql
SELECT vector_dimension FROM milvus_schema WHERE collection_name = 'knowledge_base';
```
---
## Logs & Debugging Commands
### View Real-time Logs
**N8N Logs:**
```bash
# All n8n logs
docker-compose logs -f n8n
# Follow specific keywords
docker-compose logs -f n8n | grep -i "error\|workflow\|processed"
# Last 100 lines
docker-compose logs --tail 100 n8n
```
**PostgreSQL Logs:**
```bash
# View recent PostgreSQL operations
docker-compose logs -f postgres
# Check database activity
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT now(), datname, usename, state FROM pg_stat_activity;"
```
**Milvus Logs:**
```bash
# View Milvus startup and operation logs
docker-compose logs -f milvus
# Check Milvus status
docker-compose exec milvus curl -s http://localhost:9091/healthz
```
### Database Inspection
**Recent Workflow Executions:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT workflow_name, status, executed_at, error_message FROM workflow_executions ORDER BY executed_at DESC LIMIT 10;"
```
**KB Updates Status:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT status, COUNT(*) FROM knowledge_base_updates GROUP BY status;"
```
**Last 24h Activity:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT DATE(executed_at) as date, workflow_name, status, COUNT(*) as count
FROM workflow_executions
WHERE executed_at > NOW() - INTERVAL '24 hours'
GROUP BY DATE(executed_at), workflow_name, status
ORDER BY date DESC, workflow_name;"
```
### Performance Monitoring
**PostgreSQL Connection Count:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT count(*) as connections FROM pg_stat_activity;"
```
**PostgreSQL Cache Hit Ratio:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM pg_statio_user_tables;"
```
**Disk Usage:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
"SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;"
```
### Debugging Network Issues
**Test connectivity between services:**
```bash
# From n8n to PostgreSQL
docker-compose exec n8n ping postgres
# From n8n to Milvus
docker-compose exec n8n curl -v http://milvus:9091/healthz
# From n8n to Freescout
docker-compose exec n8n ping <freescout-host>
```
---
## Alert Thresholds
Configure monitoring/alerting for these conditions:
| Metric | Threshold | Action |
|--------|-----------|--------|
| Error Rate | > 5% | Page on-call, review workflow logs |
| KB Growth Stalled | 0 entries in 4 hours | Check Milvus health and embeddings |
| Approval Rate | < 50% | Review AI suggestion quality |
| Processing Rate | Drop > 50% | Check Freescout connection |
| Milvus Health | Not healthy | Restart Milvus, check etcd/minio |
| PostgreSQL Connections | > 80% of max | Investigate connection leaks |
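The "Processing Rate drop > 50%" row needs a baseline to evaluate; a minimal sketch, assuming a previous-window count and a current-window count are fed in from the metrics queries (the `check_drop` helper and the sample numbers are illustrative):

```shell
#!/bin/sh
# Sketch: alert when the processing rate drops more than 50% versus a
# baseline window.
check_drop() {
  baseline=$1; current=$2
  drop=$(( 100 * (baseline - current) / baseline ))
  if [ "$drop" -gt 50 ]; then
    echo "rate drop ${drop}%: ALERT"
  else
    echo "rate drop ${drop}%: OK"
  fi
}
check_drop 200 180
check_drop 200 60
```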
---
## Regular Maintenance
### Daily
- [ ] Check error rate < 5%
- [ ] Verify KB growth is progressing
- [ ] Review Freescout API response times
### Weekly
- [ ] Analyze approval rate trends
- [ ] Check PostgreSQL disk usage
- [ ] Review n8n workflow performance
### Monthly
- [ ] Full system health audit
- [ ] Database maintenance (VACUUM, ANALYZE)
- [ ] Log rotation verification
- [ ] Capacity planning review
---
## Version Information
- **n8n**: Latest from `docker.n8n.io/n8nio/n8n`
- **PostgreSQL**: 15-alpine
- **Milvus**: v2.4.0
- **Logging Driver**: json-file with max 100MB per file, 10 files rotation
## Contact & Escalation
For issues not resolved by this guide:
1. Collect logs: `docker-compose logs > system_logs.txt`
2. Export database state for analysis
3. Contact DevOps team with reproducible steps