From a7a541aac5e46233613d244b36aa0c74b671fe5f Mon Sep 17 00:00:00 2001
From: Claude Code
Date: Mon, 16 Mar 2026 17:32:28 +0100
Subject: [PATCH] infra: logging and monitoring setup

---
 docker-compose.override.yml |  24 ++
 docs/MONITORING.md          | 455 ++++++++++++++++++++++++++++++++++++
 2 files changed, 479 insertions(+)
 create mode 100644 docker-compose.override.yml
 create mode 100644 docs/MONITORING.md

diff --git a/docker-compose.override.yml b/docker-compose.override.yml
new file mode 100644
index 0000000..bdd43dc
--- /dev/null
+++ b/docker-compose.override.yml
@@ -0,0 +1,24 @@
+services:
+  n8n:
+    environment:
+      - N8N_LOG_LEVEL=debug
+      - N8N_LOG_OUTPUT=console
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "100m"
+        max-file: "10"
+
+  postgres:
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "100m"
+        max-file: "10"
+
+  milvus:
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "100m"
+        max-file: "10"
diff --git a/docs/MONITORING.md b/docs/MONITORING.md
new file mode 100644
index 0000000..b140530
--- /dev/null
+++ b/docs/MONITORING.md
@@ -0,0 +1,455 @@
+# Monitoring & Logging Setup
+
+## Overview
+
+This document provides comprehensive monitoring and logging guidelines for the n8n AI Support Automation system. It includes key metrics, troubleshooting procedures, and log inspection commands.
+
+## Key Metrics
+
+### 1. Mail Processing Rate (Workflow A)
+
+**Description:** Track the number of conversations processed through the system.
+
+**N8N Logs:**
+```bash
+docker-compose logs -f n8n | grep "processed"
+```
+
+**PostgreSQL Query:**
+```sql
+SELECT COUNT(*) as total_executions,
+       COUNT(CASE WHEN status = 'success' THEN 1 END) as successful_executions,
+       ROUND(100.0 * COUNT(CASE WHEN status = 'success' THEN 1 END) / COUNT(*), 2) as success_rate
+FROM workflow_executions
+WHERE workflow_name = 'workflow-a';
+```
+
+**Expected Behavior:**
+- Consistent processing rate (depends on Freescout mail polling interval)
+- Success rate > 95%
+- Monitor for sudden drops in processing rate
+
+---
+
+### 2. Approval Rate (Workflow B)
+
+**Description:** Monitor the ratio of approved vs rejected KB updates from the AI suggestions.
+
+**PostgreSQL Query:**
+```sql
+SELECT status, COUNT(*) as count,
+       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
+FROM knowledge_base_updates
+GROUP BY status
+ORDER BY count DESC;
+```
+
+**Alternative Query for detailed breakdown:**
+```sql
+SELECT
+  status,
+  COUNT(*) as count,
+  AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_approval_time_seconds
+FROM knowledge_base_updates
+GROUP BY status;
+```
+
+**Expected Behavior:**
+- Majority of updates should be APPROVED (typically 70-90%)
+- REJECTED rate should be < 15%
+- PENDING updates should be resolved within 24 hours
+
+---
+
+### 3. KB Growth (Workflow C)
+
+**Description:** Track the growth of the knowledge base as new information is added.
+
+**Milvus Query:**
+```bash
+# First, connect to Milvus
+docker-compose exec milvus python3 -c "
+from pymilvus import connections, Collection
+
+connections.connect('default', host='localhost', port=19530)
+collection = Collection('knowledge_base')
+print(f'Total vectors: {collection.num_entities}')
+"
+```
+
+**PostgreSQL Query for tracking:**
+```sql
+SELECT COUNT(*) as total_entries,
+       COUNT(DISTINCT source) as unique_sources,
+       MAX(created_at) as latest_entry
+FROM knowledge_base
+WHERE status = 'approved';
+```
+
+**Daily Growth Query:**
+```sql
+SELECT DATE(created_at) as date, COUNT(*) as entries_added
+FROM knowledge_base
+WHERE status = 'approved'
+GROUP BY DATE(created_at)
+ORDER BY date DESC
+LIMIT 30;
+```
+
+**Expected Behavior:**
+- +1 vector per approved ticket (approximately)
+- Steady growth correlates with approved KB updates
+- Monitor for stalled growth (may indicate Milvus issues)
+
+---
+
+### 4. Error Rate
+
+**Description:** Monitor workflow execution errors across all workflows.
+
+**PostgreSQL Query - Overall Error Rate:**
+```sql
+SELECT
+  COUNT(*) as total_executions,
+  COUNT(CASE WHEN status = 'ERROR' THEN 1 END) as error_count,
+  ROUND(100.0 * COUNT(CASE WHEN status = 'ERROR' THEN 1 END) / COUNT(*), 2) as error_percentage
+FROM workflow_executions;
+```
+
+**Detailed Error Analysis:**
+```sql
+SELECT
+  workflow_name,
+  status,
+  COUNT(*) as count,
+  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY workflow_name), 2) as percentage
+FROM workflow_executions
+GROUP BY workflow_name, status
+ORDER BY workflow_name, count DESC;
+```
+
+**Error Details for Investigation:**
+```sql
+SELECT
+  workflow_name,
+  status,
+  error_message,
+  COUNT(*) as occurrences,
+  MAX(executed_at) as latest_error
+FROM workflow_executions
+WHERE status = 'ERROR'
+GROUP BY workflow_name, status, error_message
+ORDER BY occurrences DESC;
+```
+
+**Expected Behavior:**
+- Error rate < 5%
+- No recurring errors (recurrence indicates a systemic issue)
+- Quick recovery from transient errors
+
+---
+
+## Troubleshooting Guide
+
+### Workflow A (Mail Processing) - Not Running
+
+**Symptoms:**
+- No new conversations being processed
+- N8N logs show no activity
+- PostgreSQL query returns unchanged row count
+
+**Troubleshooting Steps:**
+
+1. **Check if workflow trigger is active:**
+   ```bash
+   docker-compose logs -f n8n | grep "workflow-a"
+   ```
+
+2. **Verify Cron trigger configuration:**
+   - Log into n8n UI at `https://.`
+   - Navigate to workflow-a
+   - Check cron expression (typically: `0 */5 * * * *` for every 5 minutes)
+   - Verify "Active" toggle is ON
+
+3. **Test Freescout API credentials:**
+   ```bash
+   docker-compose exec n8n curl -X GET \
+     -H "Authorization: Bearer ${FREESCOUT_API_TOKEN}" \
+     https:///api/v1/conversations
+   ```
+
+4. **Check Freescout API reachability:**
+   ```bash
+   docker-compose exec n8n ping
+   docker-compose exec n8n curl -I https:///api/v1/health
+   ```
+
+5. **Review n8n logs for errors:**
+   ```bash
+   docker-compose logs n8n | grep -i "error\|exception" | tail -20
+   ```
+
+6. **Verify PostgreSQL connection:**
+   ```bash
+   docker-compose logs n8n | grep -i "database\|postgres"
+   ```
+
+---
+
+### Workflow B (AI Suggestions) - Not Triggering
+
+**Symptoms:**
+- No new AI suggestions in Freescout
+- workflow_executions table shows no recent B entries
+- knowledge_base_updates status stuck in PENDING
+
+**Troubleshooting Steps:**
+
+1. **Check if Freescout custom field is being updated:**
+   ```sql
+   SELECT * FROM freescout_conversation_custom_fields
+   WHERE field_name = 'AI_SUGGESTION_STATUS'
+   ORDER BY updated_at DESC
+   LIMIT 10;
+   ```
+
+2. **Verify polling interval:**
+   - Check n8n workflow B settings
+   - Polling trigger should be running (typically every 1 minute)
+   - Confirm: `docker-compose logs n8n | grep -i "polling\|workflow-b"`
+
+3. **Check webhook configuration:**
+   ```bash
+   # If using webhook instead of polling
+   docker-compose logs -f n8n | grep -i "webhook"
+   ```
+
+4. **Review Freescout API response:**
+   ```bash
+   docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+     "SELECT * FROM api_logs WHERE endpoint LIKE '%conversation%' ORDER BY timestamp DESC LIMIT 5;"
+   ```
+
+5. **Verify OpenAI/AI provider connectivity:**
+   ```bash
+   docker-compose logs n8n | grep -i "openai\|api\|llm" | tail -20
+   ```
+
+6. **Check if there are unprocessed conversations:**
+   ```sql
+   SELECT COUNT(*) as pending_conversations
+   FROM workflow_executions
+   WHERE workflow_name = 'workflow-a'
+     AND status = 'success'
+     AND ai_suggestion_generated = false
+     AND created_at > NOW() - INTERVAL '1 hour';
+   ```
+
+---
+
+### Workflow C (KB Storage) - Not Saving to Milvus
+
+**Symptoms:**
+- knowledge_base table updates but Milvus count doesn't increase
+- KB search returns no results
+- Milvus health check failures
+
+**Troubleshooting Steps:**
+
+1. **Check Milvus health status:**
+   ```bash
+   docker-compose exec milvus curl -s http://localhost:9091/healthz | jq .
+   ```
+
+2. **Verify Milvus is running:**
+   ```bash
+   docker-compose ps milvus
+   docker-compose logs milvus | tail -30
+   ```
+
+3. **Check if embeddings are being generated:**
+   ```sql
+   SELECT COUNT(*) as embeddings_generated
+   FROM knowledge_base
+   WHERE embedding IS NOT NULL;
+   ```
+
+4. **Verify Milvus connection in n8n logs:**
+   ```bash
+   docker-compose logs n8n | grep -i "milvus\|embedding" | tail -20
+   ```
+
+5. **Test Milvus directly:**
+   ```bash
+   docker-compose exec -T milvus python3 << 'EOF'
+   from pymilvus import connections, Collection
+   connections.connect('default', host='localhost', port=19530)
+   try:
+       collection = Collection('knowledge_base')
+       print(f'✓ Milvus connected, collection entities: {collection.num_entities}')
+   except Exception as e:
+       print(f'✗ Milvus error: {e}')
+   EOF
+   ```
+
+6. **Check for rate limiting or connection timeouts:**
+   ```bash
+   docker-compose logs n8n | grep -i "timeout\|connection\|refused" | tail -20
+   ```
+
+7. **Verify vector dimension matches:**
+   - Check embedding model (should match Milvus collection definition)
+   - Default: 1536 dimensions (OpenAI embeddings)
+   ```sql
+   SELECT vector_dimension FROM milvus_schema WHERE collection_name = 'knowledge_base';
+   ```
+
+---
+
+## Logs & Debugging Commands
+
+### View Real-time Logs
+
+**N8N Logs:**
+```bash
+# All n8n logs
+docker-compose logs -f n8n
+
+# Follow specific keywords
+docker-compose logs -f n8n | grep -i "error\|workflow\|processed"
+
+# Last 100 lines
+docker-compose logs --tail 100 n8n
+```
+
+**PostgreSQL Logs:**
+```bash
+# View recent PostgreSQL operations
+docker-compose logs -f postgres
+
+# Check database activity
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT now(), datname, usename, state FROM pg_stat_activity;"
+```
+
+**Milvus Logs:**
+```bash
+# View Milvus startup and operation logs
+docker-compose logs -f milvus
+
+# Check Milvus status
+docker-compose exec milvus curl -s http://localhost:9091/healthz
+```
+
+### Database Inspection
+
+**Recent Workflow Executions:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT workflow_name, status, executed_at, error_message FROM workflow_executions ORDER BY executed_at DESC LIMIT 10;"
+```
+
+**KB Updates Status:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT status, COUNT(*) FROM knowledge_base_updates GROUP BY status;"
+```
+
+**Last 24h Activity:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT DATE(executed_at) as date, workflow_name, status, COUNT(*) as count
+   FROM workflow_executions
+   WHERE executed_at > NOW() - INTERVAL '24 hours'
+   GROUP BY DATE(executed_at), workflow_name, status
+   ORDER BY date DESC, workflow_name;"
+```
+
+### Performance Monitoring
+
+**PostgreSQL Connection Count:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT count(*) as connections FROM pg_stat_activity;"
+```
+
+**PostgreSQL Cache Hit Ratio:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
+   FROM pg_statio_user_tables;"
+```
+
+**Disk Usage:**
+```bash
+docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
+  "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
+   FROM pg_tables
+   WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
+   ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;"
+```
+
+### Debugging Network Issues
+
+**Test connectivity between services:**
+```bash
+# From n8n to PostgreSQL
+docker-compose exec n8n ping postgres
+
+# From n8n to Milvus
+docker-compose exec n8n curl -v http://milvus:9091/healthz
+
+# From n8n to Freescout
+docker-compose exec n8n ping
+```
+
+---
+
+## Alert Thresholds
+
+Configure monitoring/alerting for these conditions:
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| Error Rate | > 5% | Page on-call, review workflow logs |
+| KB Growth Stalled | 0 entries in 4 hours | Check Milvus health and embeddings |
+| Approval Rate | < 50% | Review AI suggestion quality |
+| Processing Rate | Drop > 50% | Check Freescout connection |
+| Milvus Health | Not healthy | Restart Milvus, check etcd/minio |
+| PostgreSQL Connections | > 80% of max | Investigate connection leaks |
+
+---
+
+## Regular Maintenance
+
+### Daily
+- [ ] Check error rate < 5%
+- [ ] Verify KB growth is progressing
+- [ ] Review Freescout API response times
+
+### Weekly
+- [ ] Analyze approval rate trends
+- [ ] Check PostgreSQL disk usage
+- [ ] Review n8n workflow performance
+
+### Monthly
+- [ ] Full system health audit
+- [ ] Database maintenance (VACUUM, ANALYZE)
+- [ ] Log rotation verification
+- [ ] Capacity planning review
+
+---
+
+## Version Information
+
+- **n8n**: Latest from `docker.n8n.io/n8nio/n8n`
+- **PostgreSQL**: 15-alpine
+- **Milvus**: v2.4.0
+- **Logging Driver**: json-file, max 100 MB per file, rotated across 10 files
+
+## Contact & Escalation
+
+For issues not resolved by this guide:
+1. Collect logs: `docker-compose logs > system_logs.txt`
+2. Export database state for analysis
+3. Contact DevOps team with reproducible steps
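
The Alert Thresholds table in `docs/MONITORING.md` could be checked programmatically. The following is a minimal Python sketch under stated assumptions: `Metrics`, its field names, and `evaluate_alerts` are hypothetical and do not appear in the patch, and the caller is assumed to collect the values from the SQL queries and health checks the document already lists. It only encodes the comparisons from the table and does not send notifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    """Snapshot of monitored values. All field names are hypothetical."""
    error_rate_pct: float       # failed executions / total executions * 100
    kb_entries_last_4h: int     # approved KB entries added in the last 4 hours
    approval_rate_pct: float    # approved KB updates / all KB updates * 100
    processing_drop_pct: float  # drop in processing rate versus baseline, in percent
    milvus_healthy: bool        # outcome of the Milvus health check
    pg_connections: int         # current PostgreSQL connection count
    pg_max_connections: int     # PostgreSQL max_connections setting


def evaluate_alerts(m: Metrics) -> List[str]:
    """Return the names of the table rows whose threshold is breached."""
    alerts: List[str] = []
    if m.error_rate_pct > 5:
        alerts.append("Error Rate")
    if m.kb_entries_last_4h == 0:
        alerts.append("KB Growth Stalled")
    if m.approval_rate_pct < 50:
        alerts.append("Approval Rate")
    if m.processing_drop_pct > 50:
        alerts.append("Processing Rate")
    if not m.milvus_healthy:
        alerts.append("Milvus Health")
    # Guard against a zero or unknown max before applying the 80% rule.
    if m.pg_max_connections > 0 and m.pg_connections > 0.8 * m.pg_max_connections:
        alerts.append("PostgreSQL Connections")
    return alerts


if __name__ == "__main__":
    # Example values only; real numbers would come from the documented queries.
    snapshot = Metrics(
        error_rate_pct=7.5,
        kb_entries_last_4h=2,
        approval_rate_pct=82.0,
        processing_drop_pct=10.0,
        milvus_healthy=True,
        pg_connections=40,
        pg_max_connections=100,
    )
    for name in evaluate_alerts(snapshot):
        print(f"ALERT: {name}")
```

The comparisons are strict, matching the table's `>` and `<` wording, so a value sitting exactly on a threshold does not raise an alert.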