infra: logging and monitoring setup
docs/MONITORING.md
@@ -0,0 +1,455 @@

# Monitoring & Logging Setup

## Overview

This document describes how to monitor the n8n AI Support Automation system. It covers the key metrics to track, troubleshooting procedures for each workflow, and commands for inspecting logs.

## Key Metrics

### 1. Mail Processing Rate (Workflow A)

**Description:** Track the number of conversations processed through the system.

**n8n Logs:**
```bash
docker-compose logs -f n8n | grep "processed"
```

**PostgreSQL Query:**
```sql
-- NULLIF avoids a division-by-zero error when no executions have been recorded yet.
SELECT COUNT(*) as total_executions,
       COUNT(CASE WHEN status = 'success' THEN 1 END) as successful_executions,
       ROUND(100.0 * COUNT(CASE WHEN status = 'success' THEN 1 END) / NULLIF(COUNT(*), 0), 2) as success_rate
FROM workflow_executions
WHERE workflow_name = 'workflow-a';
```

**Expected Behavior:**
- Consistent processing rate (depends on Freescout mail polling interval)
- Success rate > 95%
- Monitor for sudden drops in processing rate
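
A "sudden drop" can be made concrete by comparing the latest interval against the average of the preceding ones. The sketch below is one possible way to do that, not part of the deployed system: the function name, the hourly sample counts, and the choice of a simple mean as baseline are all assumptions. The 50% default mirrors the "Drop > 50%" row in the Alert Thresholds table.

```python
from statistics import mean


def processing_drop(counts, threshold=0.5):
    """Compare the latest interval with the mean of the earlier ones.

    counts: processed-conversation counts per interval, oldest first.
    Returns (fractional_drop, alert), where alert is True if the drop
    exceeds the threshold.
    """
    if len(counts) < 2:
        raise ValueError("need at least two intervals to compare")
    *history, latest = counts
    baseline = mean(history)
    if baseline == 0:
        # Nothing was processed before, so there is nothing to drop from.
        return 0.0, False
    drop = (baseline - latest) / baseline
    return drop, drop > threshold


# Hypothetical hourly counts taken from workflow_executions.
drop, alert = processing_drop([40, 44, 36, 15])
print(f"drop={drop:.1%} alert={alert}")
```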

---

### 2. Approval Rate (Workflow B)

**Description:** Monitor the ratio of approved vs rejected KB updates from the AI suggestions.

**PostgreSQL Query:**
```sql
SELECT status, COUNT(*) as count,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM knowledge_base_updates
GROUP BY status
ORDER BY count DESC;
```

**Alternative Query for detailed breakdown:**
```sql
SELECT
  status,
  COUNT(*) as count,
  AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) as avg_approval_time_seconds
FROM knowledge_base_updates
GROUP BY status;
```

**Expected Behavior:**
- Majority of updates should be APPROVED (typically 70-90%)
- REJECTED rate should be < 15%
- PENDING updates should be resolved within 24 hours
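
The expected ranges above can be checked automatically once the per-status counts have been fetched. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the input is a plain mapping of status to row count (as the first query returns), and the status literals are taken to be `APPROVED` and `REJECTED`.

```python
def approval_warnings(counts, min_approved=70.0, max_rejected=15.0):
    """Return human-readable warnings for an out-of-range approval mix.

    counts: mapping of status -> number of knowledge_base_updates rows.
    """
    total = sum(counts.values())
    if total == 0:
        return ["no knowledge_base_updates rows yet"]
    approved = 100.0 * counts.get("APPROVED", 0) / total
    rejected = 100.0 * counts.get("REJECTED", 0) / total
    warnings = []
    if approved < min_approved:
        warnings.append(f"approved {approved:.1f}% is below {min_approved:.0f}%")
    if rejected >= max_rejected:
        warnings.append(f"rejected {rejected:.1f}% is not below {max_rejected:.0f}%")
    return warnings


# Hypothetical counts: a healthy mix yields no warnings.
print(approval_warnings({"APPROVED": 80, "REJECTED": 10, "PENDING": 10}))
```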

---

### 3. KB Growth (Workflow C)

**Description:** Track the growth of the knowledge base as new information is added.

**Milvus Query:**
```bash
# Connect to Milvus and print the entity count.
# Assumes python3 and pymilvus are available inside the milvus container;
# otherwise run the same Python from any host that can reach port 19530.
docker-compose exec milvus python3 -c "
from pymilvus import connections, Collection

connections.connect('default', host='localhost', port=19530)
collection = Collection('knowledge_base')
print(f'Total vectors: {collection.num_entities}')
"
```

**PostgreSQL Query for tracking:**
```sql
SELECT COUNT(*) as total_entries,
       COUNT(DISTINCT source) as unique_sources,
       MAX(created_at) as latest_entry
FROM knowledge_base
WHERE status = 'approved';
```

**Daily Growth Query:**
```sql
SELECT DATE(created_at) as date, COUNT(*) as entries_added
FROM knowledge_base
WHERE status = 'approved'
GROUP BY DATE(created_at)
ORDER BY date DESC
LIMIT 30;
```

**Expected Behavior:**
- +1 vector per approved ticket (approximately)
- Steady growth correlates with approved KB updates
- Monitor for stalled growth (may indicate Milvus issues)
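
Because each approved ticket should add roughly one vector, the PostgreSQL and Milvus counts can be compared to spot drift between the two stores. The helper below is an illustrative sketch only: its name and the 5% tolerance are assumptions, and the two inputs are expected to come from the queries above.

```python
def kb_drift(approved_rows, milvus_vectors, tolerance=0.05):
    """Compare the approved-row count with the Milvus entity count.

    Returns (difference, within_tolerance). A positive difference means
    PostgreSQL holds approved entries that are missing from Milvus.
    """
    diff = approved_rows - milvus_vectors
    if approved_rows == 0:
        return diff, milvus_vectors == 0
    return diff, abs(diff) / approved_rows <= tolerance


# Hypothetical counts: 200 approved rows, 198 vectors.
print(kb_drift(200, 198))
```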

---

### 4. Error Rate

**Description:** Monitor workflow execution errors across all workflows.

**PostgreSQL Query - Overall Error Rate:**
```sql
-- NULLIF avoids a division-by-zero error on an empty table.
SELECT
  COUNT(*) as total_executions,
  COUNT(CASE WHEN status = 'ERROR' THEN 1 END) as error_count,
  ROUND(100.0 * COUNT(CASE WHEN status = 'ERROR' THEN 1 END) / NULLIF(COUNT(*), 0), 2) as error_percentage
FROM workflow_executions;
```

**Detailed Error Analysis:**
```sql
SELECT
  workflow_name,
  status,
  COUNT(*) as count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY workflow_name), 2) as percentage
FROM workflow_executions
GROUP BY workflow_name, status
ORDER BY workflow_name, count DESC;
```

**Error Details for Investigation:**
```sql
SELECT
  workflow_name,
  status,
  error_message,
  COUNT(*) as occurrences,
  MAX(executed_at) as latest_error
FROM workflow_executions
WHERE status = 'ERROR'
GROUP BY workflow_name, status, error_message
ORDER BY occurrences DESC;
```

**Expected Behavior:**
- Error rate < 5%
- No recurring errors (a recurring error indicates a systemic issue)
- Quick recovery from transient errors
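
To separate recurring errors from transient ones, the rows returned by the investigation query can be grouped by workflow and message. This is a rough sketch rather than an established procedure: the function name and the cut-off of three occurrences are assumptions to tune.

```python
from collections import Counter


def recurring_errors(rows, min_occurrences=3):
    """Group (workflow_name, error_message) pairs and keep the frequent ones.

    rows: iterable of (workflow_name, error_message) tuples.
    Returns a list of (workflow_name, error_message, occurrences),
    most frequent first.
    """
    tally = Counter(rows)
    return [
        (workflow, message, n)
        for (workflow, message), n in tally.most_common()
        if n >= min_occurrences
    ]


# Hypothetical error rows.
rows = [("workflow-a", "timeout")] * 3 + [("workflow-b", "invalid API key")]
print(recurring_errors(rows))
```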

---

## Troubleshooting Guide

### Workflow A (Mail Processing) - Not Running

**Symptoms:**
- No new conversations being processed
- n8n logs show no activity
- PostgreSQL query returns unchanged row count

**Troubleshooting Steps:**

1. **Check if workflow trigger is active:**
   ```bash
   docker-compose logs -f n8n | grep "workflow-a"
   ```

2. **Verify Cron trigger configuration:**
   - Log into n8n UI at `https://<SUBDOMAIN>.<DOMAIN>`
   - Navigate to workflow-a
   - Check cron expression (typically: `0 */5 * * * *` for every 5 minutes; the six-field form starts with a seconds field)
   - Verify "Active" toggle is ON

3. **Test Freescout API credentials:**
   ```bash
   # ${FREESCOUT_API_TOKEN} is expanded by the host shell, so export it there first.
   docker-compose exec n8n curl -X GET \
     -H "Authorization: Bearer ${FREESCOUT_API_TOKEN}" \
     https://<freescout-instance>/api/v1/conversations
   ```

4. **Check Freescout API reachability:**
   ```bash
   # -c 3 stops ping after three packets instead of running until interrupted.
   docker-compose exec n8n ping -c 3 <freescout-instance>
   docker-compose exec n8n curl -I https://<freescout-instance>/api/v1/health
   ```

5. **Review n8n logs for errors:**
   ```bash
   docker-compose logs n8n | grep -i "error\|exception" | tail -20
   ```

6. **Verify PostgreSQL connection:**
   ```bash
   docker-compose logs n8n | grep -i "database\|postgres"
   ```
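
When checking the cron expression in step 2, it helps to see which field is which. The sketch below assumes the six-field layout with a leading seconds field, which is what `0 */5 * * * *` meaning "every 5 minutes" implies. The function names are hypothetical, and only the simple `*/N` step form is recognised.

```python
CRON_FIELDS = ("second", "minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression):
    """Map each field of a six-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def minute_step(expression):
    """Return N when the minute field has the form */N, otherwise None."""
    minute = split_cron(expression)["minute"]
    if minute.startswith("*/") and minute[2:].isdigit():
        return int(minute[2:])
    return None


print(split_cron("0 */5 * * * *"))
print(minute_step("0 */5 * * * *"))
```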

---

### Workflow B (AI Suggestions) - Not Triggering

**Symptoms:**
- No new AI suggestions in Freescout
- workflow_executions table shows no recent workflow-b entries
- knowledge_base_updates status stuck in PENDING

**Troubleshooting Steps:**

1. **Check if Freescout custom field is being updated:**
   ```sql
   SELECT * FROM freescout_conversation_custom_fields
   WHERE field_name = 'AI_SUGGESTION_STATUS'
   ORDER BY updated_at DESC
   LIMIT 10;
   ```

2. **Verify polling interval:**
   - Check n8n workflow B settings
   - Polling trigger should be running (typically every 1 minute)
   - Confirm: `docker-compose logs n8n | grep -i "polling\|workflow-b"`

3. **Check webhook configuration:**
   ```bash
   # If using webhook instead of polling
   docker-compose logs -f n8n | grep -i "webhook"
   ```

4. **Review Freescout API response:**
   ```bash
   docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
     "SELECT * FROM api_logs WHERE endpoint LIKE '%conversation%' ORDER BY timestamp DESC LIMIT 5;"
   ```

5. **Verify OpenAI/AI provider connectivity:**
   ```bash
   docker-compose logs n8n | grep -i "openai\|api\|llm" | tail -20
   ```

6. **Check if there are unprocessed conversations:**
   ```sql
   SELECT COUNT(*) as pending_conversations
   FROM workflow_executions
   WHERE workflow_name = 'workflow-a'
     AND status = 'success'
     AND ai_suggestion_generated = false
     AND created_at > NOW() - INTERVAL '1 hour';
   ```
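
For the "stuck in PENDING" symptom, it can be useful to list exactly which updates have waited longer than the 24-hour target. The following is a small sketch under assumptions: rows are `(update_id, status, created_at)` tuples with timezone-aware timestamps, and `update_id` is a hypothetical column name.

```python
from datetime import datetime, timedelta, timezone


def stuck_pending(rows, now, max_age=timedelta(hours=24)):
    """Return the ids of PENDING updates older than max_age.

    rows: iterable of (update_id, status, created_at) tuples.
    now: the reference time, passed in so the check is reproducible.
    """
    return [
        update_id
        for update_id, status, created_at in rows
        if status == "PENDING" and now - created_at > max_age
    ]


# Hypothetical rows from knowledge_base_updates.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    (1, "PENDING", now - timedelta(hours=30)),
    (2, "PENDING", now - timedelta(hours=2)),
    (3, "APPROVED", now - timedelta(hours=48)),
]
print(stuck_pending(rows, now))
```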

---

### Workflow C (KB Storage) - Not Saving to Milvus

**Symptoms:**
- knowledge_base table updates but Milvus count doesn't increase
- KB search returns no results
- Milvus health check failures

**Troubleshooting Steps:**

1. **Check Milvus health status:**
   ```bash
   docker-compose exec milvus curl -s http://localhost:9091/healthz
   ```

2. **Verify Milvus is running:**
   ```bash
   docker-compose ps milvus
   docker-compose logs milvus | tail -30
   ```

3. **Check if embeddings are being generated:**
   ```sql
   SELECT COUNT(*) as embeddings_generated
   FROM knowledge_base
   WHERE embedding IS NOT NULL;
   ```

4. **Verify Milvus connection in n8n logs:**
   ```bash
   docker-compose logs n8n | grep -i "milvus\|embedding" | tail -20
   ```

5. **Test Milvus directly:**
   ```bash
   # -T disables TTY allocation so the heredoc can be fed to python3 on stdin.
   docker-compose exec -T milvus python3 << 'EOF'
   from pymilvus import connections, Collection
   connections.connect('default', host='localhost', port=19530)
   try:
       collection = Collection('knowledge_base')
       print(f'✓ Milvus connected, collection entities: {collection.num_entities}')
   except Exception as e:
       print(f'✗ Milvus error: {e}')
   EOF
   ```

6. **Check for rate limiting or connection timeouts:**
   ```bash
   docker-compose logs n8n | grep -i "timeout\|connection\|refused" | tail -20
   ```

7. **Verify vector dimension matches:**
   - Check embedding model (should match Milvus collection definition)
   - Default: 1536 dimensions (OpenAI embeddings)
   ```sql
   SELECT vector_dimension FROM milvus_schema WHERE collection_name = 'knowledge_base';
   ```
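
A dimension mismatch (step 7) is easiest to catch before the insert reaches Milvus. The guard below is only a sketch: the function name is hypothetical, and 1536 is the default quoted above, so it must be changed if the collection was created with a different embedding model.

```python
import math

EXPECTED_DIM = 1536  # default from this guide; adjust to the collection schema


def check_embedding(vector, expected_dim=EXPECTED_DIM):
    """Raise ValueError if the vector cannot be stored in the collection."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, collection expects {expected_dim}"
        )
    if not all(math.isfinite(x) for x in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return True


print(check_embedding([0.0] * 1536))
```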

---

## Logs & Debugging Commands

### View Real-time Logs

**n8n Logs:**
```bash
# All n8n logs
docker-compose logs -f n8n

# Follow specific keywords
docker-compose logs -f n8n | grep -i "error\|workflow\|processed"

# Last 100 lines
docker-compose logs --tail 100 n8n
```

**PostgreSQL Logs:**
```bash
# View recent PostgreSQL operations
docker-compose logs -f postgres

# Check database activity
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT now(), datname, usename, state FROM pg_stat_activity;"
```

**Milvus Logs:**
```bash
# View Milvus startup and operation logs
docker-compose logs -f milvus

# Check Milvus status
docker-compose exec milvus curl -s http://localhost:9091/healthz
```
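
When a `grep` filter returns too many lines to read, a per-keyword tally gives a quicker overview. This is a hedged sketch, not an existing tool: the function name and the keyword list are assumptions, and the sample lines are invented for illustration. In practice the input would be lines captured from `docker-compose logs`.

```python
import re
from collections import Counter

KEYWORDS = ("error", "exception", "timeout", "refused")


def tally_keywords(lines, keywords=KEYWORDS):
    """Count how many lines mention each keyword, ignoring case."""
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    counts = Counter()
    for line in lines:
        # A set ensures a line counts once per keyword, however often it repeats.
        for keyword in {match.lower() for match in pattern.findall(line)}:
            counts[keyword] += 1
    return counts


# Invented sample lines standing in for captured n8n log output.
sample = ["Error: connection refused", "ERROR timeout after 30s", "workflow ok"]
print(tally_keywords(sample))
```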

### Database Inspection

**Recent Workflow Executions:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT workflow_name, status, executed_at, error_message FROM workflow_executions ORDER BY executed_at DESC LIMIT 10;"
```

**KB Updates Status:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT status, COUNT(*) FROM knowledge_base_updates GROUP BY status;"
```

**Last 24h Activity:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT DATE(executed_at) as date, workflow_name, status, COUNT(*) as count
   FROM workflow_executions
   WHERE executed_at > NOW() - INTERVAL '24 hours'
   GROUP BY DATE(executed_at), workflow_name, status
   ORDER BY date DESC, workflow_name;"
```

### Performance Monitoring

**PostgreSQL Connection Count:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT count(*) as connections FROM pg_stat_activity;"
```

**PostgreSQL Cache Hit Ratio:**
```bash
# NULLIF returns NULL instead of failing when no blocks have been read or hit yet.
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) as ratio
   FROM pg_statio_user_tables;"
```
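
The two figures above reduce to simple ratios, shown here as a sketch so the arithmetic and edge cases are explicit. The function names are hypothetical. `max_connections` is assumed to be read separately (for example with `SHOW max_connections;`), and the result can be compared with the 80% threshold in the Alert Thresholds table.

```python
def cache_hit_ratio(blocks_hit, blocks_read):
    """Fraction of heap block requests served from cache, or None if there were none."""
    total = blocks_hit + blocks_read
    return None if total == 0 else blocks_hit / total


def connection_usage_pct(active, max_connections):
    """Active connections as a percentage of the configured maximum."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * active / max_connections


# Hypothetical readings.
print(cache_hit_ratio(990, 10))
print(connection_usage_pct(85, 100))
```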

**Disk Usage:**
```bash
docker-compose exec postgres psql -U kb_user -d n8n_kb -c \
  "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
   FROM pg_tables
   WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
   ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;"
```

### Debugging Network Issues

**Test connectivity between services:**
```bash
# From n8n to PostgreSQL (-c 3 stops after three packets)
docker-compose exec n8n ping -c 3 postgres

# From n8n to Milvus (same health endpoint as the Milvus status check in this guide)
docker-compose exec n8n curl -v http://milvus:9091/healthz

# From n8n to Freescout
docker-compose exec n8n ping -c 3 <freescout-host>
```

---

## Alert Thresholds

Configure monitoring/alerting for these conditions:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error Rate | > 5% | Page on-call, review workflow logs |
| KB Growth Stalled | 0 entries in 4 hours | Check Milvus health and embeddings |
| Approval Rate | < 50% | Review AI suggestion quality |
| Processing Rate | Drop > 50% | Check Freescout connection |
| Milvus Health | Not healthy | Restart Milvus, check etcd/minio |
| PostgreSQL Connections | > 80% of max | Investigate connection leaks |
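
If no alerting tool is in place yet, the numeric rows of the table can be expressed as data and evaluated by a small script. The sketch below is illustrative only: the metric keys and the `Alert` class are invented names, and the Milvus health row is left out because it is a yes/no check rather than a numeric threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Alert:
    metric: str                       # hypothetical key in the metrics mapping
    breached: Callable[[float], bool]  # True when the threshold is crossed
    action: str                        # action text from the table


ALERTS = [
    Alert("error_rate_pct", lambda v: v > 5, "Page on-call, review workflow logs"),
    Alert("kb_entries_last_4h", lambda v: v == 0, "Check Milvus health and embeddings"),
    Alert("approval_rate_pct", lambda v: v < 50, "Review AI suggestion quality"),
    Alert("processing_drop_pct", lambda v: v > 50, "Check Freescout connection"),
    Alert("pg_connections_pct", lambda v: v > 80, "Investigate connection leaks"),
]


def triggered(metrics):
    """Return the actions for every alert whose metric is present and breached."""
    return [
        alert.action
        for alert in ALERTS
        if alert.metric in metrics and alert.breached(metrics[alert.metric])
    ]


# Hypothetical readings: high error rate and stalled KB growth.
print(triggered({"error_rate_pct": 7.5, "approval_rate_pct": 82, "kb_entries_last_4h": 0}))
```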

---

## Regular Maintenance

### Daily
- [ ] Check error rate < 5%
- [ ] Verify KB growth is progressing
- [ ] Review Freescout API response times

### Weekly
- [ ] Analyze approval rate trends
- [ ] Check PostgreSQL disk usage
- [ ] Review n8n workflow performance

### Monthly
- [ ] Full system health audit
- [ ] Database maintenance (VACUUM, ANALYZE)
- [ ] Log rotation verification
- [ ] Capacity planning review

---

## Version Information

- **n8n**: Latest from `docker.n8n.io/n8nio/n8n`
- **PostgreSQL**: 15-alpine
- **Milvus**: v2.4.0
- **Logging Driver**: json-file with max 100MB per file, 10 files rotation

## Contact & Escalation

For issues not resolved by this guide:
1. Collect logs: `docker-compose logs > system_logs.txt`
2. Export database state for analysis
3. Contact DevOps team with reproducible steps