Claude Code caac04c684 docs: deployment and go-live documentation

2026-03-16 17:32:59 +01:00

Go-Live Checklist

Overview

This checklist guides the transition from staging to production. Work through the phases in order: One Week Before Go-Live, Go-Live Day, Launch Period Monitoring, and Post-24 Hour Validation.

Phase 1: One Week Before Go-Live

Timeline: T-7 days

Pre-Deployment Verification

E2E Tests Passed (100%)
- All workflow tests successful: bash tests/curl-test-collection.sh
- No critical bugs or failures
- Test results documented
Staging Environment Verified
- Staging deployment is identical to production
- Run full load test (simulate 100+ concurrent tickets)
- Verify integration with TEST Freescout account
- Verify integration with TEST Baramundi account
- All workflows processing correctly in staging
Production Database Setup
- PostgreSQL audit schema initialized
- Backup strategy configured and tested
- Database performance baseline recorded
- Disk space verified (minimum 20GB available)
API Credentials Verified
- Freescout API key tested and active
- Freescout custom fields created and working
- Baramundi API key tested and active
- n8n encryption key generated and secure
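
The disk space item above (minimum 20GB available) can be checked from a script rather than by eye. The following is a minimal sketch, assuming a POSIX `df` and treating 20GB as 20 GiB; the function names `kib_meets_gb` and `has_min_free_gb` are illustrative and not part of the project.

```shell
# Sketch only: helper names are hypothetical, not taken from the repository.

# Succeeds when AVAIL_KIB (free space in KiB) is at least MIN_GB GiB.
kib_meets_gb() {
  avail_kib=$1
  min_gb=$2
  [ "$avail_kib" -ge $((min_gb * 1024 * 1024)) ]
}

# Reads the free space of the filesystem holding TARGET via `df -Pk`
# (POSIX output format, column 4 = available 1 KiB blocks) and compares it.
has_min_free_gb() {
  target=$1
  min_gb=$2
  avail_kib=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
  kib_meets_gb "$avail_kib" "$min_gb"
}
```

For example, `has_min_free_gb /var/lib/docker 20 || echo "less than 20GB free"`, where the path is an assumption about where the database volume lives.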

Team Readiness

Team Training Completed
- Operations team trained on:
  - System architecture overview
  - Deployment and rollback procedures
  - Monitoring dashboard usage
  - Alert response procedures
  - Escalation paths
- Support team trained on:
  - Workflow functionality overview
  - Expected behavior and timing
  - How to verify system health
  - When to escalate issues
Documentation Review
- All team members reviewed DEPLOYMENT.md
- All team members reviewed MONITORING.md
- Runbooks reviewed and acknowledged
- Contact list updated (on-call schedule)
Backup Strategy Finalized
- Daily backup schedule defined
- Backup retention policy set (7 days minimum)
- Backup restore procedure tested
- Backup storage verified (separate location from production)
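
One possible way to enforce the 7-day retention policy is a small pruning function run after each daily backup. This is a sketch under assumptions: backups are plain `.sql` files in a single directory, and `find` supports `-maxdepth` and `-delete` (GNU and BSD do). The name `prune_old_backups` is hypothetical.

```shell
# Sketch only: prune_old_backups is an illustrative name.
# Usage: prune_old_backups DIR [DAYS]
# Deletes *.sql files directly inside DIR whose age in whole days is greater
# than DAYS (default 7), printing each path that is removed.
prune_old_backups() {
  dir=$1
  days=${2:-7}
  find "$dir" -maxdepth 1 -type f -name '*.sql' -mtime +"$days" -print -delete
}
```

Because `-mtime +7` only matches files more than seven full days old, at least seven days of backups are always kept, which matches the "7 days minimum" wording.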

Risk Mitigation

Rollback Plan Confirmed
- Rollback procedures documented
- Rollback tested in staging environment
- Estimated rollback time: < 30 minutes
- All team members trained on rollback
Communication Plan Ready
- Stakeholder notification list prepared
- Status page update process defined
- Internal update frequency established (every 30 minutes initially)
- Escalation contacts verified
Monitoring & Alerting
- All monitoring dashboards configured
- Alert recipients confirmed
- Alert thresholds set and validated
- On-call rotation established

Phase 2: Go-Live Day

Timeline: T-0 (Launch day)

Pre-Launch Checks (T-2 hours)

Final System Status
- All Docker services running and healthy
- docker-compose ps output verified
- All services show "Up (healthy)"
- No services in "Restarting" state
Service Health Verification
- n8n health check: curl -f http://localhost:5678/healthz
- PostgreSQL connection: docker-compose exec postgres pg_isready
- Milvus connectivity: Vector DB responding
- External integrations reachable (Freescout, Baramundi)
Database Integrity
- Audit schema verified: SELECT COUNT(*) FROM audit.workflows;
- No corruption or errors in logs
- Backup created and verified: ls -lh backups/
- Backup restore tested
n8n Workflows Status
- All 3 workflows imported successfully
- Workflow A (Mail Processing): Ready
- Workflow B (Approval Execution): Ready
- Workflow C (KB Update): Ready
- All workflows set to Inactive (will activate after final check)
Monitoring System Active
- Monitoring dashboard accessible
- All metric collectors running
- Alert system armed and tested
- Log aggregation working (docker-compose logs verified)
Final Pre-Launch Meeting
- All team members present and ready
- Roles and responsibilities confirmed:
  - Platform Lead: Overall coordination
  - n8n Administrator: Workflow management
  - Database Administrator: Database monitoring
  - System Administrator: Infrastructure monitoring
  - Support Lead: User support readiness
- Communication channels verified (Slack, phone, etc.)
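
The "all services show Up (healthy)" check can be automated by filtering the `docker-compose ps` table. The sketch below assumes the classic tabular output with a header row and a dashed separator; `unhealthy_services` is an illustrative name, and the service names in the usage note are examples.

```shell
# Sketch only: unhealthy_services is a hypothetical helper.
# Reads `docker-compose ps` output on stdin and prints the first column
# (the container name) of every row that does not contain "(healthy)".
unhealthy_services() {
  awk '
    NF == 0 { next }                        # skip blank lines
    $1 == "Name" || $1 == "NAME" { next }   # skip the header row
    /^-+$/ { next }                         # skip the dashed separator
    !/\(healthy\)/ { print $1 }             # report anything not healthy
  '
}
```

Running `docker-compose ps | unhealthy_services` prints nothing when every service is healthy. Services without a configured healthcheck show plain "Up" and are reported too, which is consistent with the checklist requiring "Up (healthy)" everywhere.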

Launch Window (T-0 hours)

Final Backup (T-15 minutes)
- Backup created immediately before activation
- Backup file verified and tested
- Backup location: backups/pre-golive-backup-$(date +%Y%m%d-%H%M%S).sql
Activate Workflows (T-0 minutes)
- n8n Dashboard accessed
- Workflow A (Mail Processing) activated:
  - Toggle "Active" switch ON
  - Verify activation confirmed in UI
  - Check logs: docker-compose logs -f n8n | grep "Workflow A"
- Workflow B (Approval Execution) activated
- Workflow C (KB Update) activated
- All three workflows showing "Active" status
Launch Announcement
- Internal team notified: "System is LIVE"
- Stakeholders notified of go-live
- Status page updated: "System operational"
- Time of launch recorded: __________
Confirm System Accepting Requests
- Send test email to Freescout inbox
- Verify ticket created in Freescout
- Verify n8n workflow triggered (check logs)
- Verify workflow execution started
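
The final backup step can be sketched as a few shell functions. This assumes the Compose service is named `postgres` and reuses the `n8n_user` and `n8n_production` names that appear in the database growth command in this checklist; the function names are hypothetical. A non-empty file is only a first sanity check and does not replace the restore test.

```shell
# Sketch only: function names are illustrative, not part of the project.

# Prints a timestamped path matching the checklist's naming scheme.
backup_filename() {
  printf 'backups/pre-golive-backup-%s.sql\n' "$(date +%Y%m%d-%H%M%S)"
}

# Succeeds only when the dump file exists and is non-empty.
backup_looks_valid() {
  [ -s "$1" ]
}

# Dumps the database to a new timestamped file and prints its path on success.
create_pre_golive_backup() {
  out=$(backup_filename)
  mkdir -p "$(dirname "$out")"
  # -T disables TTY allocation so the dump can be redirected to a file.
  docker-compose exec -T postgres pg_dump -U n8n_user n8n_production > "$out" &&
    backup_looks_valid "$out" &&
    printf '%s\n' "$out"
}
```

Calling `create_pre_golive_backup` from the directory that holds the Compose file would write the dump under `backups/` and print the resulting path, which can then be recorded in the launch log.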

Phase 3: Launch Period Monitoring (First 24 Hours)

Timeline: T+0 to T+24 hours

Continuous Monitoring (Every 15 minutes)

n8n Workflow Execution
- Command: docker-compose logs --tail=50 n8n
- Check for:
  - No error messages
  - Workflows executing successfully
  - No hung or stuck executions
- Log location: /d/n8n-compose/logs/n8n.log
Freescout Integration
- New tickets arriving in system
- Custom fields populated correctly
- No integration errors in Freescout logs
- Ticket processing speed acceptable
Baramundi Job Queue
- Check job queue status
- Verify jobs accepted from n8n
- Monitor job completion rate
- Check for failed jobs
Alert System
- All critical alerts functioning
- No false positive alerts
- Escalation procedures working
- On-call team responsive
Database Performance
- Query performance acceptable
- No locks or deadlocks
- Disk space usage normal
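- Command: docker-compose exec postgres psql -U n8n_user -d n8n_production -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;" (requires the pg_stat_statements extension; the column is mean_time before PostgreSQL 13)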
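
For the 15-minute log check, counting suspicious lines gives a quick number to compare between rounds. The sketch below assumes that problem lines contain the words "error" or "fatal" in any letter case, which is a guess about the log format rather than a documented n8n convention; `count_error_lines` is an illustrative name.

```shell
# Sketch only: the match pattern is an assumption about the log format.
# Reads log text on stdin and prints the number of lines that mention
# "error" or "fatal" (case-insensitive). Always returns success, so a
# count of 0 does not abort scripts running under `set -e`.
count_error_lines() {
  grep -ciE 'error|fatal' || true
}
```

A typical call would be `docker-compose logs --tail=200 n8n | count_error_lines`; a count that rises between two checks is a reason to read the log in full.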

Hourly System Status Report (First 6 hours)

Document every hour:

Hour 1 (T+1h)

Total tickets processed: _____
Total workflows executed: _____
Failed executions: _____
System health: [ ] Green [ ] Yellow [ ] Red
Issues encountered: _____

Hour 2 (T+2h)

Total tickets processed: _____
Total workflows executed: _____
Failed executions: _____
System health: [ ] Green [ ] Yellow [ ] Red
Issues encountered: _____

Hour 3-6 (T+3h to T+6h)

Repeat above for each hour
Escalate any issues immediately
Document all changes or interventions
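
To keep the hourly entries consistent, the report fields can be appended to a file by a small helper. This is a sketch; `append_hourly_report` and the report file name are hypothetical, and the values still have to be collected by the operator.

```shell
# Sketch only: append_hourly_report is an illustrative helper.
# Usage: append_hourly_report FILE HOUR TICKETS WORKFLOWS FAILED HEALTH ISSUES
# Appends one entry in the same layout as the hourly template, followed by
# a blank line.
append_hourly_report() {
  report_file=$1
  {
    printf 'Hour %s (T+%sh)\n' "$2" "$2"
    printf 'Total tickets processed: %s\n' "$3"
    printf 'Total workflows executed: %s\n' "$4"
    printf 'Failed executions: %s\n' "$5"
    printf 'System health: %s\n' "$6"
    printf 'Issues encountered: %s\n\n' "$7"
  } >> "$report_file"
}
```

For example, `append_hourly_report golive-report.txt 1 42 40 2 Yellow "two retries"` records the first hour.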

Functional Validation (T+2 hours and T+12 hours)

After 2 hours:

AI Suggestions Displayed
- Sample processed tickets show AI suggestions
- Suggestion accuracy acceptable
- Custom field updated with ai_suggestion
- Performance acceptable (< 5 second processing time)
Approval Workflow Operating
- HIGH priority tickets flagged for approval
- Approval custom field populated
- Notifications sent to approvers
- Approvals received and reflected in system
Knowledge Base Updates
- KB articles being created/updated
- Vector embeddings generated (Milvus)
- PostgreSQL KB table growing
- Query: SELECT COUNT(*) FROM audit.kb_articles;

After 12 hours (overnight validation):

Validate Overnight Processing
- All workflows executed correctly overnight
- No race conditions or deadlocks occurred
- Database backups completed successfully
- All alerts functioned as expected

Critical Metrics (Monitor Continuously)

# Check n8n workflow execution rate
curl -s -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "http://localhost:5678/api/v1/executions?limit=100" | jq '.data | length'

# Check database growth
docker exec n8n-postgres psql -U n8n_user -d n8n_production -c \
  "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
   FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;"

# Monitor CPU/Memory
docker stats --no-stream

Incident Response (If Issues Occur)

Critical Issue (System Down):

Immediately notify team lead
Assess severity and scope
Execute rollback if necessary (see DEPLOYMENT.md)
Document incident details
Begin root cause analysis

Performance Degradation:

Check system resources: docker stats
Check database locks: docker-compose exec postgres psql -U n8n_user -d n8n_production -c "SELECT pid, locktype, mode FROM pg_locks WHERE NOT granted;"
Scale resources if needed: docker-compose up -d --scale n8n=2 (only works if the n8n service has no fixed host port and n8n runs in queue mode)
Monitor improvement

Integration Failures:

Verify API credentials still valid
Check external service status
Review integration logs
Test connectivity manually
Retry or escalate

Phase 4: Post-24 Hour Validation (T+24 to T+7 days)

Day 2 Validation (T+24 hours)

Verify AI Suggestions Working
- Sample 10 random processed tickets
- AI suggestions present and relevant
- Suggestion accuracy rate > 80%
- Processing time < 5 seconds average
- Document findings: ________
Approval Workflow Performance
- All HIGH priority tickets flagged
- Approval response time < 2 hours
- Approval completion rate > 95%
- No pending approvals > 4 hours old
- Total approvals processed: _____
- Approval success rate: _____%
Baramundi Integration Validation
- Jobs submitted successfully
- Job queue processing normally
- Job completion rate > 90%
- No stuck or failed jobs
- Total jobs processed: _____
- Job success rate: _____%
Knowledge Base Growth
- KB articles being created
- Vector embeddings calculated
- Query performance acceptable
- Total KB articles: _____
- Total embeddings: _____
- Query response time: _____ ms
System Stability
- No service crashes
- No memory leaks
- Disk usage normal
- Database integrity verified
- No orphaned records
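
The percentage thresholds in this validation (for example approval completion rate > 95% and job completion rate > 90%) are simple ratios of counts. A minimal sketch of the arithmetic follows; `rate_pct` and `meets_min_rate` are illustrative names, and the counts are expected to come from whatever reporting query the team uses.

```shell
# Sketch only: helper names are hypothetical.

# rate_pct PART TOTAL
# Prints PART/TOTAL as a percentage with one decimal; prints 0.0 if TOTAL is 0.
rate_pct() {
  awk -v p="$1" -v t="$2" \
    'BEGIN { if (t == 0) print "0.0"; else printf "%.1f\n", 100 * p / t }'
}

# meets_min_rate PART TOTAL MIN
# Succeeds only when the rate is strictly greater than MIN percent.
meets_min_rate() {
  awk -v p="$1" -v t="$2" -v m="$3" \
    'BEGIN { exit !(t > 0 && 100 * p / t > m) }'
}
```

With 96 completed approvals out of 100 requests, `meets_min_rate 96 100 95` succeeds. With 19 out of 20 the rate is exactly 95%, so the strict "> 95%" criterion is not met.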

Day 7 Comprehensive Review (T+7 days)

Collect Statistics

Email Processing:
- Total emails processed: _____
- Success rate: _____%
- Average processing time: _____ seconds
- Error rate: _____%
AI Suggestions:
- Total suggestions generated: _____
- Acceptance rate: _____%
- Average accuracy: _____%
- Processing time p95: _____ seconds
Approvals:
- Total approval requests: _____
- Total approvals completed: _____
- Approval completion rate: _____%
- Average response time: _____ minutes
- HIGH priority count: _____
Baramundi Jobs:
- Total jobs submitted: _____
- Total jobs completed: _____
- Success rate: _____%
- Failed jobs: _____
Knowledge Base:
- Total KB articles created: _____
- Total articles updated: _____
- Total searches: _____
- Average search response: _____ ms
Performance Analysis
- n8n CPU usage normal: _____ %
- n8n Memory usage normal: _____ MB
- PostgreSQL query time p95: _____ ms
- Database size: _____ GB
- Backup size: _____ GB
Team Feedback Collected
- Operations team feedback: ________
- Support team feedback: ________
- End user feedback: ________
- Issues encountered: ________
- Improvement suggestions: ________
Issue Resolution Status
- All critical issues resolved
- All high priority issues resolved
- Medium priority issues tracked
- Minor issues documented for next release
- Issue tracking document: __________
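
Several fields in this review ask for a p95 value. One common definition is the nearest-rank percentile: sort the samples and take the value at position ceil(0.95 × n). The sketch below implements that definition for one numeric sample per line; it is an assumption that this matches how the monitoring dashboard computes p95, and the function name `p95` is illustrative.

```shell
# Sketch only: nearest-rank 95th percentile, one numeric sample per line.
# Prints nothing and returns non-zero when there is no input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[idx]
    }'
}
```

Feeding it a file of per-ticket processing times, for instance `p95 < processing_times.txt` (a hypothetical file name), yields the value to enter in the form.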

Go-Live Success Criteria - Final Sign-Off

All criteria must be met to declare go-live successful:

Stability (99% uptime minimum)
- System remained operational for 7 consecutive days
- Unplanned downtime < 100.8 minutes total (1% of 7 days)
- All services restarted cleanly without issues
Functionality (100% requirements met)
- Mail processing working correctly
- AI suggestions functional and accurate
- Approval workflow operational
- Baramundi job submission successful
- KB updates functioning
Performance (Acceptable for workload)
- Average email processing < 5 seconds
- Average workflow execution < 10 seconds
- Database queries < 1 second (p95)
- No performance degradation observed
Data Integrity (100% accuracy)
- All processed tickets correctly handled
- No duplicate records
- No data loss or corruption
- Audit trail complete and accurate
Monitoring (All systems active)
- Real-time dashboards operational
- Alerts functioning correctly
- Logs aggregated and searchable
- Performance metrics recorded
Team Readiness (100% trained)
- Operations team fully trained
- Support team fully trained
- All runbooks completed
- On-call schedule established
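
The downtime allowance implied by an uptime target follows directly from the length of the measurement window: allowed minutes = days × 24 × 60 × (100 − uptime%) / 100. A small sketch of that calculation is below; `downtime_budget_minutes` is an illustrative name.

```shell
# Sketch only: downtime_budget_minutes is a hypothetical helper.
# Usage: downtime_budget_minutes DAYS UPTIME_PERCENT
# Prints the allowed downtime in minutes, rounded to one decimal place.
downtime_budget_minutes() {
  awk -v d="$1" -v u="$2" \
    'BEGIN { printf "%.1f\n", d * 24 * 60 * (100 - u) / 100 }'
}
```

For a 99% target, `downtime_budget_minutes 7 99` gives 100.8 minutes over seven days, while `downtime_budget_minutes 1 99` gives 14.4 minutes for a single day.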

Sign-Off By:

Project Manager: _________________ Date: _______

Operations Lead: _________________ Date: _______

Technical Lead: _________________ Date: _______

Ongoing Monitoring (Post Go-Live)

Daily Checks (First 30 Days)

Review system health dashboard
Check backup completion status
Review error logs for new issues
Verify workflow execution metrics
Check database growth rate
Monitor alert frequency and relevance
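
The daily "backup completion" check can be reduced to one question: is there a non-empty dump newer than a day? The following sketch assumes backups are `.sql` files in one directory and that `find` supports `-maxdepth` and `-mmin` (GNU and BSD do); `has_recent_backup` is a hypothetical name.

```shell
# Sketch only: has_recent_backup is an illustrative helper.
# Usage: has_recent_backup DIR [MAX_AGE_MINUTES]
# Succeeds when DIR directly contains at least one non-empty *.sql file
# modified less than MAX_AGE_MINUTES ago (default 1440, i.e. 24 hours).
has_recent_backup() {
  dir=$1
  max_age_min=${2:-1440}
  [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.sql' \
            -mmin -"$max_age_min" -size +0 -print 2>/dev/null | head -n 1)" ]
}
```

A daily job could run `has_recent_backup backups || echo "no backup in the last 24h"` and route the message to the alert channel.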

Weekly Checks (Ongoing)

Generate performance report
Review all system logs
Verify backup restore capability
Update documentation as needed
Team retrospective meeting
Plan for optimization improvements

Monthly Reviews (Ongoing)

Comprehensive system audit
Capacity planning review
Security assessment
Performance optimization review
Team training refresher (as needed)
Update escalation procedures

Contacts and Escalation

Primary Contacts

Project Manager:

Name: _____________________
Phone: _____________________
Email: _____________________

Technical Lead:

Name: _____________________
Phone: _____________________
Email: _____________________

On-Call Engineer:

Name: _____________________
Phone: _____________________
Email: _____________________

Escalation Matrix

Level 1 - Application Issue:

On-call engineer
Response time: 15 minutes

Level 2 - System Down:

Technical lead + On-call engineer
Response time: 5 minutes

Level 3 - Critical Data Loss:

Technical lead + Project manager + Database admin
Response time: Immediate

Related Documentation

DEPLOYMENT.md - Deployment procedures and rollback
MONITORING.md - Monitoring dashboard and alerts
ARCHITECTURE.md - System architecture details
TROUBLESHOOTING.md - Common issues and solutions
