# Go-Live Checklist

## Overview

This checklist ensures a smooth transition from staging to production. Follow all phases sequentially: Pre-Launch, Go-Live Day, Launch Period, and Post-Launch Monitoring.

---

## Phase 1: One Week Before Go-Live

Timeline: T-7 days

### Pre-Deployment Verification

- [ ] **E2E Tests Passed (100%)**
  - All workflow tests successful: `bash tests/curl-test-collection.sh`
  - No critical bugs or failures
  - Test results documented

- [ ] **Staging Environment Verified**
  - Staging deployment identical to production
  - Run full load test (simulate 100+ concurrent tickets)
  - Verify integration with TEST Freescout account
  - Verify integration with TEST Baramundi account
  - All workflows processing correctly in staging

- [ ] **Production Database Setup**
  - PostgreSQL audit schema initialized
  - Backup strategy configured and tested
  - Database performance baseline recorded
  - Disk space verified (minimum 20GB available)

- [ ] **API Credentials Verified**
  - Freescout API key tested and active
  - Freescout custom fields created and working
  - Baramundi API key tested and active
  - n8n encryption key generated and secure

### Team Readiness

- [ ] **Team Training Completed**
  - Operations team trained on:
    - System architecture overview
    - Deployment and rollback procedures
    - Monitoring dashboard usage
    - Alert response procedures
    - Escalation paths
  - Support team trained on:
    - Workflow functionality overview
    - Expected behavior and timing
    - How to verify system health
    - When to escalate issues

- [ ] **Documentation Review**
  - All team members reviewed DEPLOYMENT.md
  - All team members reviewed MONITORING.md
  - Runbooks reviewed and acknowledged
  - Contact list updated (on-call schedule)

- [ ] **Backup Strategy Finalized**
  - Daily backup schedule defined
  - Backup retention policy set (7 days minimum)
  - Backup restore procedure tested
  - Backup storage verified (separate location from production)

### Risk Mitigation

- [ ] **Rollback Plan Confirmed**
  - Rollback procedures documented
  - Rollback tested in staging environment
  - Estimated rollback time: < 30 minutes
  - All team members trained on rollback

- [ ] **Communication Plan Ready**
  - Stakeholder notification list prepared
  - Status page update process defined
  - Internal update frequency established (30-minute intervals initially)
  - Escalation contacts verified

- [ ] **Monitoring & Alerting**
  - All monitoring dashboards configured
  - Alert recipients confirmed
  - Alert thresholds set and validated
  - On-call rotation established

---

## Phase 2: Go-Live Day

Timeline: T-0 (Launch day)

### Pre-Launch Checks (T-2 hours)

- [ ] **Final System Status**
  - All Docker services running and healthy
  - `docker-compose ps` output verified
  - All services show "Up (healthy)"
  - No services in "Restarting" state

- [ ] **Service Health Verification**
  - n8n health check: `curl http://localhost:5678/healthz`
  - PostgreSQL connection: `docker-compose exec postgres pg_isready`
  - Milvus connectivity: Vector DB responding
  - External integrations reachable (Freescout, Baramundi)

- [ ] **Database Integrity**
  - Audit schema verified: `SELECT COUNT(*) FROM audit.workflows;`
  - No corruption or errors in logs
  - Backup created and verified: `ls -lh backups/`
  - Backup restore tested

- [ ] **n8n Workflows Status**
  - All 3 workflows imported successfully
  - Workflow A (Mail Processing): Ready
  - Workflow B (Approval Execution): Ready
  - Workflow C (KB Update): Ready
  - All workflows set to Inactive (will activate after final check)

- [ ] **Monitoring System Active**
  - Monitoring dashboard accessible
  - All metric collectors running
  - Alert system armed and tested
  - Log aggregation working (`docker-compose logs` verified)

- [ ] **Final Pre-Launch Meeting**
  - All team members present and ready
  - Roles and responsibilities confirmed:
    - Platform Lead: Overall coordination
    - n8n Administrator: Workflow management
    - Database Administrator: Database monitoring
    - System Administrator: Infrastructure monitoring
    - Support Lead: User support readiness
  - Communication channels verified (Slack, phone, etc.)

### Launch Window (T-0 hours)

- [ ] **Final Backup (T-15 minutes)**
  - Backup created immediately before activation
  - Backup file verified and tested
  - Backup location: `backups/pre-golive-backup-$(date +%Y%m%d-%H%M%S).sql`

- [ ] **Activate Workflows (T-0 minutes)**
  - n8n Dashboard accessed
  - Workflow A (Mail Processing) activated:
    - Toggle "Active" switch ON
    - Verify activation confirmed in UI
    - Check logs: `docker-compose logs -f n8n | grep "Workflow A"`
  - Workflow B (Approval Execution) activated
  - Workflow C (KB Update) activated
  - All three workflows showing "Active" status

- [ ] **Launch Announcement**
  - Internal team notified: "System is LIVE"
  - Stakeholders notified of go-live
  - Status page updated: "System operational"
  - Time of launch recorded: __________

- [ ] **Confirm System Accepting Requests**
  - Send test email to Freescout inbox
  - Verify ticket created in Freescout
  - Verify n8n workflow triggered (check logs)
  - Verify workflow execution started

---

## Phase 3: Launch Period Monitoring (First 24 Hours)

Timeline: T+0 to T+24 hours

### Continuous Monitoring (Every 15 minutes)

- [ ] **n8n Workflow Execution**
  - Command: `docker-compose logs --tail=50 n8n`
  - Check for:
    - No error messages
    - Workflows executing successfully
    - No hung or stuck executions
  - Log location: `/d/n8n-compose/logs/n8n.log`

- [ ] **Freescout Integration**
  - New tickets arriving in system
  - Custom fields populated correctly
  - No integration errors in Freescout logs
  - Ticket processing speed acceptable

- [ ] **Baramundi Job Queue**
  - Check job queue status
  - Verify jobs accepted from n8n
  - Monitor job completion rate
  - Check for failed jobs

- [ ] **Alert System**
  - All critical alerts functioning
  - No false positive alerts
  - Escalation procedures working
  - On-call team responsive

- [ ] **Database Performance**
  - Query performance acceptable
  - No locks or deadlocks
  - Disk space usage normal
  - Command (requires the `pg_stat_statements` extension): `docker-compose exec postgres psql -U n8n_user -d n8n_production -c "SELECT calls, query FROM pg_stat_statements ORDER BY calls DESC LIMIT 10;"`

### Hourly System Status Report (First 6 hours)

Document every hour:

**Hour 1 (T+1h)**

- Total tickets processed: _____
- Total workflows executed: _____
- Failed executions: _____
- System health: [ ] Green [ ] Yellow [ ] Red
- Issues encountered: _____

**Hour 2 (T+2h)**

- Total tickets processed: _____
- Total workflows executed: _____
- Failed executions: _____
- System health: [ ] Green [ ] Yellow [ ] Red
- Issues encountered: _____

**Hours 3-6 (T+3h to T+6h)**

- Repeat above for each hour
- Escalate any issues immediately
- Document all changes or interventions

### Functional Validation (T+2 hours and T+12 hours)

**After 2 hours:**

- [ ] **AI Suggestions Displayed**
  - Sample processed tickets show AI suggestions
  - Suggestion accuracy acceptable
  - Custom field updated with ai_suggestion
  - Performance acceptable (< 5 second processing time)

- [ ] **Approval Workflow Operating**
  - HIGH priority tickets flagged for approval
  - Approval custom field populated
  - Notifications sent to approvers
  - Approvals received and reflected in system

- [ ] **Knowledge Base Updates**
  - KB articles being created/updated
  - Vector embeddings generated (Milvus)
  - PostgreSQL KB table growing
  - Query: `SELECT COUNT(*) FROM audit.kb_articles;`

**After 12 hours (overnight validation):**

- [ ] **Validate Overnight Processing**
  - All workflows executed correctly overnight
  - No race conditions or deadlocks occurred
  - Database backups completed successfully
  - All alerts functioned as expected

### Critical Metrics (Monitor Continuously)

```bash
# Check n8n workflow execution rate (count of the 100 most recent executions).
# The URL is quoted so the shell does not treat "?" as a glob character.
# The n8n public API returns the list under the "data" key.
curl -s -H "X-N8N-API-KEY: $N8N_API_KEY" \
  "http://localhost:5678/api/v1/executions?limit=100" | jq '.data | length'

# Check database growth (10 largest tables)
docker exec n8n-postgres psql -U n8n_user -d n8n_production -c \
  "SELECT schemaname, tablename,
          pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
   FROM pg_tables
   ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
   LIMIT 10;"

# Monitor CPU/Memory (single snapshot)
docker stats --no-stream
```

### Incident Response (If Issues Occur)

**Critical Issue (System Down):**

1. Immediately notify team lead
2. Assess severity and scope
3. Execute rollback if necessary (see DEPLOYMENT.md)
4. Document incident details
5. Begin root cause analysis

**Performance Degradation:**

1. Check system resources: `docker stats`
2. Check database locks: `docker-compose logs postgres | grep -i lock`
3. Scale resources if needed: `docker-compose up -d --scale n8n=2`
4. Monitor improvement

**Integration Failures:**

1. Verify API credentials still valid
2. Check external service status
3. Review integration logs
4. Test connectivity manually
5. Retry or escalate

---

## Phase 4: Post-24 Hour Validation (T+24 to T+7 days)

### Day 2 Validation (T+24 hours)

- [ ] **Verify AI Suggestions Working**
  - Sample 10 random processed tickets
  - AI suggestions present and relevant
  - Suggestion accuracy rate > 80%
  - Processing time < 5 seconds average
  - Document findings: ________

- [ ] **Approval Workflow Performance**
  - [ ] All HIGH priority tickets flagged
  - [ ] Approval response time < 2 hours
  - [ ] Approval completion rate > 95%
  - [ ] No pending approvals > 4 hours old
  - Total approvals processed: _____
  - Approval success rate: _____%

- [ ] **Baramundi Integration Validation**
  - [ ] Jobs submitted successfully
  - [ ] Job queue processing normally
  - [ ] Job completion rate > 90%
  - [ ] No stuck or failed jobs
  - Total jobs processed: _____
  - Job success rate: _____%

- [ ] **Knowledge Base Growth**
  - [ ] KB articles being created
  - [ ] Vector embeddings calculated
  - [ ] Query performance acceptable
  - Total KB articles: _____
  - Total embeddings: _____
  - Query response time: _____ ms

- [ ] **System Stability**
  - [ ] No service crashes
  - [ ] No memory leaks
  - [ ] Disk usage normal
  - [ ] Database integrity verified
  - [ ] No orphaned records

### Day 7 Comprehensive Review (T+7 days)

- [ ] **Collect Statistics**

  **Email Processing:**
  - Total emails processed: _____
  - Success rate: _____%
  - Average processing time: _____ seconds
  - Error rate: _____%

  **AI Suggestions:**
  - Total suggestions generated: _____
  - Acceptance rate: _____%
  - Average accuracy: _____%
  - Processing time p95: _____ seconds

  **Approvals:**
  - Total approval requests: _____
  - Total approvals completed: _____
  - Approval completion rate: _____%
  - Average response time: _____ minutes
  - HIGH priority count: _____

  **Baramundi Jobs:**
  - Total jobs submitted: _____
  - Total jobs completed: _____
  - Success rate: _____%
  - Failed jobs: _____

  **Knowledge Base:**
  - Total KB articles created: _____
  - Total articles updated: _____
  - Total searches: _____
  - Average search response: _____ ms

- [ ] **Performance Analysis**
  - [ ] n8n CPU usage normal: _____ %
  - [ ] n8n Memory usage normal: _____ MB
  - [ ] PostgreSQL query time p95: _____ ms
  - [ ] Database size: _____ GB
  - [ ] Backup size: _____ GB

- [ ] **Team Feedback Collected**
  - [ ] Operations team feedback: ________
  - [ ] Support team feedback: ________
  - [ ] End user feedback: ________
  - [ ] Issues encountered: ________
  - [ ] Improvement suggestions: ________

- [ ] **Issue Resolution Status**
  - [ ] All critical issues resolved
  - [ ] All high priority issues resolved
  - [ ] Medium priority issues tracked
  - [ ] Minor issues documented for next release
  - Issue tracking document: __________

### Go-Live Success Criteria - Final Sign-Off

All criteria must be met to declare go-live successful:

- [ ] **Stability (99% uptime minimum)**
  - System remained operational for 7 consecutive days
  - Unplanned downtime < 100.8 minutes total (1% of the 7-day window, i.e. 14.4 minutes per day on average)
  - All services restarted cleanly without issues

- [ ] **Functionality (100% requirements met)**
  - Mail processing working correctly
  - AI suggestions functional and accurate
  - Approval workflow operational
  - Baramundi job submission successful
  - KB updates functioning

- [ ] **Performance (Acceptable for workload)**
  - Average email processing < 5 seconds
  - Average workflow execution < 10 seconds
  - Database queries < 1 second (p95)
  - No performance degradation observed

- [ ] **Data Integrity (100% accuracy)**
  - All processed tickets correctly handled
  - No duplicate records
  - No data loss or corruption
  - Audit trail complete and accurate

- [ ] **Monitoring (All systems active)**
  - Real-time dashboards operational
  - Alerts functioning correctly
  - Logs aggregated and searchable
  - Performance metrics recorded

- [ ] **Team Readiness (100% trained)**
  - Operations team fully trained
  - Support team fully trained
  - All runbooks completed
  - On-call schedule established

**Sign-Off By:**

Project Manager: _________________ Date: _______

Operations Lead: _________________ Date: _______

Technical Lead: _________________ Date: _______

---

## Ongoing Monitoring (Post Go-Live)

### Daily Checks (First 30 Days)

- [ ] Review system health dashboard
- [ ] Check backup completion status
- [ ] Review error logs for new issues
- [ ] Verify workflow execution metrics
- [ ] Check database growth rate
- [ ] Monitor alert frequency and relevance

### Weekly Checks (Ongoing)

- [ ] Generate performance report
- [ ] Review all system logs
- [ ] Verify backup restore capability
- [ ] Update documentation as needed
- [ ] Team retrospective meeting
- [ ] Plan for optimization improvements

### Monthly Reviews (Ongoing)

- [ ] Comprehensive system audit
- [ ] Capacity planning review
- [ ] Security assessment
- [ ] Performance optimization review
- [ ] Team training refresher (as needed)
- [ ] Update escalation procedures

---

## Contacts and Escalation

### Primary Contacts

**Project Manager:**
- Name: _____________________
- Phone: _____________________
- Email: _____________________

**Technical Lead:**
- Name: _____________________
- Phone: _____________________
- Email: _____________________

**On-Call Engineer:**
- Name: _____________________
- Phone: _____________________
- Email: _____________________

### Escalation Matrix
**Level 1 - Application Issue:**
- On-call engineer
- Response time: 15 minutes

**Level 2 - System Down:**
- Technical lead + On-call engineer
- Response time: 5 minutes

**Level 3 - Critical Data Loss:**
- Technical lead + Project manager + Database admin
- Response time: Immediate

---

## Related Documentation

- [DEPLOYMENT.md](DEPLOYMENT.md) - Deployment procedures and rollback
- [MONITORING.md](MONITORING.md) - Monitoring dashboard and alerts
- [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture details
- [TROUBLESHOOTING.md](TROUBLESHOOTING.md) - Common issues and solutions
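
The Stability criterion in the final sign-off pairs an uptime percentage with a downtime budget. As a quick check of that arithmetic, the following sketch computes the unplanned downtime allowed by an uptime target over a review window. The function name `downtime_budget` and its argument order are illustrative assumptions, not part of this project's tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: allowed downtime in minutes for an uptime target.
# Usage: downtime_budget <uptime_percent> <window_days>
downtime_budget() {
  local uptime_pct="$1" window_days="$2"
  # (100 - uptime)% of the window, converted from days to minutes.
  # LC_ALL=C keeps the decimal separator a dot regardless of locale.
  LC_ALL=C awk -v u="$uptime_pct" -v d="$window_days" \
    'BEGIN { printf "%.1f\n", (100 - u) / 100 * d * 24 * 60 }'
}

downtime_budget 99 7   # 7-day sign-off window: prints 100.8
downtime_budget 99 1   # single day: prints 14.4
```

A 99% target therefore works out to 14.4 minutes per day, or 100.8 minutes across the full 7-day window. Tighter targets can be checked the same way before they are written into the sign-off criteria.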