# MEV Bot - Production Deployment Runbook

**Version:** 1.0  
**Last Updated:** October 28, 2025  
**Audience:** DevOps, Production Engineers

---

## Table of Contents

1. [Pre-Deployment Checklist](#pre-deployment-checklist)
2. [Environment Setup](#environment-setup)
3. [Configuration](#configuration)
4. [Deployment Steps](#deployment-steps)
5. [Post-Deployment Validation](#post-deployment-validation)
6. [Monitoring & Alerting](#monitoring--alerting)
7. [Rollback Procedures](#rollback-procedures)
8. [Troubleshooting](#troubleshooting)

---

## Pre-Deployment Checklist

### Code Readiness

- [ ] All tests passing (`make test`)
- [ ] Security audit completed and issues addressed
- [ ] Code review approved
- [ ] 24-hour validation test completed successfully
- [ ] Performance benchmarks meet targets
- [ ] No critical TODOs in codebase

### Infrastructure Readiness

- [ ] RPC endpoints configured and tested
- [ ] Private key/wallet funded with gas (minimum 0.1 ETH)
- [ ] Monitoring systems operational
- [ ] Alert channels configured (Slack, email, PagerDuty)
- [ ] Backup RPC endpoints ready
- [ ] Database/storage systems ready

### Team Readiness

- [ ] On-call engineer assigned
- [ ] Runbook reviewed by team
- [ ] Communication channels established
- [ ] Rollback plan understood
- [ ] Emergency contacts documented

---

## Environment Setup

### System Requirements

**Minimum:**

- CPU: 4 cores
- RAM: 8 GB
- Disk: 50 GB SSD
- Network: 100 Mbps, low latency

**Recommended (Production):**

- CPU: 8 cores
- RAM: 16 GB
- Disk: 100 GB NVMe SSD
- Network: 1 Gbps, < 20 ms latency to Arbitrum RPC

### Dependencies

```bash
# Install Go 1.24+ (release archives carry the full patch version, e.g. go1.24.0)
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Verify installation
go version  # Should show go1.24 or later

# Install build tools (jq is used by the health check in Phase 3)
sudo apt-get update
sudo apt-get install -y build-essential git curl jq
```

### Repository Setup

```bash
# Clone repository (later steps assume the checkout lives at /opt/mev-bot)
git clone https://github.com/your-org/mev-beta.git
cd mev-beta

# Checkout production branch
git checkout feature/production-profit-optimization

# Verify correct branch
git log -1 --oneline

# Install dependencies
go mod download
go mod verify
```

---

## Configuration

### 1. Environment Variables

Create `/etc/systemd/system/mev-bot.env`:

```bash
# RPC Configuration
ARBITRUM_RPC_ENDPOINT=https://arbitrum-mainnet.core.chainstack.com/YOUR_KEY
ARBITRUM_WS_ENDPOINT=wss://arbitrum-mainnet.core.chainstack.com/YOUR_KEY

# Backup RPC (fallback)
BACKUP_RPC_ENDPOINT=https://arb1.arbitrum.io/rpc

# Application Configuration
LOG_LEVEL=info
LOG_FORMAT=json
LOG_OUTPUT=/var/log/mev-bot/mev_bot.log

# Metrics & Monitoring
METRICS_ENABLED=true
METRICS_PORT=9090

# Security
MEV_BOT_ENCRYPTION_KEY=your-32-char-encryption-key-here-minimum-length-required

# Execution Configuration (IMPORTANT: Set to false for detection-only mode)
# Comments are kept on their own lines: systemd only ignores lines that start
# with "#" or ";", so a trailing comment would become part of the value.
EXECUTION_ENABLED=false
# 1 ETH in wei
MAX_POSITION_SIZE=1000000000000000000
# 0.05 ETH in wei
MIN_PROFIT_THRESHOLD=50000000000000000

# Provider Configuration
PROVIDER_CONFIG_PATH=/opt/mev-bot/config/providers_runtime.yaml
```

**CRITICAL:** Never commit `.env` files with real credentials to version control!

### 2. Provider Configuration

Edit `config/providers_runtime.yaml`:

```yaml
providers:
  - name: "chainstack-primary"
    endpoint: "${ARBITRUM_RPC_ENDPOINT}"
    type: "https"
    weight: 100
    timeout: 30s
    rateLimit: 100

  - name: "chainstack-websocket"
    endpoint: "${ARBITRUM_WS_ENDPOINT}"
    type: "wss"
    weight: 90
    timeout: 30s
    rateLimit: 100

  - name: "public-fallback"
    endpoint: "https://arb1.arbitrum.io/rpc"
    type: "https"
    weight: 50
    timeout: 30s
    rateLimit: 50

pooling:
  maxIdleConnections: 10
  maxOpenConnections: 50
  connectionTimeout: 30s
  idleTimeout: 300s

retry:
  maxRetries: 3
  retryDelay: 1s
  backoffMultiplier: 2
  maxBackoff: 8s
```

### 3. Systemd Service Configuration

Create `/etc/systemd/system/mev-bot.service`:

```ini
[Unit]
Description=MEV Arbitrage Bot
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=mev-bot
Group=mev-bot
WorkingDirectory=/opt/mev-bot
EnvironmentFile=/etc/systemd/system/mev-bot.env
ExecStart=/opt/mev-bot/bin/mev-bot start
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
MemoryMax=4G
CPUQuota=400%

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/mev-bot /opt/mev-bot/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mev-bot

[Install]
WantedBy=multi-user.target
```

---

## Deployment Steps

### Phase 1: Build & Prepare (10-15 minutes)

```bash
# 1. Build binary (back up any existing binary first; Quick Rollback restores it)
cd /opt/mev-bot
if [ -f bin/mev-bot ]; then cp bin/mev-bot bin/mev-bot.backup; fi
make build

# Verify binary
./bin/mev-bot --version
# Expected: MEV Bot v1.0.0 (or similar)

# 2. Run tests
make test  # Ensure all tests pass

# 3. Check binary size and dependencies
ls -lh bin/mev-bot
ldd bin/mev-bot  # Should show minimal dependencies

# 4. Create necessary directories (assumes the mev-bot user and group already exist)
sudo mkdir -p /var/log/mev-bot
sudo mkdir -p /opt/mev-bot/data
sudo chown -R mev-bot:mev-bot /var/log/mev-bot /opt/mev-bot/data

# 5. Set permissions
chmod +x bin/mev-bot
sudo chmod 600 /etc/systemd/system/mev-bot.env  # Protect sensitive config
```

### Phase 2: Dry Run (5-10 minutes)

```bash
# Run the bot in the background to verify configuration.
# Note: the systemd EnvironmentFile is not loaded outside systemd; if the bot
# reads its settings from the environment, pass them through explicitly.
sudo -u mev-bot /opt/mev-bot/bin/mev-bot start &
BOT_PID=$!

# Wait 2 minutes for initialization
sleep 120

# Check if running
pgrep -a mev-bot

# Check logs for errors
tail -100 /var/log/mev-bot/mev_bot.log | grep -i error

# Verify RPC connection
tail -100 /var/log/mev-bot/mev_bot.log | grep -i "connected"

# Stop dry run (BOT_PID is the sudo wrapper, which relays the signal to the bot)
sudo kill "$BOT_PID"
```

### Phase 3: Production Start (5 minutes)

```bash
# 1. Reload systemd
sudo systemctl daemon-reload

# 2. Enable service (start on boot)
sudo systemctl enable mev-bot

# 3. Start service
sudo systemctl start mev-bot

# 4. Verify status
sudo systemctl status mev-bot
# Expected: active (running)

# 5. Check logs (add -f to follow; Ctrl+C to stop following)
sudo journalctl -u mev-bot --no-pager --lines=50

# 6. Wait for initialization (30-60 seconds)
sleep 60

# 7. Verify healthy operation
curl -s http://localhost:9090/health/live | jq .
# Expected: {"status": "healthy"}
```

### Phase 4: Validation (15-30 minutes)

```bash
# 1. Monitor for opportunities (Ctrl+C to stop)
tail -f /var/log/mev-bot/mev_bot.log | grep "ARBITRAGE OPPORTUNITY"

# 2. Check metrics endpoint
curl -s http://localhost:9090/metrics | grep mev_

# 3. Verify cache performance
tail -100 /var/log/mev-bot/mev_bot.log | grep "cache metrics"
# Look for hit rate 75-85%

# 4. Check for errors
sudo journalctl -u mev-bot --since "10 minutes ago" | grep ERROR
# Should have minimal errors

# 5. Monitor resource usage
htop  # Check CPU and memory
# CPU should be 50-80%, Memory < 2GB

# 6. Test failover (optional)
# Temporarily block primary RPC, verify fallback works
```

---

## Post-Deployment Validation

### Health Checks

```bash
# Liveness probe (should return 200)
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED"

# Readiness probe (should return 200)
curl -f http://localhost:9090/health/ready || echo "READINESS FAILED"

# Startup probe (should return 200 after initialization)
curl -f http://localhost:9090/health/startup || echo "STARTUP FAILED"
```

### Performance Metrics

```bash
# Check Prometheus metrics
curl -s http://localhost:9090/metrics | grep -E "mev_(opportunities|executions|profit|cache|rpc)"

# Expected metrics:
# - mev_opportunities_detected{}
# - mev_opportunities_profitable{}
# - mev_cache_hit_rate{} 0.75-0.85
# - mev_rpc_calls_total{}
```

### Log Analysis

```bash
# Analyze last hour of logs
./scripts/log-manager.sh analyze

# Check health score (target: > 90)
./scripts/log-manager.sh health

# Expected output:
# Health Score: 95.5/100 (Excellent)
# Error Rate: < 5%
# Cache Hit Rate: 75-85%
```

---

## Monitoring & Alerting

### Key Metrics to Monitor

| Metric | Threshold | Action |
|--------|-----------|--------|
| CPU Usage | > 90% | Scale up or investigate |
| Memory Usage | > 85% | Potential memory leak |
| Error Rate | > 10% | Check logs, may need rollback |
| RPC Failures | > 5/min | Check RPC provider |
| Opportunities/hour | < 1 | May indicate detection issue |
| Cache Hit Rate | < 70% | Review cache configuration |

### Alert Configuration

**Slack Webhook** (edit in `config/alerts.yaml`):

```yaml
alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#mev-bot-alerts"

  thresholds:
    error_rate: 0.10    # 10%
    cpu_usage: 0.90     # 90%
    memory_usage: 0.85  # 85%
    min_opportunities_per_hour: 1
```

### Monitoring Commands

```bash
# Real-time monitoring
watch -n 5 'systemctl status mev-bot && curl -s http://localhost:9090/metrics | grep mev_'

# Start monitoring daemon (background)
./scripts/log-manager.sh start-daemon

# View operations dashboard
./scripts/log-manager.sh dashboard
# Opens HTML dashboard in browser
```

---

## Rollback Procedures

### Quick Rollback (< 5 minutes)

```bash
# 1. Stop current version
sudo systemctl stop mev-bot

# 2. Restore previous binary
sudo cp /opt/mev-bot/bin/mev-bot.backup /opt/mev-bot/bin/mev-bot

# 3. Restart service
sudo systemctl start mev-bot

# 4. Verify rollback
sudo systemctl status mev-bot
tail -100 /var/log/mev-bot/mev_bot.log
```

### Full Rollback (< 15 minutes)

```bash
# 1. Stop service
sudo systemctl stop mev-bot

# 2. Checkout previous version
cd /opt/mev-bot
git fetch
git checkout <previous-version-ref>  # replace with the last known-good tag or commit

# 3. Rebuild
make build

# 4. Restart service
sudo systemctl start mev-bot

# 5. Validate
curl -f http://localhost:9090/health/live
```

---

## Troubleshooting

### Common Issues

#### Issue: Bot fails to start

**Symptoms:**

```
systemctl status mev-bot
● mev-bot.service - MEV Arbitrage Bot
   Loaded: loaded
   Active: failed (Result: exit-code)
```

**Diagnosis:**

```bash
# Check logs
sudo journalctl -u mev-bot -n 100 --no-pager

# Common causes:
# 1. Missing environment variables
# 2. Invalid RPC endpoint
# 3. Permission issues
```

**Solution:**

```bash
# Verify environment file (readable by root only after chmod 600)
sudo cat /etc/systemd/system/mev-bot.env

# Test RPC connection manually (read the endpoint from the environment file first)
ARBITRUM_RPC_ENDPOINT=$(sudo grep '^ARBITRUM_RPC_ENDPOINT=' /etc/systemd/system/mev-bot.env | cut -d= -f2-)
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"

# Fix permissions
sudo chown -R mev-bot:mev-bot /opt/mev-bot
```

---

#### Issue: High error rate

**Symptoms:**

```
[ERROR] Failed to fetch pool state
[ERROR] RPC call failed
[ERROR] 429 Too Many Requests
```

**Diagnosis:**

```bash
# Check error rate
./scripts/log-manager.sh analyze | grep "Error Rate"

# Check RPC provider status (ARBITRUM_RPC_ENDPOINT must be set in the current shell)
curl -s "$ARBITRUM_RPC_ENDPOINT"
```

**Solution:**

```bash
# 1. Enable backup RPC endpoint in config
# 2. Reduce rate limits
# 3. Contact RPC provider
# 4. Switch to different provider
```

---

#### Issue: No opportunities detected

**Symptoms:**

```
Blocks processed: 10000
Opportunities detected: 0
```

**Diagnosis:**

```bash
# Check if events are being detected
tail -100 /var/log/mev-bot/mev_bot.log | grep "processing.*event"

# Check profit thresholds
sudo grep MIN_PROFIT_THRESHOLD /etc/systemd/system/mev-bot.env
```

**Solution:**

```bash
# 1. Lower MIN_PROFIT_THRESHOLD (carefully!)
# 2. Check market conditions (volatility)
# 3. Verify DEX integrations working
# 4. Review price impact thresholds
```

---

#### Issue: Memory leak

**Symptoms:**

```
Memory usage increasing over time
OOM killer may terminate process
```

**Diagnosis:**

```bash
# Monitor memory over time
watch -n 10 'ps aux | grep mev-bot | grep -v grep'

# Generate heap profile
curl http://localhost:9090/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```

**Solution:**

```bash
# 1. Restart service (temporary fix)
sudo systemctl restart mev-bot

# 2. Investigate with profiler
# 3. Check for goroutine leaks (URL quoted so the shell does not treat "?" as a glob)
curl "http://localhost:9090/debug/pprof/goroutine?debug=1"

# 4. May need code fix and redeploy
```

---

## Emergency Contacts

| Role | Name | Contact | Availability |
|------|------|---------|--------------|
| On-Call Engineer | TBD | +1-XXX-XXX-XXXX | 24/7 |
| DevOps Lead | TBD | Slack: @devops | Business hours |
| Product Owner | TBD | Email: product@company.com | Business hours |

## Change Log

| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2025-10-28 | 1.0 | Initial runbook | Claude Code |

---

**END OF RUNBOOK**

**Remember:**

1. Always test in staging first
2. Have rollback plan ready
3. Monitor closely after deployment
4. Document any issues encountered
5. Keep this runbook updated
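
The alert thresholds in the Key Metrics to Monitor table can also be checked from a script, which may help during the validation window before alerting is fully wired up. The following is a minimal sketch, not part of the bot or its tooling: the function name `check_threshold` and the metric keys are hypothetical, values are percentages or counts exactly as written in the table, and collecting each reading is left to the caller.

```bash
#!/usr/bin/env bash
# Sketch: classify one metric reading against the runbook's alert thresholds.
# The function name and metric keys are illustrative assumptions; the limits
# and actions are copied from the "Key Metrics to Monitor" table.

# Usage: check_threshold METRIC VALUE
# Prints "ALERT: <action>" and returns 1 when the threshold is breached,
# prints "OK" and returns 0 otherwise, returns 2 for an unknown metric.
check_threshold() {
  local metric="$1" value="$2" op limit action
  case "$metric" in
    cpu_usage)              op=">"; limit=90; action="Scale up or investigate" ;;
    memory_usage)           op=">"; limit=85; action="Potential memory leak" ;;
    error_rate)             op=">"; limit=10; action="Check logs, may need rollback" ;;
    rpc_failures_per_min)   op=">"; limit=5;  action="Check RPC provider" ;;
    opportunities_per_hour) op="<"; limit=1;  action="May indicate detection issue" ;;
    cache_hit_rate)         op="<"; limit=70; action="Review cache configuration" ;;
    *) echo "UNKNOWN metric: $metric" >&2; return 2 ;;
  esac

  # awk does the numeric comparison so fractional readings (e.g. 92.5) work;
  # it exits 0 when the threshold is breached.
  if awk -v v="$value" -v l="$limit" -v op="$op" \
       'BEGIN { exit !((op == ">" && v > l) || (op == "<" && v < l)) }'; then
    echo "ALERT: $action"
    return 1
  fi
  echo "OK"
}
```

For example, `check_threshold cache_hit_rate 62` prints `ALERT: Review cache configuration` and returns 1, so the call can gate a notification step in a cron job or wrapper script.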