mev-beta/docs/PRODUCTION_DEPLOYMENT_RUNBOOK.md
Krypto Kajun 0cbbd20b5b feat(optimization): add pool detection, price impact validation, and production infrastructure
This commit adds critical production-ready optimizations and infrastructure:

New Features:

1. Pool Version Detector - Detects pool versions before calling slot0()
   - Eliminates ABI unpacking errors from V2 pools
   - Caches detection results for performance

2. Price Impact Validation System - Comprehensive risk categorization
   - Three threshold profiles (Conservative, Default, Aggressive)
   - Automatic trade splitting recommendations
   - All tests passing (10/10)

3. Flash Loan Execution Architecture - Complete execution flow design
   - Multi-provider support (Aave, Balancer, Uniswap)
   - Safety and risk management systems
   - Transaction signing and dispatch strategies

4. 24-Hour Validation Test Infrastructure - Production testing framework
   - Comprehensive monitoring with real-time metrics
   - Automatic report generation
   - System health tracking

5. Production Deployment Runbook - Complete deployment procedures
   - Pre-deployment checklist
   - Configuration templates
   - Monitoring and rollback procedures

Files Added:
- pkg/uniswap/pool_detector.go (273 lines)
- pkg/validation/price_impact_validator.go (265 lines)
- pkg/validation/price_impact_validator_test.go (242 lines)
- docs/architecture/flash_loan_execution_architecture.md (808 lines)
- docs/PRODUCTION_DEPLOYMENT_RUNBOOK.md (615 lines)
- scripts/24h-validation-test.sh (352 lines)

Testing: Core functionality tests passing. Stress test showing 867 TPS (below 1000 TPS target - to be investigated)

Impact: Ready for 24-hour validation test and production deployment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 21:33:30 -05:00


# MEV Bot - Production Deployment Runbook
**Version:** 1.0
**Last Updated:** October 28, 2025
**Audience:** DevOps, Production Engineers
---
## Table of Contents
1. [Pre-Deployment Checklist](#pre-deployment-checklist)
2. [Environment Setup](#environment-setup)
3. [Configuration](#configuration)
4. [Deployment Steps](#deployment-steps)
5. [Post-Deployment Validation](#post-deployment-validation)
6. [Monitoring & Alerting](#monitoring--alerting)
7. [Rollback Procedures](#rollback-procedures)
8. [Troubleshooting](#troubleshooting)
---
## Pre-Deployment Checklist
### Code Readiness
- [ ] All tests passing (`make test`)
- [ ] Security audit completed and issues addressed
- [ ] Code review approved
- [ ] 24-hour validation test completed successfully
- [ ] Performance benchmarks meet targets
- [ ] No critical TODOs in codebase
### Infrastructure Readiness
- [ ] RPC endpoints configured and tested
- [ ] Private key/wallet funded with gas (minimum 0.1 ETH)
- [ ] Monitoring systems operational
- [ ] Alert channels configured (Slack, email, PagerDuty)
- [ ] Backup RPC endpoints ready
- [ ] Database/storage systems ready
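
The 0.1 ETH funding minimum can be checked mechanically against the hex value returned by the standard `eth_getBalance` JSON-RPC method. The following is a minimal sketch, not part of the bot: the function name `has_min_gas_balance` is hypothetical, and it compares hex strings rather than integers because balances above roughly 9.2 ETH overflow bash's signed 64-bit arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if a hex wei balance (as returned by
# eth_getBalance) is at least 0.1 ETH.
MIN_BALANCE_HEX="16345785d8a0000"   # 0.1 ETH = 10^17 wei

has_min_gas_balance() {
  local bal
  bal=$(printf '%s' "${1#0x}" | tr 'A-F' 'a-f')   # strip 0x, lowercase
  # Strip leading zeros so that string length reflects magnitude
  while [[ ${#bal} -gt 1 && $bal == 0* ]]; do bal=${bal#0}; done
  if (( ${#bal} != ${#MIN_BALANCE_HEX} )); then
    (( ${#bal} > ${#MIN_BALANCE_HEX} ))           # longer hex = larger value
    return
  fi
  local LC_ALL=C                                   # byte-wise comparison
  [[ ! $bal < $MIN_BALANCE_HEX ]]
}

# Usage: has_min_gas_balance "$balance_hex" || echo "wallet underfunded"
```

Equal-length hex strings compare correctly as text because digits sort before the letters `a` to `f`.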
### Team Readiness
- [ ] On-call engineer assigned
- [ ] Runbook reviewed by team
- [ ] Communication channels established
- [ ] Rollback plan understood
- [ ] Emergency contacts documented
---
## Environment Setup
### System Requirements
**Minimum:**
- CPU: 4 cores
- RAM: 8 GB
- Disk: 50 GB SSD
- Network: 100 Mbps, low latency
**Recommended (Production):**
- CPU: 8 cores
- RAM: 16 GB
- Disk: 100 GB NVMe SSD
- Network: 1 Gbps, < 20ms latency to Arbitrum RPC
### Dependencies
```bash
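# Install Go 1.24+ (release archives use the full version number, e.g. 1.24.0)
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go # Remove any previous installation first
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz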
export PATH=$PATH:/usr/local/go/bin
# Verify installation
go version # Should show go1.24 or later
# Install build tools
sudo apt-get update
sudo apt-get install -y build-essential git curl
```
### Repository Setup
```bash
# Clone repository
git clone https://github.com/your-org/mev-beta.git
cd mev-beta
# Checkout production branch
git checkout feature/production-profit-optimization
# Verify correct branch and latest commit
git branch --show-current
git log -1 --oneline
# Install dependencies
go mod download
go mod verify
```
---
## Configuration
### 1. Environment Variables
Create `/etc/systemd/system/mev-bot.env`:
```bash
# RPC Configuration
ARBITRUM_RPC_ENDPOINT=https://arbitrum-mainnet.core.chainstack.com/YOUR_KEY
ARBITRUM_WS_ENDPOINT=wss://arbitrum-mainnet.core.chainstack.com/YOUR_KEY
# Backup RPC (fallback)
BACKUP_RPC_ENDPOINT=https://arb1.arbitrum.io/rpc
# Application Configuration
LOG_LEVEL=info
LOG_FORMAT=json
LOG_OUTPUT=/var/log/mev-bot/mev_bot.log
# Metrics & Monitoring
METRICS_ENABLED=true
METRICS_PORT=9090
# Security
MEV_BOT_ENCRYPTION_KEY=your-32-char-encryption-key-here-minimum-length-required
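# Execution Configuration (IMPORTANT: Set to false for detection-only mode)
EXECUTION_ENABLED=false
# Keep comments on their own lines: systemd's EnvironmentFile= parser
# treats a trailing "# ..." as part of the value.
# MAX_POSITION_SIZE: 1 ETH in wei
MAX_POSITION_SIZE=1000000000000000000
# MIN_PROFIT_THRESHOLD: 0.05 ETH in wei
MIN_PROFIT_THRESHOLD=50000000000000000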
# Provider Configuration
PROVIDER_CONFIG_PATH=/opt/mev-bot/config/providers_runtime.yaml
```
**CRITICAL:** Never commit `.env` files with real credentials to version control!
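
Before starting the service, it can help to confirm that the environment file defines the variables the template lists and that no placeholder values remain. The sketch below is an illustration only: `check_env_file` is a hypothetical helper, and the list of required keys is an assumption drawn from the template above, not from the bot's source.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the environment file.
# Prints one line per problem and returns non-zero if any were found.
check_env_file() {
  local file=$1 key status=0
  # Keys taken from the template above; adjust to your deployment.
  local required=(ARBITRUM_RPC_ENDPOINT ARBITRUM_WS_ENDPOINT
                  MEV_BOT_ENCRYPTION_KEY EXECUTION_ENABLED)
  for key in "${required[@]}"; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: ${key}"
      status=1
    fi
  done
  # Placeholders left over from the template
  if grep -Eq 'YOUR_KEY|your-32-char' "$file"; then
    echo "placeholder value still present"
    status=1
  fi
  return "$status"
}

# Usage: sudo bash -c 'source ./check.sh; check_env_file /etc/systemd/system/mev-bot.env'
```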
### 2. Provider Configuration
Edit `config/providers_runtime.yaml`:
```yaml
providers:
  - name: "chainstack-primary"
    endpoint: "${ARBITRUM_RPC_ENDPOINT}"
    type: "https"
    weight: 100
    timeout: 30s
    rateLimit: 100
  - name: "chainstack-websocket"
    endpoint: "${ARBITRUM_WS_ENDPOINT}"
    type: "wss"
    weight: 90
    timeout: 30s
    rateLimit: 100
  - name: "public-fallback"
    endpoint: "https://arb1.arbitrum.io/rpc"
    type: "https"
    weight: 50
    timeout: 30s
    rateLimit: 50
pooling:
  maxIdleConnections: 10
  maxOpenConnections: 50
  connectionTimeout: 30s
  idleTimeout: 300s
retry:
  maxRetries: 3
  retryDelay: 1s
  backoffMultiplier: 2
  maxBackoff: 8s
```
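
To see what the `retry` block implies in practice, the delays can be worked out by hand. The sketch below assumes the common exponential-backoff rule, delay = min(`retryDelay` × `backoffMultiplier`^n, `maxBackoff`); the runbook does not state how the bot actually interprets these fields, so treat the formula and the helper name `backoff_delays` as assumptions.

```bash
#!/usr/bin/env bash
# Print the wait (in whole seconds) before each retry, assuming
# delay_n = min(retryDelay * backoffMultiplier^n, maxBackoff).
backoff_delays() {
  local delay=$1 multiplier=$2 max=$3 retries=$4 out=()
  local i
  for (( i = 0; i < retries; i++ )); do
    (( delay > max )) && delay=$max   # cap at maxBackoff
    out+=("$delay")
    delay=$(( delay * multiplier ))
  done
  echo "${out[*]}"
}

backoff_delays 1 2 8 3   # values from the retry block above; prints "1 2 4"
```

Under that assumption the three retries add at most 7 seconds of waiting, and `maxBackoff: 8s` only takes effect if `maxRetries` is raised to 4 or more.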
### 3. Systemd Service Configuration
Create `/etc/systemd/system/mev-bot.service`:
```ini
[Unit]
Description=MEV Arbitrage Bot
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=mev-bot
Group=mev-bot
WorkingDirectory=/opt/mev-bot
EnvironmentFile=/etc/systemd/system/mev-bot.env
ExecStart=/opt/mev-bot/bin/mev-bot start
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=10s
# Resource limits
LimitNOFILE=65536
MemoryMax=4G
CPUQuota=400%
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/mev-bot /opt/mev-bot/data
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mev-bot
[Install]
WantedBy=multi-user.target
```
---
## Deployment Steps
### Phase 1: Build & Prepare (10-15 minutes)
```bash
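# 1. Back up the current binary (restored by Quick Rollback), then build
cd /opt/mev-bot
[ -f bin/mev-bot ] && cp -p bin/mev-bot bin/mev-bot.backup
make build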
# Verify binary
./bin/mev-bot --version
# Expected: MEV Bot v1.0.0 (or similar)
# 2. Run tests
make test
# Ensure all tests pass
# 3. Check binary size and dependencies
ls -lh bin/mev-bot
ldd bin/mev-bot # Should show minimal dependencies
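# 4. Create the service user (if missing) and necessary directories
id mev-bot >/dev/null 2>&1 || sudo useradd --system --no-create-home --shell /usr/sbin/nologin mev-bot
sudo mkdir -p /var/log/mev-bot
sudo mkdir -p /opt/mev-bot/data
sudo chown -R mev-bot:mev-bot /var/log/mev-bot /opt/mev-bot/data
# 5. Set permissions
chmod +x bin/mev-bot
sudo chmod 600 /etc/systemd/system/mev-bot.env # Protect sensitive config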
```
### Phase 2: Dry Run (5-10 minutes)
```bash
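# Run the bot in the background to verify the configuration.
# EnvironmentFile= only applies under systemd, so load the variables explicitly.
sudo -v # Cache sudo credentials; a backgrounded sudo cannot prompt
sudo bash -c 'set -a; . /etc/systemd/system/mev-bot.env; set +a
  exec runuser -u mev-bot -- /opt/mev-bot/bin/mev-bot start' &
# Wait 2 minutes for initialization
sleep 120
# Check if running
pgrep -u mev-bot -a mev-bot
# Check logs for errors
tail -100 /var/log/mev-bot/mev_bot.log | grep -i error
# Verify RPC connection
tail -100 /var/log/mev-bot/mev_bot.log | grep -i "connected"
# Stop dry run ($! would be the sudo wrapper, so signal the bot by name)
sudo pkill -u mev-bot -x mev-bot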
```
### Phase 3: Production Start (5 minutes)
```bash
# 1. Reload systemd
sudo systemctl daemon-reload
# 2. Enable service (start on boot)
sudo systemctl enable mev-bot
# 3. Start service
sudo systemctl start mev-bot
# 4. Verify status
sudo systemctl status mev-bot
# Expected: active (running)
# 5. Check recent logs (add -f to follow; Ctrl-C to stop following)
sudo journalctl -u mev-bot -n 50 --no-pager
# 6. Wait for initialization (30-60 seconds)
sleep 60
# 7. Verify healthy operation
curl -s http://localhost:9090/health/live | jq .
# Expected: {"status": "healthy"}
```
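
A fixed `sleep 60` either waits longer than needed or not long enough. One alternative is to poll the health endpoint until it answers. The helper below is a sketch; `wait_until` is a hypothetical name, and it accepts any command so the same loop can wrap other probes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command every INTERVAL seconds until it
# succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    (( SECONDS >= deadline )) && return 1   # give up after TIMEOUT
    sleep "$interval"
  done
}

# Example (not executed here):
#   wait_until 60 2 curl -fsS http://localhost:9090/health/live \
#     || sudo journalctl -u mev-bot -n 50 --no-pager
```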
### Phase 4: Validation (15-30 minutes)
```bash
# 1. Monitor for opportunities (streams until interrupted; press Ctrl-C to continue)
tail -f /var/log/mev-bot/mev_bot.log | grep "ARBITRAGE OPPORTUNITY"
# 2. Check metrics endpoint
curl -s http://localhost:9090/metrics | grep mev_
# 3. Verify cache performance
tail -100 /var/log/mev-bot/mev_bot.log | grep "cache metrics"
# Look for hit rate 75-85%
# 4. Check for errors
sudo journalctl -u mev-bot --since "10 minutes ago" | grep ERROR
# Should have minimal errors
# 5. Monitor resource usage
htop # Check CPU and memory
# CPU should be 50-80%, Memory < 2GB
# 6. Test failover (optional)
# Temporarily block primary RPC, verify fallback works
```
---
## Post-Deployment Validation
### Health Checks
```bash
# Liveness probe (should return 200)
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED"
# Readiness probe (should return 200)
curl -f http://localhost:9090/health/ready || echo "READINESS FAILED"
# Startup probe (should return 200 after initialization)
curl -f http://localhost:9090/health/startup || echo "STARTUP FAILED"
```
### Performance Metrics
```bash
# Check Prometheus metrics
curl -s http://localhost:9090/metrics | grep -E "mev_(opportunities|executions|profit|cache|rpc)"
# Expected metrics:
# - mev_opportunities_detected{} <number>
# - mev_opportunities_profitable{} <number>
# - mev_cache_hit_rate{} 0.75-0.85
# - mev_rpc_calls_total{} <number>
```
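
The expected cache hit rate can be checked automatically rather than by eye. The sketch below reads Prometheus text-format output on stdin; `metric_value` and `in_range` are hypothetical helpers, and the parsing assumes samples without timestamps and without spaces inside label values.

```bash
#!/usr/bin/env bash
# Print the value of the first sample whose metric name equals NAME.
# Labels, if present, are ignored.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # drop the {label="..."} part
      if (metric == name) { print $NF; exit }
    }'
}

# Succeed if VALUE lies within [LOW, HIGH]; awk does the float comparison.
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v >= lo && v <= hi) }'
}

# Example (not executed here):
#   rate=$(curl -s http://localhost:9090/metrics | metric_value mev_cache_hit_rate)
#   in_range "$rate" 0.75 0.85 || echo "cache hit rate outside 75-85%: $rate"
```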
### Log Analysis
```bash
# Analyze last hour of logs
./scripts/log-manager.sh analyze
# Check health score (target: > 90)
./scripts/log-manager.sh health
# Expected output:
# Health Score: 95.5/100 (Excellent)
# Error Rate: < 5%
# Cache Hit Rate: 75-85%
```
---
## Monitoring & Alerting
### Key Metrics to Monitor
| Metric | Threshold | Action |
|--------|-----------|--------|
| CPU Usage | > 90% | Scale up or investigate |
| Memory Usage | > 85% | Potential memory leak |
| Error Rate | > 10% | Check logs, may need rollback |
| RPC Failures | > 5/min | Check RPC provider |
| Opportunities/hour | < 1 | May indicate detection issue |
| Cache Hit Rate | < 70% | Review cache configuration |
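
The thresholds in this table can be applied in a single pass once the six readings are available. The following is a sketch under stated assumptions: `check_thresholds` is a hypothetical helper, the caller supplies the readings (percentages as plain numbers), and how they are collected is left open.

```bash
#!/usr/bin/env bash
# Hypothetical check that applies the thresholds from the table above.
# Arguments: cpu% mem% error% rpc_failures/min opportunities/hour cache_hit%
# Prints one line per breached threshold; prints nothing when all are fine.
check_thresholds() {
  awk -v cpu="$1" -v mem="$2" -v err="$3" -v rpc="$4" -v opp="$5" -v hit="$6" '
    BEGIN {
      if (cpu > 90) print "CPU usage above 90%"
      if (mem > 85) print "Memory usage above 85%"
      if (err > 10) print "Error rate above 10%"
      if (rpc > 5)  print "RPC failures above 5/min"
      if (opp < 1)  print "Opportunities below 1/hour"
      if (hit < 70) print "Cache hit rate below 70%"
    }'
}

check_thresholds 95 40 2 0 12 80   # prints "CPU usage above 90%"
```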
### Alert Configuration
**Slack Webhook** (edit in `config/alerts.yaml`):
```yaml
alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#mev-bot-alerts"
  thresholds:
    error_rate: 0.10 # 10%
    cpu_usage: 0.90 # 90%
    memory_usage: 0.85 # 85%
    min_opportunities_per_hour: 1
```
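
To confirm the webhook works before relying on it, a test message can be posted by hand. Slack incoming webhooks accept a JSON body with a `text` field; the sketch below builds that body. `slack_payload` is a hypothetical helper that escapes only backslashes, double quotes, and newlines, so it is not a general JSON encoder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wrap a message in the JSON body expected by a
# Slack incoming webhook.
slack_payload() {
  local text=$1
  text=${text//\\/\\\\}     # escape backslashes first
  text=${text//\"/\\\"}     # then double quotes
  text=${text//$'\n'/\\n}   # then newlines
  printf '{"text":"%s"}' "$text"
}

# Example (not executed here; SLACK_WEBHOOK_URL is an assumed variable):
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#     --data "$(slack_payload 'mev-bot: test alert')" "$SLACK_WEBHOOK_URL"
```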
### Monitoring Commands
```bash
# Real-time monitoring
watch -n 5 'systemctl status mev-bot && curl -s http://localhost:9090/metrics | grep mev_'
# Start monitoring daemon (background)
./scripts/log-manager.sh start-daemon
# View operations dashboard
./scripts/log-manager.sh dashboard
# Opens HTML dashboard in browser
```
---
## Rollback Procedures
### Quick Rollback (< 5 minutes)
```bash
# 1. Stop current version
sudo systemctl stop mev-bot
# 2. Restore previous binary
sudo cp /opt/mev-bot/bin/mev-bot.backup /opt/mev-bot/bin/mev-bot
# 3. Restart service
sudo systemctl start mev-bot
# 4. Verify rollback
sudo systemctl status mev-bot
tail -100 /var/log/mev-bot/mev_bot.log
```
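
The quick rollback silently depends on `mev-bot.backup` existing and being intact. A guarded version of step 2 might look like the sketch below; `restore_binary` is a hypothetical helper, and it only verifies that the copy matches the backup, not that the backup is the version you intended.

```bash
#!/usr/bin/env bash
# Hypothetical helper: restore CURRENT from BACKUP, refusing to proceed
# if the backup is missing or empty, and verifying the copy afterwards.
restore_binary() {
  local current=$1 backup=$2
  if [[ ! -s $backup ]]; then
    echo "no usable backup at $backup" >&2
    return 1
  fi
  cp -p "$backup" "$current" && cmp -s "$backup" "$current"
}

# Example (not executed here):
#   sudo bash -c 'source ./rollback.sh; restore_binary /opt/mev-bot/bin/mev-bot /opt/mev-bot/bin/mev-bot.backup'
```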
### Full Rollback (< 15 minutes)
```bash
# 1. Stop service
sudo systemctl stop mev-bot
# 2. Checkout previous version
cd /opt/mev-bot
git fetch
git checkout <previous-commit-hash>
# 3. Rebuild
make build
# 4. Restart service
sudo systemctl start mev-bot
# 5. Validate
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED"
```
---
## Troubleshooting
### Common Issues
#### Issue: Bot fails to start
**Symptoms:**
```
systemctl status mev-bot
● mev-bot.service - MEV Arbitrage Bot
Loaded: loaded
Active: failed (Result: exit-code)
```
**Diagnosis:**
```bash
# Check logs
sudo journalctl -u mev-bot -n 100 --no-pager
# Common causes:
# 1. Missing environment variables
# 2. Invalid RPC endpoint
# 3. Permission issues
```
**Solution:**
```bash
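# Verify environment file (contains secrets; do not paste the output into tickets or chat)
sudo cat /etc/systemd/system/mev-bot.env
# Test RPC connection manually (export ARBITRUM_RPC_ENDPOINT in your shell first)
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"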
# Fix permissions
sudo chown -R mev-bot:mev-bot /opt/mev-bot
```
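
A healthy endpoint answers the `eth_blockNumber` call with a hex-encoded block height in the `result` field. To turn that into a number you can compare between providers, something like the following sketch works; `block_number_from_response` is a hypothetical helper and uses `sed` rather than a JSON parser, so it assumes the compact single-line responses that RPC providers typically return.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the hex "result" from an eth_blockNumber
# JSON-RPC response and print it in decimal. Returns 1 if there is none
# (for example, when the response carries an "error" object instead).
block_number_from_response() {
  local hex
  hex=$(printf '%s' "$1" |
        sed -n 's/.*"result" *: *"0x\([0-9a-fA-F]\{1,\}\)".*/\1/p')
  [[ -n $hex ]] || return 1
  echo $(( 16#$hex ))
}

block_number_from_response '{"jsonrpc":"2.0","id":1,"result":"0x1b4"}'   # prints 436
```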
---
#### Issue: High error rate
**Symptoms:**
```
[ERROR] Failed to fetch pool state
[ERROR] RPC call failed
[ERROR] 429 Too Many Requests
```
**Diagnosis:**
```bash
# Check error rate
./scripts/log-manager.sh analyze | grep "Error Rate"
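# Check RPC provider status (prints the HTTP status code; 429 means rate-limited)
curl -s -o /dev/null -w '%{http_code}\n' -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"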
```
**Solution:**
```bash
# 1. Enable backup RPC endpoint in config
# 2. Reduce rate limits
# 3. Contact RPC provider
# 4. Switch to different provider
```
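
If `log-manager.sh` is unavailable, a rough error rate can be computed directly from the log. The sketch below assumes the rate is the share of lines containing `[ERROR]`, as in the symptoms above; the script's own definition may differ, and with `LOG_FORMAT=json` the pattern would need to match the JSON level field instead. `error_rate` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print the percentage
# that contain "[ERROR]", to one decimal place.
error_rate() {
  awk '
    /\[ERROR\]/ { errors++ }
    { total++ }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * errors / total
    }'
}

# Example (not executed here):
#   tail -1000 /var/log/mev-bot/mev_bot.log | error_rate
```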
---
#### Issue: No opportunities detected
**Symptoms:**
```
Blocks processed: 10000
Opportunities detected: 0
```
**Diagnosis:**
```bash
# Check if events are being detected
tail -100 /var/log/mev-bot/mev_bot.log | grep "processing.*event"
# Check profit thresholds
grep MIN_PROFIT_THRESHOLD /etc/systemd/system/mev-bot.env
```
**Solution:**
```bash
# 1. Lower MIN_PROFIT_THRESHOLD (carefully!)
# 2. Check market conditions (volatility)
# 3. Verify DEX integrations working
# 4. Review price impact thresholds
```
---
#### Issue: Memory leak
**Symptoms:**
```
Memory usage increasing over time
OOM killer may terminate process
```
**Diagnosis:**
```bash
# Monitor memory over time
watch -n 10 'ps aux | grep mev-bot | grep -v grep'
# Generate heap profile
curl http://localhost:9090/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```
**Solution:**
```bash
# 1. Restart service (temporary fix)
sudo systemctl restart mev-bot
# 2. Investigate with profiler
# 3. Check for goroutine leaks
curl "http://localhost:9090/debug/pprof/goroutine?debug=1"
# 4. May need code fix and redeploy
```
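
Watching `ps` output by eye makes slow growth easy to miss. A crude alternative is to record resident memory at intervals and test whether every sample exceeds the previous one. Both helpers below are hypothetical sketches; strictly increasing samples are only a hint of a leak, since a warming cache produces the same pattern early on.

```bash
#!/usr/bin/env bash
# Print the resident set size (KiB) of a PID, using procps-style ps.
rss_kb() { ps -o rss= -p "$1" | tr -d ' '; }

# Read one RSS sample per line on stdin; succeed only if there are at
# least two samples and each is larger than the one before it.
is_monotonic_growth() {
  awk 'NR > 1 && $1 <= prev { bad = 1 }
       { prev = $1 }
       END { exit (NR < 2 || bad) }'
}

# Example (not executed here): six samples, ten minutes apart
#   pid=$(pgrep -x mev-bot)
#   for _ in 1 2 3 4 5 6; do rss_kb "$pid"; sleep 600; done | is_monotonic_growth \
#     && echo "RSS grew in every sample; capture a heap profile"
```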
---
## Emergency Contacts
| Role | Name | Contact | Availability |
|------|------|---------|--------------|
| On-Call Engineer | TBD | +1-XXX-XXX-XXXX | 24/7 |
| DevOps Lead | TBD | Slack: @devops | Business hours |
| Product Owner | TBD | Email: product@company.com | Business hours |
## Change Log
| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2025-10-28 | 1.0 | Initial runbook | Claude Code |
---
**END OF RUNBOOK**
**Remember:**
1. Always test in staging first
2. Have rollback plan ready
3. Monitor closely after deployment
4. Document any issues encountered
5. Keep this runbook updated