
MEV Bot - Production Deployment Runbook

Version: 1.0
Last Updated: October 28, 2025
Audience: DevOps, Production Engineers


Table of Contents

  1. Pre-Deployment Checklist
  2. Environment Setup
  3. Configuration
  4. Deployment Steps
  5. Post-Deployment Validation
  6. Monitoring & Alerting
  7. Rollback Procedures
  8. Troubleshooting

Pre-Deployment Checklist

Code Readiness

  • All tests passing (make test)
  • Security audit completed and issues addressed
  • Code review approved
  • 24-hour validation test completed successfully
  • Performance benchmarks meet targets
  • No critical TODOs in codebase

Infrastructure Readiness

  • RPC endpoints configured and tested
  • Private key/wallet funded with gas (minimum 0.1 ETH)
  • Monitoring systems operational
  • Alert channels configured (Slack, email, PagerDuty)
  • Backup RPC endpoints ready
  • Database/storage systems ready

Team Readiness

  • On-call engineer assigned
  • Runbook reviewed by team
  • Communication channels established
  • Rollback plan understood
  • Emergency contacts documented

Environment Setup

System Requirements

Minimum:

  • CPU: 4 cores
  • RAM: 8 GB
  • Disk: 50 GB SSD
  • Network: 100 Mbps, low latency

Recommended (Production):

  • CPU: 8 cores
  • RAM: 16 GB
  • Disk: 100 GB NVMe SSD
  • Network: 1 Gbps, < 20ms latency to Arbitrum RPC
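The minimum figures above can be checked on a candidate host before anything is installed. This is a minimal sketch, assuming a Linux machine with `nproc`, `/proc/meminfo`, and GNU `df`; the function names `at_least` and `check_host` are illustrative and not part of the repository.

```shell
# Preflight check against the "Minimum" system requirements.
# Helper names are hypothetical; only the thresholds come from the runbook.

# at_least LABEL ACTUAL REQUIRED: print a result line, return 1 if too low.
at_least() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (need >= $3)"
  else
    echo "FAIL $1: $2 (need >= $3)"
    return 1
  fi
}

# check_host: compare this machine with the minimum requirements.
check_host() {
  cores=$(nproc)
  # MemTotal is in kB; round up to whole GiB because the kernel
  # reserves a little of the installed RAM.
  mem_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo)
  # Free space on the filesystem that holds /opt, in GiB.
  disk_gb=$(df -BG --output=avail /opt | tail -n 1 | tr -dc '0-9')
  rc=0
  at_least "CPU cores" "$cores" 4 || rc=1
  at_least "RAM (GiB)" "$mem_gb" 8 || rc=1
  at_least "Free disk (GiB)" "$disk_gb" 50 || rc=1
  return "$rc"
}
```

Calling `check_host` prints one line per resource and returns non-zero if any minimum is missed. Network latency is not covered here and still needs a manual check against the RPC provider.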

Dependencies

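# Install Go 1.24+ (since Go 1.21 the first release of a series is named 1.N.0)
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go  # remove any previous Go install before unpacking
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin  # add to ~/.profile to persist across sessions

# Verify installation
go version  # Should show go1.24 or later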

# Install build tools
sudo apt-get update
sudo apt-get install -y build-essential git curl

Repository Setup

# Clone repository
git clone https://github.com/your-org/mev-beta.git
cd mev-beta

# Check out production branch
git checkout feature/production-profit-optimization

# Verify correct branch and latest commit
git branch --show-current
git log -1 --oneline

# Install dependencies
go mod download
go mod verify

Configuration

1. Environment Variables

Create /etc/systemd/system/mev-bot.env:

# RPC Configuration
ARBITRUM_RPC_ENDPOINT=https://arbitrum-mainnet.core.chainstack.com/YOUR_KEY
ARBITRUM_WS_ENDPOINT=wss://arbitrum-mainnet.core.chainstack.com/YOUR_KEY

# Backup RPC (fallback)
BACKUP_RPC_ENDPOINT=https://arb1.arbitrum.io/rpc

# Application Configuration
LOG_LEVEL=info
LOG_FORMAT=json
LOG_OUTPUT=/var/log/mev-bot/mev_bot.log

# Metrics & Monitoring
METRICS_ENABLED=true
METRICS_PORT=9090

# Security
MEV_BOT_ENCRYPTION_KEY=your-32-char-encryption-key-here-minimum-length-required

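# Execution Configuration (IMPORTANT: Set to false for detection-only mode)
# Note: systemd does not strip trailing "# ..." text from a value,
# so comments must sit on their own lines in this file.
EXECUTION_ENABLED=false
# 1 ETH in wei
MAX_POSITION_SIZE=1000000000000000000
# 0.05 ETH in wei
MIN_PROFIT_THRESHOLD=50000000000000000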

# Provider Configuration
PROVIDER_CONFIG_PATH=/opt/mev-bot/config/providers_runtime.yaml

CRITICAL: Never commit .env files with real credentials to version control!
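Before handing the file to systemd, it may help to check that the required variables are present and that no value carries a trailing comment, since systemd's `EnvironmentFile` parser only treats lines that start with `#` or `;` as comments. The sketch below is one way to do that; `validate_env_file` and its list of required keys are assumptions made for illustration, not something the repository provides.

```shell
# validate_env_file FILE: check required keys and flag inline comments.
# The key list is an assumption based on the template above; adjust as needed.
validate_env_file() {
  file=$1
  rc=0
  for key in ARBITRUM_RPC_ENDPOINT ARBITRUM_WS_ENDPOINT \
             MEV_BOT_ENCRYPTION_KEY EXECUTION_ENABLED; do
    # A key counts as set only if something follows the "=".
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key"
      rc=1
    fi
  done
  # Whitespace followed by "#" after the "=" would become part of the value.
  if grep -En '^[A-Za-z_][A-Za-z0-9_]*=[^#]*[[:space:]]#' "$file"; then
    echo "inline comments found on the lines above"
    rc=1
  fi
  return "$rc"
}
```

A typical call would be `sudo sh -c '. ./validate.sh; validate_env_file /etc/systemd/system/mev-bot.env'`, where `validate.sh` is a hypothetical file holding the function.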

2. Provider Configuration

Edit config/providers_runtime.yaml:

providers:
  - name: "chainstack-primary"
    endpoint: "${ARBITRUM_RPC_ENDPOINT}"
    type: "https"
    weight: 100
    timeout: 30s
    rateLimit: 100

  - name: "chainstack-websocket"
    endpoint: "${ARBITRUM_WS_ENDPOINT}"
    type: "wss"
    weight: 90
    timeout: 30s
    rateLimit: 100

  - name: "public-fallback"
    endpoint: "https://arb1.arbitrum.io/rpc"
    type: "https"
    weight: 50
    timeout: 30s
    rateLimit: 50

pooling:
  maxIdleConnections: 10
  maxOpenConnections: 50
  connectionTimeout: 30s
  idleTimeout: 300s

retry:
  maxRetries: 3
  retryDelay: 1s
  backoffMultiplier: 2
  maxBackoff: 8s
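The `retry` block reads as a standard exponential backoff: wait `retryDelay`, multiply by `backoffMultiplier` after each failure, and never wait longer than `maxBackoff`. How the bot's provider layer actually interprets these fields is not shown in this runbook, so the following is only a sketch of the conventional meaning, with hypothetical function names and the values hard-coded from the YAML above.

```shell
# backoff_delay ATTEMPT: seconds to wait before retry number ATTEMPT (1-based),
# using retryDelay=1, backoffMultiplier=2, maxBackoff=8.
backoff_delay() {
  delay=1
  i=1
  while [ "$i" -lt "$1" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 8 ]; then delay=8; fi
  echo "$delay"
}

# retry_with_backoff CMD...: run CMD once, then retry up to maxRetries=3 times.
retry_with_backoff() {
  attempt=1
  until "$@"; do
    if [ "$attempt" -gt 3 ]; then
      return 1   # initial try plus three retries all failed
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
}
```

With these values the three retries wait 1 s, 2 s, and 4 s, so the 8 s cap only comes into play if `maxRetries` is raised above 3.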

3. Systemd Service Configuration

Create /etc/systemd/system/mev-bot.service:

[Unit]
Description=MEV Arbitrage Bot
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=mev-bot
Group=mev-bot
WorkingDirectory=/opt/mev-bot
EnvironmentFile=/etc/systemd/system/mev-bot.env

ExecStart=/opt/mev-bot/bin/mev-bot start
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
MemoryMax=4G
CPUQuota=400%

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/mev-bot /opt/mev-bot/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mev-bot

[Install]
WantedBy=multi-user.target

Deployment Steps

Phase 1: Build & Prepare (10-15 minutes)

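# 1. Back up the currently deployed binary (if any), then build
cd /opt/mev-bot
[ -f bin/mev-bot ] && cp -p bin/mev-bot bin/mev-bot.backup  # restored by Quick Rollback
make build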

# Verify binary
./bin/mev-bot --version
# Expected: MEV Bot v1.0.0 (or similar)

# 2. Run tests
make test
# Ensure all tests pass

# 3. Check binary size and dependencies
ls -lh bin/mev-bot
ldd bin/mev-bot  # Should show minimal dependencies

# 4. Create necessary directories
sudo mkdir -p /var/log/mev-bot
sudo mkdir -p /opt/mev-bot/data
sudo chown -R mev-bot:mev-bot /var/log/mev-bot /opt/mev-bot/data

# 5. Set permissions
chmod +x bin/mev-bot
sudo chmod 600 /etc/systemd/system/mev-bot.env  # Protect sensitive config

Phase 2: Dry Run (5-10 minutes)

# Run bot in the background to verify configuration
sudo -u mev-bot /opt/mev-bot/bin/mev-bot start &
BOT_PID=$!

# Wait 2 minutes for initialization
sleep 120

# Check if running
pgrep -a mev-bot

# Check logs for errors
tail -100 /var/log/mev-bot/mev_bot.log | grep -i error

# Verify RPC connection
tail -100 /var/log/mev-bot/mev_bot.log | grep -i "connected"

# Stop dry run
kill $BOT_PID

Phase 3: Production Start (5 minutes)

# 1. Reload systemd
sudo systemctl daemon-reload

# 2. Enable service (start on boot)
sudo systemctl enable mev-bot

# 3. Start service
sudo systemctl start mev-bot

# 4. Verify status
sudo systemctl status mev-bot
# Expected: active (running)

# 5. Check logs (last 50 lines; add -f to follow, Ctrl-C to stop)
sudo journalctl -u mev-bot --no-pager --lines=50

# 6. Wait for initialization (30-60 seconds)
sleep 60

# 7. Verify healthy operation
curl -s http://localhost:9090/health/live | jq .
# Expected: {"status": "healthy"}
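A fixed `sleep 60` either waits too long or not long enough. Polling the liveness endpoint until it answers is a common alternative; the sketch below assumes only that `curl -f` fails while the service is not yet healthy. The helper name `wait_until` is hypothetical.

```shell
# wait_until TIMEOUT INTERVAL CMD...: re-run CMD every INTERVAL seconds until it
# succeeds or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage against the liveness endpoint from this runbook (not executed here):
#   wait_until 60 5 curl -fsS -o /dev/null http://localhost:9090/health/live
```

The command is passed as separate arguments rather than a string, so no `eval` is needed and quoting stays intact.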

Phase 4: Validation (15-30 minutes)

# 1. Monitor for opportunities
tail -f /var/log/mev-bot/mev_bot.log | grep "ARBITRAGE OPPORTUNITY"

# 2. Check metrics endpoint
curl -s http://localhost:9090/metrics | grep mev_

# 3. Verify cache performance
tail -100 /var/log/mev-bot/mev_bot.log | grep "cache metrics"
# Look for hit rate 75-85%

# 4. Check for errors
sudo journalctl -u mev-bot --since "10 minutes ago" | grep ERROR
# Should have minimal errors

# 5. Monitor resource usage
htop  # Check CPU and memory
# CPU should be 50-80%, Memory < 2GB

# 6. Test failover (optional)
# Temporarily block primary RPC, verify fallback works

Post-Deployment Validation

Health Checks

# Liveness probe (should return 200)
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED"

# Readiness probe (should return 200)
curl -f http://localhost:9090/health/ready || echo "READINESS FAILED"

# Startup probe (should return 200 after initialization)
curl -f http://localhost:9090/health/startup || echo "STARTUP FAILED"

Performance Metrics

# Check Prometheus metrics
curl -s http://localhost:9090/metrics | grep -E "mev_(opportunities|executions|profit|cache|rpc)"

# Expected metrics:
# - mev_opportunities_detected{} <number>
# - mev_opportunities_profitable{} <number>
# - mev_cache_hit_rate{} 0.75-0.85
# - mev_rpc_calls_total{} <number>
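To turn the expected ranges into a pass/fail check, a single metric can be pulled out of the Prometheus text output and compared numerically. This is a sketch under two assumptions: the endpoint emits samples without trailing timestamps, and the metric names match those listed above. `metric_value` and `in_range` are illustrative helpers.

```shell
# metric_value NAME: read Prometheus text format on stdin and print the value
# of the first sample whose metric name is exactly NAME (labels are ignored).
# Assumes samples carry no trailing timestamp, so the value is the last field.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # strip any {label="..."} suffix
      if (metric == name) { print $NF; exit }
    }'
}

# in_range VALUE LOW HIGH: succeed if LOW <= VALUE <= HIGH (floats allowed).
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" \
    'BEGIN { exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0) }'
}

# Example (not executed here):
#   rate=$(curl -s http://localhost:9090/metrics | metric_value mev_cache_hit_rate)
#   in_range "$rate" 0.75 0.85 || echo "cache hit rate out of range: $rate"
```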

Log Analysis

# Analyze last hour of logs
./scripts/log-manager.sh analyze

# Check health score (target: > 90)
./scripts/log-manager.sh health

# Expected output:
# Health Score: 95.5/100 (Excellent)
# Error Rate: < 5%
# Cache Hit Rate: 75-85%

Monitoring & Alerting

Key Metrics to Monitor

| Metric | Threshold | Action |
|---|---|---|
| CPU Usage | > 90% | Scale up or investigate |
| Memory Usage | > 85% | Potential memory leak |
| Error Rate | > 10% | Check logs, may need rollback |
| RPC Failures | > 5/min | Check RPC provider |
| Opportunities/hour | < 1 | May indicate detection issue |
| Cache Hit Rate | < 70% | Review cache configuration |

Alert Configuration

Slack Webhook (edit in config/alerts.yaml):

alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#mev-bot-alerts"

  thresholds:
    error_rate: 0.10  # 10%
    cpu_usage: 0.90   # 90%
    memory_usage: 0.85  # 85%
    min_opportunities_per_hour: 1
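The webhook can be exercised by hand before relying on the bot to raise alerts. Slack incoming webhooks accept a JSON body with a `text` field; the sketch below builds that body and posts it with `curl`. The `SLACK_WEBHOOK_URL` variable and both function names are assumptions for illustration, since the runbook keeps the URL in `config/alerts.yaml`.

```shell
# slack_payload TEXT: build the JSON body for a Slack incoming webhook.
# Only backslashes and double quotes are escaped, so TEXT must be single-line.
slack_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}' "$escaped"
}

# send_alert TEXT: POST the message to the URL in SLACK_WEBHOOK_URL
# (a hypothetical environment variable, not read by the bot itself).
send_alert() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}

# Example (not executed here):
#   SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
#     send_alert "mev-bot test alert from $(hostname)"
```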

Monitoring Commands

# Real-time monitoring
watch -n 5 'systemctl status mev-bot && curl -s http://localhost:9090/metrics | grep mev_'

# Start monitoring daemon (background)
./scripts/log-manager.sh start-daemon

# View operations dashboard
./scripts/log-manager.sh dashboard
# Opens HTML dashboard in browser

Rollback Procedures

Quick Rollback (< 5 minutes)

# 1. Stop current version
sudo systemctl stop mev-bot

# 2. Restore previous binary (requires a mev-bot.backup saved before the upgrade)
sudo cp /opt/mev-bot/bin/mev-bot.backup /opt/mev-bot/bin/mev-bot

# 3. Restart service
sudo systemctl start mev-bot

# 4. Verify rollback
sudo systemctl status mev-bot
tail -100 /var/log/mev-bot/mev_bot.log

Full Rollback (< 15 minutes)

# 1. Stop service
sudo systemctl stop mev-bot

# 2. Checkout previous version
cd /opt/mev-bot
git fetch
git checkout <previous-commit-hash>

# 3. Rebuild
make build

# 4. Restart service
sudo systemctl start mev-bot

# 5. Validate
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED AFTER ROLLBACK"

Troubleshooting

Common Issues

Issue: Bot fails to start

Symptoms:

systemctl status mev-bot
● mev-bot.service - MEV Arbitrage Bot
   Loaded: loaded
   Active: failed (Result: exit-code)

Diagnosis:

# Check logs
sudo journalctl -u mev-bot -n 100 --no-pager

# Common causes:
# 1. Missing environment variables
# 2. Invalid RPC endpoint
# 3. Permission issues

Solution:

# Verify environment file (root-only once chmod 600 has been applied)
sudo cat /etc/systemd/system/mev-bot.env

# Test RPC connection manually (export ARBITRUM_RPC_ENDPOINT in this shell first)
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"

# Fix permissions
sudo chown -R mev-bot:mev-bot /opt/mev-bot

Issue: High error rate

Symptoms:

[ERROR] Failed to fetch pool state
[ERROR] RPC call failed
[ERROR] 429 Too Many Requests

Diagnosis:

# Check error rate
./scripts/log-manager.sh analyze | grep "Error Rate"

# Check RPC provider status (prints the HTTP status code, e.g. 429 when rate limited)
curl -s -o /dev/null -w '%{http_code}\n' "$ARBITRUM_RPC_ENDPOINT"

Solution:

# 1. Enable backup RPC endpoint in config
# 2. Reduce rate limits
# 3. Contact RPC provider
# 4. Switch to different provider
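If `log-manager.sh` is unavailable, a rough error rate can be computed directly from the log file. The sketch assumes plain-text lines tagged `[ERROR]`, as in the symptoms above; with `LOG_FORMAT=json` the match strings would need to change. Both function names are illustrative.

```shell
# error_rate FILE: percentage of log lines that contain "[ERROR]",
# printed with one decimal place (0.0 for an empty file).
error_rate() {
  awk '
    index($0, "[ERROR]") { errors++ }
    END { if (NR == 0) print "0.0"; else printf "%.1f\n", 100 * errors / NR }
  ' "$1"
}

# rate_limited_count FILE: number of lines reporting HTTP 429.
# grep -c still prints 0 when nothing matches, although it exits non-zero.
rate_limited_count() {
  grep -c '429 Too Many Requests' "$1"
}

# Example (not executed here):
#   error_rate /var/log/mev-bot/mev_bot.log
#   rate_limited_count /var/log/mev-bot/mev_bot.log
```

A high share of 429 lines among the errors points at rate limiting rather than a provider outage, which favors lowering `rateLimit` before switching providers.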

Issue: No opportunities detected

Symptoms:

Blocks processed: 10000
Opportunities detected: 0

Diagnosis:

# Check if events are being detected
tail -100 /var/log/mev-bot/mev_bot.log | grep "processing.*event"

# Check profit thresholds
grep MIN_PROFIT_THRESHOLD /etc/systemd/system/mev-bot.env

Solution:

# 1. Lower MIN_PROFIT_THRESHOLD (carefully!)
# 2. Check market conditions (volatility)
# 3. Verify DEX integrations working
# 4. Review price impact thresholds
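Thresholds are stored in wei, which is easy to misread by a factor of ten. The helper below prints a wei amount as ETH so a changed `MIN_PROFIT_THRESHOLD` can be sanity-checked before restarting the service. It is a sketch that works on the digit string rather than shell arithmetic, so large values do not overflow; the name `wei_to_eth` is hypothetical.

```shell
# wei_to_eth WEI: print a non-negative integer wei amount as ETH with
# 18 decimal places. Pure string manipulation, so no 64-bit overflow.
wei_to_eth() {
  wei=$1
  # Left-pad with zeros so at least one digit precedes the decimal point.
  while [ "${#wei}" -lt 19 ]; do
    wei="0$wei"
  done
  # Remove everything except the last 18 digits to get the fraction.
  frac=${wei#"${wei%??????????????????}"}
  whole=${wei%"$frac"}
  echo "$whole.$frac"
}

# Example (not executed here):
#   wei_to_eth "$(sudo grep '^MIN_PROFIT_THRESHOLD=' \
#     /etc/systemd/system/mev-bot.env | cut -d= -f2)"
```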

Issue: Memory leak

Symptoms:

Memory usage increasing over time
OOM killer may terminate process

Diagnosis:

# Monitor memory over time
watch -n 10 'ps aux | grep mev-bot | grep -v grep'

# Generate heap profile
curl http://localhost:9090/debug/pprof/heap > heap.prof
go tool pprof heap.prof

Solution:

# 1. Restart service (temporary fix)
sudo systemctl restart mev-bot

# 2. Investigate with profiler
# 3. Check for goroutine leaks
curl 'http://localhost:9090/debug/pprof/goroutine?debug=1'

# 4. May need code fix and redeploy
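A leak is easier to confirm from a series of timestamped samples than from watching `ps` output. The sketch below reads the resident set size from `/proc`, which assumes a Linux host; the function names are illustrative.

```shell
# rss_kb PID: resident set size of PID in kB, read from /proc (Linux only).
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# sample_rss PID COUNT INTERVAL: print "epoch_seconds rss_kb" COUNT times,
# sleeping INTERVAL seconds between samples.
sample_rss() {
  n=0
  while [ "$n" -lt "$2" ]; do
    echo "$(date +%s) $(rss_kb "$1")"
    n=$((n + 1))
    [ "$n" -lt "$2" ] && sleep "$3"
  done
  return 0
}

# Example: one sample a minute for an hour (not executed here):
#   pid=$(systemctl show -p MainPID --value mev-bot)
#   sample_rss "$pid" 60 60 > rss.log
```

Steady growth across samples taken under similar load suggests a leak; a value that rises and then plateaus is more likely cache warm-up.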

Emergency Contacts

| Role | Name | Contact | Availability |
|---|---|---|---|
| On-Call Engineer | TBD | +1-XXX-XXX-XXXX | 24/7 |
| DevOps Lead | TBD | Slack: @devops | Business hours |
| Product Owner | TBD | Email: product@company.com | Business hours |

Change Log

| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-28 | 1.0 | Initial runbook | Claude Code |

END OF RUNBOOK

Remember:

  1. Always test in staging first
  2. Have rollback plan ready
  3. Monitor closely after deployment
  4. Document any issues encountered
  5. Keep this runbook updated