mev-beta/docs/PRODUCTION_DEPLOYMENT_RUNBOOK.md

# MEV Bot - Production Deployment Runbook
**Version:** 1.0
**Last Updated:** October 28, 2025
**Audience:** DevOps, Production Engineers

---

## Table of Contents

1. [Pre-Deployment Checklist](#pre-deployment-checklist)
2. [Environment Setup](#environment-setup)
3. [Configuration](#configuration)
4. [Deployment Steps](#deployment-steps)
5. [Post-Deployment Validation](#post-deployment-validation)
6. [Monitoring & Alerting](#monitoring--alerting)
7. [Rollback Procedures](#rollback-procedures)
8. [Troubleshooting](#troubleshooting)
9. [Emergency Contacts](#emergency-contacts)
10. [Change Log](#change-log)

---

## Pre-Deployment Checklist

### Code Readiness
- [ ] All tests passing (`make test`)
- [ ] Security audit completed and issues addressed
- [ ] Code review approved
- [ ] 24-hour validation test completed successfully
- [ ] Performance benchmarks meet targets
- [ ] No critical TODOs in codebase

### Infrastructure Readiness
- [ ] RPC endpoints configured and tested
- [ ] Private key/wallet funded with gas (minimum 0.1 ETH)
- [ ] Monitoring systems operational
- [ ] Alert channels configured (Slack, email, PagerDuty)
- [ ] Backup RPC endpoints ready
- [ ] Database/storage systems ready

### Team Readiness
- [ ] On-call engineer assigned
- [ ] Runbook reviewed by team
- [ ] Communication channels established
- [ ] Rollback plan understood
- [ ] Emergency contacts documented

---

## Environment Setup

### System Requirements

**Minimum:**
- CPU: 4 cores
- RAM: 8 GB
- Disk: 50 GB SSD
- Network: 100 Mbps, low latency

**Recommended (Production):**
- CPU: 8 cores
- RAM: 16 GB
- Disk: 100 GB NVMe SSD
- Network: 1 Gbps, < 20ms latency to Arbitrum RPC
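
The minimum requirements above can be checked before installing anything. The following is a minimal sketch, not part of the repository: the helper names `meets_minimum` and `preflight` are hypothetical, the thresholds are copied from the "Minimum" list, and it assumes a Linux host with GNU `df` and that free space on the filesystem holding `/opt` is the relevant disk figure.

```bash
# Hypothetical preflight helper; thresholds mirror the "Minimum" list above.

# meets_minimum LABEL ACTUAL REQUIRED -> prints a result line, returns 1 if below the minimum
meets_minimum() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (minimum $3)"
  else
    echo "FAIL $1: $2 (minimum $3)"
    return 1
  fi
}

# preflight -> checks CPU cores, RAM, and free disk space on this host
preflight() {
  status=0
  cores=$(nproc)
  # MemTotal is in KiB; round up because the kernel reports slightly less than the installed RAM
  ram_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo)
  # Free space on the filesystem that holds /opt, in GiB (GNU df)
  disk_gb=$(df -BG --output=avail /opt | tail -n 1 | tr -dc '0-9')
  meets_minimum "CPU cores" "$cores" 4 || status=1
  meets_minimum "RAM (GB)" "$ram_gb" 8 || status=1
  meets_minimum "Free disk (GB)" "$disk_gb" 50 || status=1
  return "$status"
}
```

Run `preflight` on the target host; a non-zero exit status means at least one minimum is not met. Network latency is not covered here and still needs a manual check against the RPC provider.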

### Dependencies

```bash
# Install build tools (wget, jq, and htop are used later in this runbook)
sudo apt-get update
sudo apt-get install -y build-essential git curl wget jq htop

# Install Go 1.24+ (release archives include the patch version, e.g. go1.24.0)
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin

# Verify installation
go version  # Should show go1.24 or later
```

### Repository Setup

```bash
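# Clone repository (later steps assume the checkout lives at /opt/mev-bot)
sudo mkdir -p /opt/mev-bot
sudo chown "$USER" /opt/mev-bot
git clone https://github.com/your-org/mev-beta.git /opt/mev-bot
cd /opt/mev-bot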

# Checkout production branch
git checkout feature/production-profit-optimization

# Verify correct branch and latest commit
git branch --show-current
git log -1 --oneline

# Install dependencies
go mod download
go mod verify
```

---

## Configuration

### 1. Environment Variables

Create `/etc/systemd/system/mev-bot.env`:

```bash
# RPC Configuration
ARBITRUM_RPC_ENDPOINT=https://arbitrum-mainnet.core.chainstack.com/YOUR_KEY
ARBITRUM_WS_ENDPOINT=wss://arbitrum-mainnet.core.chainstack.com/YOUR_KEY

# Backup RPC (fallback)
BACKUP_RPC_ENDPOINT=https://arb1.arbitrum.io/rpc

# Application Configuration
LOG_LEVEL=info
LOG_FORMAT=json
LOG_OUTPUT=/var/log/mev-bot/mev_bot.log

# Metrics & Monitoring
METRICS_ENABLED=true
METRICS_PORT=9090

# Security
MEV_BOT_ENCRYPTION_KEY=your-32-char-encryption-key-here-minimum-length-required

# Execution Configuration (IMPORTANT: Set to false for detection-only mode)
# Keep comments on their own lines: systemd does not strip trailing "# ..." text from values.
EXECUTION_ENABLED=false
# 1 ETH in wei
MAX_POSITION_SIZE=1000000000000000000
# 0.05 ETH in wei
MIN_PROFIT_THRESHOLD=50000000000000000

# Provider Configuration
PROVIDER_CONFIG_PATH=/opt/mev-bot/config/providers_runtime.yaml
```

**CRITICAL:** Never commit `.env` files with real credentials to version control!
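
Because systemd's `EnvironmentFile=` only treats lines that start with `#` or `;` as comments, a trailing `# ...` on a value line ends up inside the value. The sketch below lints the env file for that and for missing keys. The function name `lint_env_file` is hypothetical, and the list of "required" keys is an assumption drawn from the example above, not from the bot's source.

```bash
# lint_env_file FILE -> reports missing keys and inline comments, returns 1 on any finding
lint_env_file() {
  file=$1
  status=0
  # Assumed required keys (taken from the example env file, not from the bot's code)
  for key in ARBITRUM_RPC_ENDPOINT ARBITRUM_WS_ENDPOINT MEV_BOT_ENCRYPTION_KEY EXECUTION_ENABLED; do
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key"
      status=1
    fi
  done
  # Flag KEY=value lines that carry whitespace followed by "#" after the value
  if grep -En '^[A-Za-z_][A-Za-z0-9_]*=[^#]*[[:space:]]#' "$file"; then
    echo "inline comments found: systemd would keep them as part of the value"
    status=1
  fi
  return "$status"
}
```

Usage: `sudo bash -c '. ./lint.sh; lint_env_file /etc/systemd/system/mev-bot.env'`, assuming the function is saved in a file named `lint.sh`.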

### 2. Provider Configuration

Edit `config/providers_runtime.yaml`:

```yaml
providers:
  - name: "chainstack-primary"
    endpoint: "${ARBITRUM_RPC_ENDPOINT}"
    type: "https"
    weight: 100
    timeout: 30s
    rateLimit: 100

  - name: "chainstack-websocket"
    endpoint: "${ARBITRUM_WS_ENDPOINT}"
    type: "wss"
    weight: 90
    timeout: 30s
    rateLimit: 100

  - name: "public-fallback"
    endpoint: "https://arb1.arbitrum.io/rpc"
    type: "https"
    weight: 50
    timeout: 30s
    rateLimit: 50

pooling:
  maxIdleConnections: 10
  maxOpenConnections: 50
  connectionTimeout: 30s
  idleTimeout: 300s

retry:
  maxRetries: 3
  retryDelay: 1s
  backoffMultiplier: 2
  maxBackoff: 8s
```
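
To make the `retry` settings concrete, the sketch below prints the wait before each retry under one plausible reading of the fields: the delay starts at `retryDelay`, is multiplied by `backoffMultiplier` after each attempt, and is capped at `maxBackoff`. How the bot actually interprets these values is an assumption; `backoff_schedule` is a hypothetical helper, not part of the codebase.

```bash
# backoff_schedule MAX_RETRIES INITIAL_DELAY MULTIPLIER MAX_BACKOFF (all in whole seconds)
backoff_schedule() {
  delay=$2
  attempt=1
  while [ "$attempt" -le "$1" ]; do
    echo "retry $attempt: wait ${delay}s"
    # Grow the delay for the next attempt, capped at MAX_BACKOFF
    delay=$(( delay * $3 ))
    if [ "$delay" -gt "$4" ]; then
      delay=$4
    fi
    attempt=$(( attempt + 1 ))
  done
}

# Values from the retry section above: maxRetries=3, retryDelay=1s, backoffMultiplier=2, maxBackoff=8s
backoff_schedule 3 1 2 8
# retry 1: wait 1s
# retry 2: wait 2s
# retry 3: wait 4s
```

With these values the cap of 8s is never reached; it would only apply from a fourth retry onward.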

### 3. Systemd Service Configuration

Create `/etc/systemd/system/mev-bot.service`:

```ini
[Unit]
Description=MEV Arbitrage Bot
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=mev-bot
Group=mev-bot
WorkingDirectory=/opt/mev-bot
EnvironmentFile=/etc/systemd/system/mev-bot.env

ExecStart=/opt/mev-bot/bin/mev-bot start
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
MemoryMax=4G
CPUQuota=400%

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/mev-bot /opt/mev-bot/data

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=mev-bot

[Install]
WantedBy=multi-user.target
```

---

## Deployment Steps

### Phase 1: Build & Prepare (10-15 minutes)

```bash
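# 1. Back up the current binary (used by Quick Rollback), then build
cd /opt/mev-bot
[ -f bin/mev-bot ] && cp -p bin/mev-bot bin/mev-bot.backup
make build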

# Verify binary
./bin/mev-bot --version
# Expected: MEV Bot v1.0.0 (or similar)

# 2. Run tests
make test
# Ensure all tests pass

# 3. Check binary size and dependencies
ls -lh bin/mev-bot
ldd bin/mev-bot  # Should show minimal dependencies

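# 4. Create the service user (if missing) and necessary directories
id mev-bot >/dev/null 2>&1 || sudo useradd --system --no-create-home --shell /usr/sbin/nologin mev-bot
sudo mkdir -p /var/log/mev-bot
sudo mkdir -p /opt/mev-bot/data
sudo chown -R mev-bot:mev-bot /var/log/mev-bot /opt/mev-bot/data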

# 5. Set permissions
chmod +x bin/mev-bot
sudo chmod 600 /etc/systemd/system/mev-bot.env  # Protect sensitive config
```

### Phase 2: Dry Run (5-10 minutes)

```bash
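# Run the bot in the background to verify configuration.
# sudo resets the environment, so load the service env file in a root shell first.
sudo bash -c 'cd /opt/mev-bot; set -a; . /etc/systemd/system/mev-bot.env; set +a; exec runuser -u mev-bot -- /opt/mev-bot/bin/mev-bot start' &

# Wait 2 minutes for initialization
sleep 120

# Check if running (pgrep does not match itself, unlike ps | grep)
pgrep -a -u mev-bot -f 'mev-bot start'

# Check logs for errors
tail -100 /var/log/mev-bot/mev_bot.log | grep -i error

# Verify RPC connection
tail -100 /var/log/mev-bot/mev_bot.log | grep -i "connected"

# Stop dry run ($! would be the sudo wrapper, so signal the bot process directly)
sudo pkill -u mev-bot -f 'mev-bot start'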
```

### Phase 3: Production Start (5 minutes)

```bash
# 1. Reload systemd
sudo systemctl daemon-reload

# 2. Enable service (start on boot)
sudo systemctl enable mev-bot

# 3. Start service
sudo systemctl start mev-bot

# 4. Verify status
sudo systemctl status mev-bot
# Expected: active (running)

# 5. Check logs (no -f here: following would block the remaining steps)
sudo journalctl -u mev-bot --no-pager --lines=50

# 6. Wait for initialization (30-60 seconds)
sleep 60

# 7. Verify healthy operation
curl -s http://localhost:9090/health/live | jq .
# Expected: {"status": "healthy"}
```
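
A fixed `sleep 60` either waits longer than needed or not long enough. As an alternative, the sketch below polls a command until it succeeds or a timeout expires. `wait_until` is a hypothetical helper, and the health URL is the one used in this runbook.

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds; returns 0 on first success, 1 once TIMEOUT is exceeded.
wait_until() {
  limit=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    elapsed=$(( elapsed + interval ))
    if [ "$elapsed" -gt "$limit" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example: wait up to 60 seconds for the liveness endpoint, checking every 2 seconds
# wait_until 60 2 curl -fs http://localhost:9090/health/live || echo "bot did not become healthy"
```

Passing the probe as a command keeps the helper reusable for the readiness and startup endpoints as well.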

### Phase 4: Validation (15-30 minutes)

```bash
# 1. Monitor for opportunities
tail -f /var/log/mev-bot/mev_bot.log | grep "ARBITRAGE OPPORTUNITY"

# 2. Check metrics endpoint
curl -s http://localhost:9090/metrics | grep mev_

# 3. Verify cache performance
tail -100 /var/log/mev-bot/mev_bot.log | grep "cache metrics"
# Look for hit rate 75-85%

# 4. Check for errors
sudo journalctl -u mev-bot --since "10 minutes ago" | grep ERROR
# Should have minimal errors

# 5. Monitor resource usage
htop  # Check CPU and memory
# CPU should be 50-80%, Memory < 2GB

# 6. Test failover (optional)
# Temporarily block primary RPC, verify fallback works
```

---

## Post-Deployment Validation

### Health Checks

```bash
# Liveness probe (should return 200)
curl -f http://localhost:9090/health/live || echo "LIVENESS FAILED"

# Readiness probe (should return 200)
curl -f http://localhost:9090/health/ready || echo "READINESS FAILED"

# Startup probe (should return 200 after initialization)
curl -f http://localhost:9090/health/startup || echo "STARTUP FAILED"
```

### Performance Metrics

```bash
# Check Prometheus metrics
curl -s http://localhost:9090/metrics | grep -E "mev_(opportunities|executions|profit|cache|rpc)"

# Expected metrics:
# - mev_opportunities_detected{} <number>
# - mev_opportunities_profitable{} <number>
# - mev_cache_hit_rate{} 0.75-0.85
# - mev_rpc_calls_total{} <number>
```
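
For scripted checks it helps to pull a single value out of the metrics output rather than reading it by eye. The sketch below does that with `awk`; `metric_value` is a hypothetical helper, and it assumes the standard Prometheus text format with the value as the last field on the line (no trailing timestamp).

```bash
# metric_value NAME -> reads Prometheus text format on stdin, prints the first sample value for NAME
# Returns 1 if the metric is not present.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      i = index(metric, "{")           # drop the label set, if any
      if (i > 0) metric = substr(metric, 1, i - 1)
      if (metric == name) { print $NF; found = 1; exit }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example:
# curl -s http://localhost:9090/metrics | metric_value mev_cache_hit_rate
```

If a metric is exported once per label set, this returns only the first sample; summing across labels would need a small change to the `awk` program.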

### Log Analysis

```bash
# Analyze last hour of logs
./scripts/log-manager.sh analyze

# Check health score (target: > 90)
./scripts/log-manager.sh health

# Expected output:
# Health Score: 95.5/100 (Excellent)
# Error Rate: < 5%
# Cache Hit Rate: 75-85%
```

---

## Monitoring & Alerting

### Key Metrics to Monitor

| Metric | Threshold | Action |
|--------|-----------|--------|
| CPU Usage | > 90% | Scale up or investigate |
| Memory Usage | > 85% | Potential memory leak |
| Error Rate | > 10% | Check logs, may need rollback |
| RPC Failures | > 5/min | Check RPC provider |
| Opportunities/hour | < 1 | May indicate detection issue |
| Cache Hit Rate | < 70% | Review cache configuration |

### Alert Configuration

**Slack Webhook** (edit in `config/alerts.yaml`):
```yaml
alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    channel: "#mev-bot-alerts"

  thresholds:
    error_rate: 0.10  # 10%
    cpu_usage: 0.90   # 90%
    memory_usage: 0.85  # 85%
    min_opportunities_per_hour: 1
```
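
The thresholds are fractions, so comparing them in the shell needs floating-point support. The sketch below delegates the comparison to `awk`. `breaches` is a hypothetical helper for ad hoc checks and is not how the bot's own alerting evaluates `config/alerts.yaml`.

```bash
# breaches NAME VALUE OP THRESHOLD -> OP is ">" or "<"
# Prints an alert line and returns 0 when VALUE OP THRESHOLD holds; returns 1 otherwise.
breaches() {
  awk -v v="$2" -v op="$3" -v t="$4" 'BEGIN {
    # Adding 0 forces numeric rather than string comparison
    hit = (op == ">") ? (v + 0 > t + 0) : (v + 0 < t + 0)
    exit (hit ? 0 : 1)
  }' && echo "ALERT $1: $2 $3 $4"
}

# Examples using the thresholds above:
# breaches error_rate 0.12 ">" 0.10       -> prints "ALERT error_rate: 0.12 > 0.10"
# breaches memory_usage 0.60 ">" 0.85     -> prints nothing, returns 1
```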

### Monitoring Commands

```bash
# Real-time monitoring
watch -n 5 'systemctl status mev-bot && curl -s http://localhost:9090/metrics | grep mev_'

# Start monitoring daemon (background)
./scripts/log-manager.sh start-daemon

# View operations dashboard
./scripts/log-manager.sh dashboard
# Opens HTML dashboard in browser
```

---

## Rollback Procedures

### Quick Rollback (< 5 minutes)

```bash
# 1. Stop current version
sudo systemctl stop mev-bot

# 2. Restore previous binary
sudo cp /opt/mev-bot/bin/mev-bot.backup /opt/mev-bot/bin/mev-bot

# 3. Restart service
sudo systemctl start mev-bot

# 4. Verify rollback
sudo systemctl status mev-bot
tail -100 /var/log/mev-bot/mev_bot.log
```
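
The quick rollback depends on `mev-bot.backup` existing and being usable. The sketch below wraps the copy in two safety checks: it refuses to proceed when the backup is missing or empty, and it keeps the failed binary for later analysis. `restore_backup` and the `.failed` suffix are hypothetical names, not existing conventions of this project.

```bash
# restore_backup BINARY -> replaces BINARY with BINARY.backup after basic sanity checks
restore_backup() {
  bin=$1
  backup="$1.backup"
  # -s is true only for files that exist and are non-empty
  if [ ! -s "$backup" ]; then
    echo "no usable backup at $backup" >&2
    return 1
  fi
  # Keep the failed build instead of overwriting it
  if [ -f "$bin" ]; then
    cp -p "$bin" "$bin.failed" || return 1
  fi
  cp -p "$backup" "$bin"
}

# Example (run as root, with the service stopped first):
# restore_backup /opt/mev-bot/bin/mev-bot
```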

### Full Rollback (< 15 minutes)

```bash
# 1. Stop service
sudo systemctl stop mev-bot

# 2. Checkout previous version
cd /opt/mev-bot
git fetch
git checkout <previous-commit-hash>

# 3. Rebuild
make build

# 4. Restart service
sudo systemctl start mev-bot

# 5. Validate (allow up to a minute for initialization; -f makes curl fail on HTTP errors)
curl -f http://localhost:9090/health/live
```

---

## Troubleshooting

### Common Issues

#### Issue: Bot fails to start

**Symptoms:**
```
systemctl status mev-bot
● mev-bot.service - MEV Arbitrage Bot
   Loaded: loaded
   Active: failed (Result: exit-code)
```

**Diagnosis:**
```bash
# Check logs
sudo journalctl -u mev-bot -n 100 --no-pager

# Common causes:
# 1. Missing environment variables
# 2. Invalid RPC endpoint
# 3. Permission issues
```

**Solution:**
```bash
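# Verify environment file (contains secrets: do not paste the output into tickets or chat)
sudo cat /etc/systemd/system/mev-bot.env

# Test RPC connection manually
# (the variable is not set in an interactive shell, so read it from the env file)
ARBITRUM_RPC_ENDPOINT=$(sudo sed -n 's/^ARBITRUM_RPC_ENDPOINT=//p' /etc/systemd/system/mev-bot.env)
curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"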

# Fix permissions
sudo chown -R mev-bot:mev-bot /opt/mev-bot
```

---

#### Issue: High error rate

**Symptoms:**
```
[ERROR] Failed to fetch pool state
[ERROR] RPC call failed
[ERROR] 429 Too Many Requests
```

**Diagnosis:**
```bash
# Check error rate
./scripts/log-manager.sh analyze | grep "Error Rate"

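# Check RPC provider status (a healthy endpoint returns the latest block number as JSON)
ARBITRUM_RPC_ENDPOINT=$(sudo sed -n 's/^ARBITRUM_RPC_ENDPOINT=//p' /etc/systemd/system/mev-bot.env)
curl -s -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  "$ARBITRUM_RPC_ENDPOINT"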
```

**Solution:**
```bash
# 1. Enable backup RPC endpoint in config
# 2. Reduce rate limits
# 3. Contact RPC provider
# 4. Switch to different provider
```

---

#### Issue: No opportunities detected

**Symptoms:**
```
Blocks processed: 10000
Opportunities detected: 0
```

**Diagnosis:**
```bash
# Check if events are being detected
tail -100 /var/log/mev-bot/mev_bot.log | grep "processing.*event"

# Check profit thresholds
sudo grep MIN_PROFIT_THRESHOLD /etc/systemd/system/mev-bot.env
```

**Solution:**
```bash
# 1. Lower MIN_PROFIT_THRESHOLD (carefully!)
# 2. Check market conditions (volatility)
# 3. Verify DEX integrations working
# 4. Review price impact thresholds
```
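
Thresholds in the env file are expressed in wei, which is easy to misread by a factor of ten. The sketch below converts a wei amount to ETH using integer arithmetic only. `wei_to_eth` is a hypothetical helper, and because it relies on 64-bit shell arithmetic it is only valid for amounts below roughly 9.2 ETH.

```bash
# wei_to_eth WEI -> prints the amount in ETH with 18 decimal places
# Limitation: shell integers are 64-bit, so WEI must be below 9223372036854775808 (about 9.2 ETH).
wei_to_eth() {
  wei=$1
  whole=$(( wei / 1000000000000000000 ))
  frac=$(( wei % 1000000000000000000 ))
  # Zero-pad the fractional part to 18 digits
  printf '%d.%018d\n' "$whole" "$frac"
}

wei_to_eth 50000000000000000     # 0.050000000000000000 (the MIN_PROFIT_THRESHOLD above)
wei_to_eth 1000000000000000000   # 1.000000000000000000 (the MAX_POSITION_SIZE above)
```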

---

#### Issue: Memory leak

**Symptoms:**
```
Memory usage increasing over time
OOM killer may terminate process
```

**Diagnosis:**
```bash
# Monitor memory over time
watch -n 10 'ps aux | grep mev-bot | grep -v grep'

# Generate heap profile
curl -s -o heap.prof http://localhost:9090/debug/pprof/heap
go tool pprof heap.prof
```

**Solution:**
```bash
# 1. Restart service (temporary fix)
sudo systemctl restart mev-bot

# 2. Investigate with profiler
# 3. Check for goroutine leaks
curl "http://localhost:9090/debug/pprof/goroutine?debug=1"

# 4. May need code fix and redeploy
```
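
To tell a genuine leak from normal fluctuation, it helps to measure memory growth over a fixed window instead of watching `ps` output. The sketch below samples the resident set size from `/proc`. `rss_kb` and `rss_growth_kb` are hypothetical helpers and assume a Linux host.

```bash
# rss_kb PID -> resident set size in KiB, read from /proc (empty output if the process is gone)
rss_kb() {
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# rss_growth_kb PID SAMPLES INTERVAL_SECONDS -> RSS change in KiB between the first and last sample
rss_growth_kb() {
  first=$(rss_kb "$1")
  [ -n "$first" ] || return 1
  last=$first
  n=1
  while [ "$n" -lt "$2" ]; do
    sleep "$3"
    last=$(rss_kb "$1")
    # Stop if the process exited while sampling
    [ -n "$last" ] || return 1
    n=$(( n + 1 ))
  done
  echo $(( last - first ))
}

# Example: growth over 10 minutes (60 samples, 10 seconds apart)
# rss_growth_kb "$(pgrep -u mev-bot -o -f 'mev-bot start')" 60 10
```

A steadily positive result across several windows is a stronger leak signal than a single high reading, since Go's garbage collector releases memory to the OS lazily.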

---

## Emergency Contacts

| Role | Name | Contact | Availability |
|------|------|---------|--------------|
| On-Call Engineer | TBD | +1-XXX-XXX-XXXX | 24/7 |
| DevOps Lead | TBD | Slack: @devops | Business hours |
| Product Owner | TBD | Email: product@company.com | Business hours |

## Change Log

| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2025-10-28 | 1.0 | Initial runbook | Claude Code |

---

**END OF RUNBOOK**

**Remember:**
1. Always test in staging first
2. Have a rollback plan ready
3. Monitor closely after deployment
4. Document any issues encountered
5. Keep this runbook updated