fix(critical): complete execution pipeline - all blockers fixed and operational
docs/CRITICAL_FIXES_RECOMMENDATIONS_20251030.md

# Critical Fixes and Recommendations

**Date**: 2025-10-30
**Priority**: URGENT - Production System Failure
**Related**: LOG_ANALYSIS_COMPREHENSIVE_REPORT_20251030.md

## 🚨 IMMEDIATE ACTIONS (Next 24 Hours)

### Priority 0: Fix WebSocket Connection
**Issue**: 9,065 "unsupported protocol scheme wss" errors
**Impact**: Cannot connect to the Arbitrum network via WebSocket

#### Root Cause
The code uses an HTTP client (`http.Post`) to connect to WebSocket URLs (`wss://`).

#### Fix Required

**File**: `pkg/arbitrum/connection.go` or `pkg/monitor/concurrent.go`

**Current (Incorrect)**:
```go
// Somewhere in connection initialization
client, err := rpc.Dial(wsEndpoint) // or a similar HTTP-based call
resp, err := http.Post(wsEndpoint, ...) // WRONG for WebSocket
```

**Fixed (Correct)**:
```go
import (
	"fmt"

	"github.com/ethereum/go-ethereum/ethclient"
)

// For WebSocket connections
func connectWebSocket(wsURL string) (*ethclient.Client, error) {
	client, err := ethclient.Dial(wsURL)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to %s: %w", wsURL, err)
	}
	return client, nil
}

// For HTTP connections (fallback)
func connectHTTP(httpURL string) (*ethclient.Client, error) {
	client, err := ethclient.Dial(httpURL)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to %s: %w", httpURL, err)
	}
	return client, nil
}
```

**Implementation Steps**:
1. Locate the RPC client initialization code
2. Check whether it uses `rpc.Dial()` or `ethclient.Dial()`
3. Ensure WebSocket URLs use `ethclient.Dial()` directly
4. Remove any HTTP POST attempts against WebSocket endpoints
5. Test the connection with: `timeout 30 ./mev-bot start`
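Steps 2-4 amount to routing every endpoint through one scheme-aware dialer. A minimal sketch of that routing, assuming a hypothetical `validateEndpointScheme` helper (not existing code): it accepts exactly the schemes `ethclient.Dial` understands, so a `wss://` URL can never reach a plain HTTP client.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateEndpointScheme rejects any endpoint whose scheme ethclient.Dial
// does not understand (ws/wss/http/https), turning the "unsupported
// protocol scheme wss" failure into an explicit startup error.
func validateEndpointScheme(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("invalid endpoint %q: %w", endpoint, err)
	}
	switch u.Scheme {
	case "ws", "wss", "http", "https":
		return u.Scheme, nil
	default:
		return "", fmt.Errorf("unsupported protocol scheme %q", u.Scheme)
	}
}

func main() {
	for _, ep := range []string{
		"wss://arbitrum-mainnet.core.chainstack.com/abc",
		"https://arb1.arbitrum.io/rpc",
	} {
		s, err := validateEndpointScheme(ep)
		fmt.Println(s, err)
	}
}
```

Validating the scheme once at configuration load makes the failure visible immediately instead of 9,065 times at runtime.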
**Validation**:

```bash
# Should see a successful WebSocket connection
LOG_LEVEL=debug ./mev-bot start 2>&1 | grep -i "websocket\|wss"
```

### Priority 0: Fix Zero Address Parsing
**Issue**: 100% of liquidity events contain zero addresses
**Impact**: Invalid event data, corrupted arbitrage detection

#### Root Cause
Token address extraction from transaction logs is returning zero addresses instead of the actual token addresses.

#### Fix Required

**File**: `pkg/arbitrum/abi_decoder.go`

**Current Issue**: The token extraction logic is likely doing:
```go
// WRONG - returns the zero address on extraction failure
func extractTokenAddress(log types.Log) common.Address {
	// If parsing fails, this returns common.Address{}, which is 0x000...
	return common.Address{}
}
```

**Fixed Implementation**:
```go
func extractTokenAddress(log types.Log, topicIndex int) (common.Address, error) {
	if len(log.Topics) <= topicIndex {
		return common.Address{}, fmt.Errorf("topic index %d out of range", topicIndex)
	}

	address := common.BytesToAddress(log.Topics[topicIndex].Bytes())

	// CRITICAL: Validate that the address is not zero
	if address == (common.Address{}) {
		return common.Address{}, fmt.Errorf("extracted zero address from topic %d", topicIndex)
	}

	return address, nil
}

// For event parsing
func parseSwapEvent(log types.Log) (*SwapEvent, error) {
	// Extract token addresses from the pool
	pool, err := getPoolContract(log.Address)
	if err != nil {
		return nil, fmt.Errorf("failed to get pool: %w", err)
	}

	token0, err := pool.Token0(nil)
	if err != nil {
		return nil, fmt.Errorf("failed to get token0: %w", err)
	}

	token1, err := pool.Token1(nil)
	if err != nil {
		return nil, fmt.Errorf("failed to get token1: %w", err)
	}

	// Validate addresses
	if token0 == (common.Address{}) || token1 == (common.Address{}) {
		return nil, fmt.Errorf("zero address detected: token0=%s, token1=%s", token0.Hex(), token1.Hex())
	}

	return &SwapEvent{
		Token0Address: token0,
		Token1Address: token1,
		// ...
	}, nil
}
```

**Additional Checks Needed**:
1. Add validation before event submission
2. Log and skip events with zero addresses
3. Add metrics for zero address detections
4. Review the pool contract call logic
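Checks 1-3 above (validate, then log and skip rather than submit, while counting detections) can be sketched as a small gate in front of the event pipeline. This is an illustrative sketch: the local `Address` type stands in for go-ethereum's `common.Address`, and the `zeroAddressCount` metric name is an assumption, not existing code.

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// Address stands in for common.Address (20-byte array) in this sketch.
type Address [20]byte

func (a Address) IsZero() bool { return a == (Address{}) }

type SwapEvent struct {
	Token0Address Address
	Token1Address Address
}

// zeroAddressCount is a stand-in metric for zero-address detections.
var zeroAddressCount atomic.Int64

// validateEvent bumps the metric and returns an error instead of letting
// a zero-address event reach arbitrage detection.
func validateEvent(ev *SwapEvent) error {
	if ev.Token0Address.IsZero() || ev.Token1Address.IsZero() {
		zeroAddressCount.Add(1)
		return fmt.Errorf("zero address in event: token0=%v token1=%v",
			ev.Token0Address.IsZero(), ev.Token1Address.IsZero())
	}
	return nil
}

func main() {
	bad := &SwapEvent{} // both tokens zero
	if err := validateEvent(bad); err != nil {
		log.Printf("skipping event: %v", err) // log and skip, do not submit
	}
	fmt.Println("zero-address detections:", zeroAddressCount.Load())
}
```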
**Validation**:

```bash
# Tail new events; with the fix in place, only non-zero token addresses should appear
tail -f logs/liquidity_events_*.jsonl | jq -r '.token0Address, .token1Address' | grep -v "0x0000000000000000000000000000000000000000"
```

### Priority 0: Implement Rate Limiting Strategy
**Issue**: 100,709 rate limit errors (429 Too Many Requests)
**Impact**: Service degradation, failed API calls, incomplete data

#### Short-Term Fix (Immediate)
**File**: `internal/config/config.go` and `pkg/arbitrum/connection.go`

```go
// rate is golang.org/x/time/rate
type RateLimiter struct {
	limiter    *rate.Limiter
	maxRetries int
	backoff    time.Duration
}

func NewRateLimiter(rps int, burst int) *RateLimiter {
	return &RateLimiter{
		limiter:    rate.NewLimiter(rate.Limit(rps), burst),
		maxRetries: 3,
		backoff:    time.Second,
	}
}

func (rl *RateLimiter) Do(ctx context.Context, fn func() error) error {
	for attempt := 0; attempt <= rl.maxRetries; attempt++ {
		// Wait for a rate limit token
		if err := rl.limiter.Wait(ctx); err != nil {
			return fmt.Errorf("rate limiter error: %w", err)
		}

		err := fn()
		if err == nil {
			return nil
		}

		// Check whether it is a rate limit error
		if strings.Contains(err.Error(), "429") || strings.Contains(err.Error(), "Too Many Requests") {
			backoff := rl.backoff * time.Duration(1<<attempt) // Exponential backoff
			log.Printf("Rate limited, backing off for %v (attempt %d/%d)", backoff, attempt+1, rl.maxRetries)
			time.Sleep(backoff)
			continue
		}

		return err // Non-rate-limit error
	}

	return fmt.Errorf("max retries exceeded")
}
```

**Configuration**:
```yaml
# config/arbitrum_production.yaml
rpc:
  rate_limit:
    requests_per_second: 10  # Conservative limit
    burst: 20
    max_retries: 3
    backoff_seconds: 1
```

**Apply to all RPC calls**:
```go
// Example usage
err := rateLimiter.Do(ctx, func() error {
	_, err := client.BlockByNumber(ctx, blockNum)
	return err
})
```

#### Long-Term Fix (48 hours)
**Upgrade the RPC Provider**:
1. **Option A**: Purchase a paid Chainstack plan with higher RPS limits
2. **Option B**: Add multiple RPC providers with load balancing
3. **Option C**: Run a local Arbitrum archive node

**Recommended Multi-Provider Setup**:
```go
type RPCProvider struct {
	Name     string
	Endpoint string
	RPS      int
	Priority int
}

var providers = []RPCProvider{
	{Name: "Chainstack", Endpoint: "wss://arbitrum-mainnet.core.chainstack.com/...", RPS: 25, Priority: 1},
	{Name: "Alchemy", Endpoint: "wss://arb-mainnet.g.alchemy.com/v2/YOUR_KEY", RPS: 50, Priority: 2},
	{Name: "Infura", Endpoint: "wss://arbitrum-mainnet.infura.io/ws/v3/YOUR_KEY", RPS: 50, Priority: 3},
	{Name: "Fallback", Endpoint: "https://arb1.arbitrum.io/rpc", RPS: 5, Priority: 4},
}
```
## 🔧 CRITICAL FIXES (24-48 Hours)

### Fix 1: Connection Manager Resilience

**File**: `pkg/arbitrum/connection.go`

**Enhanced Connection Manager**:
```go
type EnhancedConnectionManager struct {
	providers      []RPCProvider
	activeProvider int
	rateLimiters   map[string]*RateLimiter
	healthChecks   map[string]*HealthStatus
	mu             sync.RWMutex
}

type HealthStatus struct {
	LastCheck    time.Time
	IsHealthy    bool
	ErrorCount   int
	SuccessCount int
	Latency      time.Duration
}

func (m *EnhancedConnectionManager) GetClient(ctx context.Context) (*ethclient.Client, error) {
	// Snapshot the provider order under a read lock, then release it so
	// the health updates below can take the write lock without deadlocking.
	m.mu.RLock()
	candidates := m.sortedProviders()
	m.mu.RUnlock()

	// Try providers in priority order
	for _, provider := range candidates {
		health := m.healthChecks[provider.Name]

		// Skip unhealthy providers
		if !health.IsHealthy {
			continue
		}

		// Apply rate limiting
		limiter := m.rateLimiters[provider.Name]
		var client *ethclient.Client

		err := limiter.Do(ctx, func() error {
			c, err := ethclient.DialContext(ctx, provider.Endpoint)
			if err != nil {
				return err
			}
			client = c
			return nil
		})

		if err == nil {
			m.updateHealthSuccess(provider.Name)
			return client, nil
		}

		m.updateHealthFailure(provider.Name, err)
	}

	return nil, fmt.Errorf("all RPC providers unavailable")
}

func (m *EnhancedConnectionManager) StartHealthChecks(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			m.checkAllProviders(ctx)
		}
	}
}
```

**Validation**:
```bash
# Monitor connection switching
LOG_LEVEL=debug ./mev-bot start 2>&1 | grep -i "provider\|connection\|health"
```
### Fix 2: Correct Health Scoring

**File**: `scripts/log-manager.sh:188`

**Current Bug**:
```bash
# Line 188 - unquoted variable causing "[: too many arguments"
if [ $error_rate -gt 10 ]; then
```

**Fixed**:
```bash
# Properly quote variables and handle empty values
if [ -n "$error_rate" ] && [ "$(echo "$error_rate > 10" | bc)" -eq 1 ]; then
    health_status="concerning"
elif [ -n "$error_rate" ] && [ "$(echo "$error_rate > 5" | bc)" -eq 1 ]; then
    health_status="warning"
else
    health_status="healthy"
fi
```

**Enhanced Health Calculation**:
```bash
calculate_health_score() {
    local total_lines=$1
    local error_lines=$2
    local warning_lines=$3
    local rpc_errors=$4
    local zero_addresses=$5

    # Start with 100
    local health_score=100

    # Deduct for the error rate
    local error_rate=$(echo "scale=2; $error_lines * 100 / $total_lines" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $error_rate" | bc)

    # Deduct for RPC failures (every 100 failures = -1 point)
    local rpc_penalty=$(echo "scale=2; $rpc_errors / 100" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $rpc_penalty" | bc)

    # Deduct for zero addresses (each occurrence = -0.01 point)
    local zero_penalty=$(echo "scale=2; $zero_addresses / 100" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $zero_penalty" | bc)

    # Floor at 0
    if [ "$(echo "$health_score < 0" | bc)" -eq 1 ]; then
        health_score=0
    fi

    echo "$health_score"
}
```
### Fix 3: Port Conflict Resolution

**Issue**: Metrics (9090) and Dashboard (8080) port conflicts

**File**: `cmd/mev-bot/main.go`

**Current**:
```go
go startMetricsServer(":9090")
go startDashboard(":8080")
```

**Fixed with Port Checking**:
```go
func startWithPortCheck(service string, preferredPort int, handler http.Handler) error {
	port := preferredPort
	maxAttempts := 5

	for attempt := 0; attempt < maxAttempts; attempt++ {
		addr := fmt.Sprintf(":%d", port)
		server := &http.Server{
			Addr:    addr,
			Handler: handler,
		}

		listener, err := net.Listen("tcp", addr)
		if err != nil {
			log.Printf("%s port %d in use, trying %d", service, port, port+1)
			port++
			continue
		}

		log.Printf("✅ %s started on port %d", service, port)
		return server.Serve(listener)
	}

	return fmt.Errorf("failed to start %s after %d attempts", service, maxAttempts)
}

// Usage
go startWithPortCheck("Metrics", 9090, metricsHandler)
go startWithPortCheck("Dashboard", 8080, dashboardHandler)
```

**Alternative - Environment Variables**:
```go
metricsPort := os.Getenv("METRICS_PORT")
if metricsPort == "" {
	metricsPort = "9090"
}

dashboardPort := os.Getenv("DASHBOARD_PORT")
if dashboardPort == "" {
	dashboardPort = "8080"
}
```
## 📋 HIGH PRIORITY FIXES (48-72 Hours)

### Fix 4: Implement Request Caching

**Why**: Reduce RPC calls by 60-80%

**File**: `pkg/arbitrum/pool_cache.go` (new)

```go
type PoolDataCache struct {
	cache *cache.Cache // Using patrickmn/go-cache
	mu    sync.RWMutex
}

type CachedPoolData struct {
	Token0    common.Address
	Token1    common.Address
	Fee       *big.Int
	Liquidity *big.Int
	FetchedAt time.Time
}

func NewPoolDataCache() *PoolDataCache {
	return &PoolDataCache{
		cache: cache.New(5*time.Minute, 10*time.Minute),
	}
}

func (c *PoolDataCache) GetPoolData(ctx context.Context, poolAddr common.Address, fetcher func() (*CachedPoolData, error)) (*CachedPoolData, error) {
	key := poolAddr.Hex()

	// Check the cache first
	if data, found := c.cache.Get(key); found {
		return data.(*CachedPoolData), nil
	}

	// Cache miss - fetch from RPC
	data, err := fetcher()
	if err != nil {
		return nil, err
	}

	// Store in the cache
	c.cache.Set(key, data, cache.DefaultExpiration)

	return data, nil
}
```

**Usage**:
```go
poolData, err := poolCache.GetPoolData(ctx, poolAddress, func() (*CachedPoolData, error) {
	// This only runs on a cache miss
	token0, _ := poolContract.Token0(nil)
	token1, _ := poolContract.Token1(nil)
	fee, _ := poolContract.Fee(nil)
	liquidity, _ := poolContract.Liquidity(nil)

	return &CachedPoolData{
		Token0:    token0,
		Token1:    token1,
		Fee:       fee,
		Liquidity: liquidity,
		FetchedAt: time.Now(),
	}, nil
})
```

### Fix 5: Batch RPC Requests

**File**: `pkg/arbitrum/batch_requests.go` (new)

```go
type BatchRequest struct {
	calls []rpc.BatchElem
	mu    sync.Mutex
}

func (b *BatchRequest) AddPoolDataRequest(poolAddr common.Address) int {
	b.mu.Lock()
	defer b.mu.Unlock()

	idx := len(b.calls)

	// Add all pool data calls in one batch
	b.calls = append(b.calls,
		rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* token0 call */}},
		rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* token1 call */}},
		rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* fee call */}},
		rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* liquidity call */}},
	)

	return idx
}

func (b *BatchRequest) Execute(client *rpc.Client) error {
	b.mu.Lock()
	defer b.mu.Unlock()

	if len(b.calls) == 0 {
		return nil
	}

	err := client.BatchCall(b.calls)
	if err != nil {
		return fmt.Errorf("batch call failed: %w", err)
	}

	// Check individual results
	for i, call := range b.calls {
		if call.Error != nil {
			log.Printf("Batch call %d failed: %v", i, call.Error)
		}
	}

	return nil
}
```

**Impact**: Collapses the 4 separate RPC calls per pool into a single batched request
- **Before**: 100 pools × 4 calls = 400 RPC requests
- **After**: 1 batch request carrying all 400 sub-calls
### Fix 6: Improve Arbitrage Profitability Calculation

**File**: `pkg/arbitrage/detection_engine.go`

**Issues**:
1. Gas cost estimation is too high
2. Slippage tolerance is too conservative
3. Zero amounts cause invalid calculations

**Enhanced Calculation**:
```go
type ProfitCalculator struct {
	gasPrice          *big.Int
	priorityFee       *big.Int
	slippageBps       int64 // Basis points (100 = 1%)
	minProfitUSD      float64
	executionGasLimit uint64
}

func (pc *ProfitCalculator) CalculateNetProfit(opp *Opportunity) (*ProfitEstimate, error) {
	// Validate inputs
	if opp.AmountIn.Cmp(big.NewInt(0)) == 0 || opp.AmountOut.Cmp(big.NewInt(0)) == 0 {
		return nil, fmt.Errorf("zero amount detected: amountIn=%s, amountOut=%s",
			opp.AmountIn.String(), opp.AmountOut.String())
	}

	// Calculate gross profit in ETH
	grossProfit := new(big.Int).Sub(opp.AmountOut, opp.AmountIn)
	grossProfitETH := new(big.Float).Quo(
		new(big.Float).SetInt(grossProfit),
		new(big.Float).SetInt(big.NewInt(1e18)),
	)

	// Realistic gas estimation
	gasLimit := pc.executionGasLimit // e.g., 300,000
	if opp.IsMultiHop {
		gasLimit *= 2 // Multi-hop needs more gas
	}

	gasPrice := new(big.Int).Add(pc.gasPrice, pc.priorityFee)
	gasCost := new(big.Int).Mul(gasPrice, big.NewInt(int64(gasLimit)))
	gasCostETH := new(big.Float).Quo(
		new(big.Float).SetInt(gasCost),
		new(big.Float).SetInt(big.NewInt(1e18)),
	)

	// Apply the slippage tolerance
	slippageMultiplier := float64(10000-pc.slippageBps) / 10000.0
	grossProfitWithSlippage, _ := new(big.Float).Mul(
		grossProfitETH,
		big.NewFloat(slippageMultiplier),
	).Float64()

	gasCostFloat, _ := gasCostETH.Float64()
	netProfitETH := grossProfitWithSlippage - gasCostFloat

	// Convert to USD
	ethPriceUSD := pc.getETHPrice() // From an oracle or cache
	netProfitUSD := netProfitETH * ethPriceUSD

	return &ProfitEstimate{
		GrossProfitETH:  grossProfitETH,
		GasCostETH:      gasCostETH,
		NetProfitETH:    big.NewFloat(netProfitETH),
		NetProfitUSD:    netProfitUSD,
		IsExecutable:    netProfitUSD >= pc.minProfitUSD,
		SlippageApplied: pc.slippageBps,
		GasLimitUsed:    gasLimit,
	}, nil
}
```

**Configuration**:
```yaml
# config/arbitrum_production.yaml
arbitrage:
  profit_calculation:
    min_profit_usd: 5.0      # Minimum $5 profit
    slippage_bps: 50         # 0.5% slippage tolerance
    gas_limit: 300000        # Base gas limit
    priority_fee_gwei: 0.1   # Additional priority fee
```
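The `slippage_bps` setting maps directly to the multiplier used in `CalculateNetProfit` (`(10000 - bps) / 10000`). As a standalone helper with the configured values plugged in:

```go
package main

import "fmt"

// applySlippageBps discounts a profit figure by a tolerance expressed in
// basis points (50 bps = 0.5%), matching the slippageMultiplier
// computation in CalculateNetProfit.
func applySlippageBps(grossProfit float64, bps int64) float64 {
	return grossProfit * float64(10000-bps) / 10000.0
}

func main() {
	// $100 gross profit at the configured 50 bps tolerance
	fmt.Println(applySlippageBps(100.0, 50)) // 99.5
}
```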

## 🔄 OPERATIONAL IMPROVEMENTS (Week 1)

### Improvement 1: Automated Log Rotation

**File**: `/etc/logrotate.d/mev-bot` (system config)

```
/home/administrator/projects/mev-beta/logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0600 administrator administrator
    size 50M
    postrotate
        /usr/bin/systemctl reload mev-bot.service > /dev/null 2>&1 || true
    endscript
}
```

### Improvement 2: Real-Time Alerting

**File**: `pkg/monitoring/alerts.go` (new)

```go
type AlertManager struct {
	slackWebhook string
	emailSMTP    string
	thresholds   AlertThresholds
	alertState   map[string]time.Time
	mu           sync.Mutex
}

type AlertThresholds struct {
	ErrorRatePercent     float64 // Alert if >10%
	RPCFailuresPerMin    int     // Alert if >100/min
	ZeroAddressesPerHour int     // Alert if >10/hour
	NoOpportunitiesHours int     // Alert if no opportunities for N hours
}

func (am *AlertManager) CheckAndAlert(metrics *SystemMetrics) {
	am.mu.Lock()
	defer am.mu.Unlock()

	// Error rate alert
	if metrics.ErrorRate > am.thresholds.ErrorRatePercent {
		if am.shouldAlert("high_error_rate", 5*time.Minute) {
			am.sendAlert("🚨 HIGH ERROR RATE", fmt.Sprintf(
				"Error rate: %.2f%% (threshold: %.2f%%)\nTotal errors: %d",
				metrics.ErrorRate, am.thresholds.ErrorRatePercent, metrics.TotalErrors,
			))
		}
	}

	// RPC failure alert (guard against division by zero in the first minute)
	elapsedMin := int(time.Since(metrics.StartTime).Minutes())
	if elapsedMin < 1 {
		elapsedMin = 1
	}
	rpcFailuresPerMin := metrics.RPCFailures / elapsedMin
	if rpcFailuresPerMin > am.thresholds.RPCFailuresPerMin {
		if am.shouldAlert("rpc_failures", 10*time.Minute) {
			am.sendAlert("⚠️ RPC FAILURES", fmt.Sprintf(
				"RPC failures: %d/min (threshold: %d/min)\nCheck RPC providers and rate limits",
				rpcFailuresPerMin, am.thresholds.RPCFailuresPerMin,
			))
		}
	}

	// Zero address alert
	if metrics.ZeroAddressesLastHour > am.thresholds.ZeroAddressesPerHour {
		if am.shouldAlert("zero_addresses", 1*time.Hour) {
			am.sendAlert("❌ ZERO ADDRESS CONTAMINATION", fmt.Sprintf(
				"Zero addresses detected: %d in the last hour\nData integrity compromised",
				metrics.ZeroAddressesLastHour,
			))
		}
	}
}

func (am *AlertManager) shouldAlert(alertType string, cooldown time.Duration) bool {
	lastAlert, exists := am.alertState[alertType]
	if !exists || time.Since(lastAlert) > cooldown {
		am.alertState[alertType] = time.Now()
		return true
	}
	return false
}
```

### Improvement 3: Enhanced Logging with Context

**File**: All files using logging

**Current**:
```go
log.Printf("[ERROR] Failed to get pool data: %v", err)
```

**Enhanced**:
```go
import "log/slog"

logger := slog.With(
	"component", "pool_fetcher",
	"pool", poolAddress.Hex(),
	"block", blockNumber,
)

logger.Error("failed to get pool data",
	"error", err,
	"attempt", attempt,
	"rpc_endpoint", currentEndpoint,
)
```

**Benefits**:
- Structured logging for easy parsing
- Automatic context propagation
- Better filtering and analysis
- JSON output for log aggregation

## 📊 MONITORING & VALIDATION

### Validation Checklist

After implementing the fixes, validate each one:

**1. WebSocket Connection Fix**
- ✅ No "unsupported protocol scheme wss" errors in logs
- ✅ Successful WebSocket connection messages
- ✅ Block subscription working

**2. Zero Address Fix**
- ✅ No zero addresses in `liquidity_events_*.jsonl`
- ✅ Valid token addresses in all events
- ✅ Factory addresses are non-zero

**3. Rate Limiting Fix**
- ✅ "Too Many Requests" errors reduced by >90%
- ✅ Successful RPC calls >95%
- ✅ Automatic backoff observable in logs

**4. Connection Manager Fix**
- ✅ Automatic provider failover working
- ✅ Health checks passing
- ✅ All providers being utilized

**5. Health Scoring Fix**
- ✅ Health score reflects the actual system state
- ✅ Score <80 when errors >20%
- ✅ Alerts triggering at the correct thresholds

### Performance Metrics to Track

**Before Fixes**:
- Error Rate: 81.1%
- RPC Failures: 100,709
- Zero Addresses: 5,462
- Successful Arbitrages: 0
- Opportunities Rejected: 100%

**Target After Fixes**:
- Error Rate: <5%
- RPC Failures: <100/day
- Zero Addresses: 0
- Successful Arbitrages: >0
- Opportunities Rejected: <80%

### Test Commands

```bash
# Comprehensive system test
./scripts/comprehensive-test.sh

# Individual component tests
go test ./pkg/arbitrum/... -v
go test ./pkg/arbitrage/... -v
go test ./pkg/monitor/... -v

# Integration test with real data
LOG_LEVEL=debug timeout 60 ./mev-bot start 2>&1 | tee test-run.log

# Analyze the test run
./scripts/log-manager.sh analyze
./scripts/log-manager.sh health
```
## 🎯 IMPLEMENTATION ROADMAP

### Day 1 (Hours 0-24)
- [ ] Fix WebSocket connection (2 hours)
- [ ] Fix zero address parsing (3 hours)
- [ ] Implement basic rate limiting (2 hours)
- [ ] Fix health scoring script (1 hour)
- [ ] Test and validate (2 hours)
- [ ] Deploy to staging (1 hour)

### Day 2 (Hours 24-48)
- [ ] Enhanced connection manager (4 hours)
- [ ] Fix port conflicts (1 hour)
- [ ] Add multiple RPC providers (2 hours)
- [ ] Implement request caching (3 hours)
- [ ] Full system testing (2 hours)

### Day 3 (Hours 48-72)
- [ ] Batch RPC requests (3 hours)
- [ ] Improve profit calculation (2 hours)
- [ ] Add real-time alerting (2 hours)
- [ ] Enhanced logging (2 hours)
- [ ] Production deployment (3 hours)

### Week 1 (Days 4-7)
- [ ] Log rotation automation
- [ ] Monitoring dashboard improvements
- [ ] Performance optimization
- [ ] Documentation updates
- [ ] Team training on new systems

## 🔒 RISK MITIGATION

### Deployment Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| WebSocket fix breaks HTTP fallback | Medium | High | Keep the HTTP client as a fallback |
| Rate limiting too aggressive | Medium | Medium | Make limits configurable |
| Cache serves stale data | Low | Medium | Add cache invalidation on errors |
| New errors introduced | Medium | High | Comprehensive testing + rollback plan |

### Rollback Plan

If issues occur after deployment:

```bash
# Quick rollback
git revert HEAD
make build
systemctl restart mev-bot

# Restore from backup
cp backups/mev-bot-backup-YYYYMMDD ./mev-bot
systemctl restart mev-bot

# Check rollback success
./scripts/log-manager.sh status
tail -f logs/mev_bot.log
```

### Gradual Rollout

1. **Staging** (Day 1): Deploy all fixes, test for 24 hours
2. **Canary** (Day 2): Deploy to 10% of production capacity
3. **Production** (Day 3): Full production deployment
4. **Monitoring** (Week 1): Intensive monitoring and tuning

## 📚 ADDITIONAL RESOURCES

### Documentation to Update
- [ ] CLAUDE.md - Add new configuration requirements
- [ ] README.md - Update deployment instructions
- [ ] TODO_AUDIT_FIX.md - Mark completed items
- [ ] API.md - Document new monitoring endpoints

### Code Reviews Required
- WebSocket connection changes
- Zero address validation logic
- Rate limiting implementation
- Connection manager enhancements

### Testing Requirements
- Unit tests for all new functions
- Integration tests for RPC connections
- Load testing for rate limiting
- End-to-end arbitrage execution test

---

**Document Version**: 1.0
**Last Updated**: 2025-10-30
**Review Required**: After each fix implementation
**Owner**: MEV Bot Development Team