# Critical Fixes and Recommendations

**Date**: 2025-10-30
**Priority**: URGENT - Production System Failure
**Related**: LOG_ANALYSIS_COMPREHENSIVE_REPORT_20251030.md

## 🚨 IMMEDIATE ACTIONS (Next 24 Hours)

### Priority 0: Fix WebSocket Connection

**Issue**: 9,065 "unsupported protocol scheme wss" errors
**Impact**: Cannot connect to Arbitrum network via WebSocket

#### Root Cause

The code uses an HTTP client (`http.Post`) to connect to WebSocket URLs (`wss://`).

#### Fix Required

**File**: `pkg/arbitrum/connection.go` or `pkg/monitor/concurrent.go`

**Current (Incorrect)**:
```go
// Somewhere in connection initialization
client, err := rpc.Dial(wsEndpoint)      // or similar HTTP-based call
resp, err := http.Post(wsEndpoint, ...)  // WRONG for WebSocket
```

**Fixed (Correct)**:
```go
import (
    "fmt"

    "github.com/ethereum/go-ethereum/ethclient"
)

// For WebSocket connections
func connectWebSocket(wsURL string) (*ethclient.Client, error) {
    client, err := ethclient.Dial(wsURL)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to %s: %w", wsURL, err)
    }
    return client, nil
}

// For HTTP connections (fallback)
func connectHTTP(httpURL string) (*ethclient.Client, error) {
    client, err := ethclient.Dial(httpURL)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to %s: %w", httpURL, err)
    }
    return client, nil
}
```

**Implementation Steps**:
1. Locate the RPC client initialization code
2. Check whether it uses `rpc.Dial()` or `ethclient.Dial()`
3. Ensure WebSocket URLs are passed to `ethclient.Dial()` directly (a fallback sketch follows the validation command below)
4. Remove any HTTP POST attempts to WebSocket endpoints
5. Test the connection with: `timeout 30 ./mev-bot start`

**Validation**:
```bash
# Should see a successful WebSocket connection
LOG_LEVEL=debug ./mev-bot start 2>&1 | grep -i "websocket\|wss"
```
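To cover step 3 and the deployment risk noted later ("WebSocket fix breaks HTTP fallback"), one option is a dial helper that inspects the URL scheme, tries the WebSocket endpoint first, and falls back to HTTP. This is a minimal sketch; `connectWithFallback` and its two-URL signature are illustrative, not existing code in this repository.

```go
import (
    "context"
    "fmt"
    "log"
    "strings"

    "github.com/ethereum/go-ethereum/ethclient"
)

// connectWithFallback dials the WebSocket endpoint first and falls back to the
// HTTP endpoint if the wss dial fails.
func connectWithFallback(ctx context.Context, wsURL, httpURL string) (*ethclient.Client, error) {
    if strings.HasPrefix(wsURL, "ws://") || strings.HasPrefix(wsURL, "wss://") {
        if client, err := ethclient.DialContext(ctx, wsURL); err == nil {
            return client, nil
        } else {
            log.Printf("websocket dial failed (%v), falling back to HTTP", err)
        }
    }

    client, err := ethclient.DialContext(ctx, httpURL)
    if err != nil {
        return nil, fmt.Errorf("both websocket and HTTP dial failed: %w", err)
    }
    return client, nil
}
```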
### Priority 0: Fix Zero Address Parsing

**Issue**: 100% of liquidity events contain zero addresses
**Impact**: Invalid event data, corrupted arbitrage detection

#### Root Cause

Token address extraction from transaction logs returns zero addresses instead of the actual token addresses.

#### Fix Required

**File**: `pkg/arbitrum/abi_decoder.go`

**Current Issue**: The token extraction logic is likely doing something like:
```go
// WRONG - returning zero address on extraction failure
func extractTokenAddress(log types.Log) common.Address {
    // If parsing fails, returns common.Address{} which is 0x000...
    return common.Address{}
}
```

**Fixed Implementation**:
```go
func extractTokenAddress(log types.Log, topicIndex int) (common.Address, error) {
    if len(log.Topics) <= topicIndex {
        return common.Address{}, fmt.Errorf("topic index %d out of range", topicIndex)
    }

    address := common.BytesToAddress(log.Topics[topicIndex].Bytes())

    // CRITICAL: Validate address is not zero
    if address == (common.Address{}) {
        return common.Address{}, fmt.Errorf("extracted zero address from topic %d", topicIndex)
    }

    return address, nil
}

// For event parsing
func parseSwapEvent(log types.Log) (*SwapEvent, error) {
    // Extract token addresses from the pool
    pool, err := getPoolContract(log.Address)
    if err != nil {
        return nil, fmt.Errorf("failed to get pool: %w", err)
    }

    token0, err := pool.Token0(nil)
    if err != nil {
        return nil, fmt.Errorf("failed to get token0: %w", err)
    }

    token1, err := pool.Token1(nil)
    if err != nil {
        return nil, fmt.Errorf("failed to get token1: %w", err)
    }

    // Validate addresses
    if token0 == (common.Address{}) || token1 == (common.Address{}) {
        return nil, fmt.Errorf("zero address detected: token0=%s, token1=%s",
            token0.Hex(), token1.Hex())
    }

    return &SwapEvent{
        Token0Address: token0,
        Token1Address: token1,
        // ...
    }, nil
}
```

**Additional Checks Needed**:
1. Add validation before event submission
2. Log and skip events with zero addresses
3. Add metrics for zero address detections
4. Review pool contract call logic

**Validation**:
```bash
# Stream new events; only non-zero token addresses should appear
tail -f logs/liquidity_events_*.jsonl | jq -r '.token0Address, .token1Address' | grep -v "0x0000000000000000000000000000000000000000"
```

### Priority 0: Implement Rate Limiting Strategy

**Issue**: 100,709 rate limit errors (429 Too Many Requests)
**Impact**: Service degradation, failed API calls, incomplete data

#### Short-Term Fix (Immediate)

**File**: `internal/config/config.go` and `pkg/arbitrum/connection.go`

```go
import (
    "context"
    "fmt"
    "strings"
    "time"

    "golang.org/x/time/rate"
)

type RateLimiter struct {
    limiter    *rate.Limiter
    maxRetries int
    backoff    time.Duration
}

func NewRateLimiter(rps int, burst int) *RateLimiter {
    return &RateLimiter{
        limiter:    rate.NewLimiter(rate.Limit(rps), burst),
        maxRetries: 3,
        backoff:    time.Second,
    }
}

func (rl *RateLimiter) Do(ctx context.Context, fn func() error) error {
    for attempt := 0; attempt <= rl.maxRetries; attempt++ {
        // Wait for a rate limit token
        if err := rl.limiter.Wait(ctx); err != nil {
            return fmt.Errorf("rate limiter error: %w", err)
        }

        err := fn()
        if err == nil {
            return nil
        }

        // Check if it's a rate limit error
        if strings.Contains(err.Error(), "429") || strings.Contains(err.Error(), "Too Many Requests") {
            // Exponential backoff: base * 2^attempt
            backoff := rl.backoff * time.Duration(1<<attempt)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
            }
            continue
        }

        // Not a rate limit error - fail immediately
        return err
    }

    return fmt.Errorf("rate limit retries exhausted after %d attempts", rl.maxRetries)
}
```

### Fix 1: Enhanced Connection Manager

**Goal**: Automatic failover between RPC providers, periodic health checks, and utilization of all configured providers (the acceptance criteria in Validation Checklist item 4 below). A sketch of one possible manager follows.
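This is a minimal sketch built on go-ethereum's `ethclient`; the `ConnectionManager` and `Provider` types and their method names are illustrative, not the project's actual types.

```go
import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/ethereum/go-ethereum/ethclient"
)

// Provider wraps one RPC endpoint and its current health state.
type Provider struct {
    URL     string
    Client  *ethclient.Client
    Healthy bool
}

// ConnectionManager rotates across multiple providers and fails over
// when the active one stops responding.
type ConnectionManager struct {
    mu        sync.RWMutex
    providers []*Provider
    active    int
}

func NewConnectionManager(urls []string) (*ConnectionManager, error) {
    cm := &ConnectionManager{}
    for _, u := range urls {
        client, err := ethclient.Dial(u)
        if err != nil {
            // Keep the provider registered but marked unhealthy.
            cm.providers = append(cm.providers, &Provider{URL: u, Healthy: false})
            continue
        }
        cm.providers = append(cm.providers, &Provider{URL: u, Client: client, Healthy: true})
    }
    if len(cm.providers) == 0 {
        return nil, fmt.Errorf("no RPC providers configured")
    }
    return cm, nil
}

// Active returns the currently selected healthy client, rotating past
// unhealthy providers if needed.
func (cm *ConnectionManager) Active() (*ethclient.Client, error) {
    cm.mu.RLock()
    defer cm.mu.RUnlock()
    for i := 0; i < len(cm.providers); i++ {
        p := cm.providers[(cm.active+i)%len(cm.providers)]
        if p.Healthy && p.Client != nil {
            return p.Client, nil
        }
    }
    return nil, fmt.Errorf("no healthy RPC providers available")
}

// HealthCheck probes every provider with a cheap call and updates its state.
func (cm *ConnectionManager) HealthCheck(ctx context.Context) {
    cm.mu.Lock()
    defer cm.mu.Unlock()
    for _, p := range cm.providers {
        if p.Client == nil {
            client, err := ethclient.Dial(p.URL)
            if err != nil {
                p.Healthy = false
                continue
            }
            p.Client = client
        }
        probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
        _, err := p.Client.BlockNumber(probeCtx)
        cancel()
        p.Healthy = err == nil
    }
    // Advance the active index to the first healthy provider.
    for i, p := range cm.providers {
        if p.Healthy {
            cm.active = i
            break
        }
    }
}
```

Run `HealthCheck` on a ticker and call `Active()` before each batch of RPC work so failover happens without restarting the bot.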
**Validation**:
```bash
# Provider failover and health checks should be visible in the logs
LOG_LEVEL=debug ./mev-bot start 2>&1 | grep -i "provider\|connection\|health"
```

### Fix 2: Correct Health Scoring

**File**: `scripts/log-manager.sh:188`

**Current Bug**:
```bash
# Line 188 - unquoted variable causing "[: too many arguments"
if [ $error_rate -gt 10 ]; then
```

**Fixed**:
```bash
# Properly quote variables and handle empty values
if [ -n "$error_rate" ] && [ "$(echo "$error_rate > 10" | bc)" -eq 1 ]; then
    health_status="concerning"
elif [ -n "$error_rate" ] && [ "$(echo "$error_rate > 5" | bc)" -eq 1 ]; then
    health_status="warning"
else
    health_status="healthy"
fi
```

**Enhanced Health Calculation**:
```bash
calculate_health_score() {
    local total_lines=$1
    local error_lines=$2
    local warning_lines=$3
    local rpc_errors=$4
    local zero_addresses=$5

    # Start with 100
    local health_score=100

    # Deduct for error rate
    local error_rate=$(echo "scale=2; $error_lines * 100 / $total_lines" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $error_rate" | bc)

    # Deduct for RPC failures (each 100 failures = -1 point)
    local rpc_penalty=$(echo "scale=2; $rpc_errors / 100" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $rpc_penalty" | bc)

    # Deduct for zero addresses (each occurrence = -0.01 point)
    local zero_penalty=$(echo "scale=2; $zero_addresses / 100" | bc -l 2>/dev/null || echo 0)
    health_score=$(echo "$health_score - $zero_penalty" | bc)

    # Floor at 0
    if [ "$(echo "$health_score < 0" | bc)" -eq 1 ]; then
        health_score=0
    fi

    echo "$health_score"
}
```

### Fix 3: Port Conflict Resolution

**Issue**: Metrics (9090) and Dashboard (8080) port conflicts

**File**: `cmd/mev-bot/main.go`

**Current**:
```go
go startMetricsServer(":9090")
go startDashboard(":8080")
```

**Fixed with Port Checking**:
```go
func startWithPortCheck(service string, preferredPort int, handler http.Handler) error {
    port := preferredPort
    maxAttempts := 5

    for attempt := 0; attempt < maxAttempts; attempt++ {
        addr := fmt.Sprintf(":%d", port)
        server := &http.Server{
            Addr:    addr,
            Handler: handler,
        }

        listener, err := net.Listen("tcp", addr)
        if err != nil {
            log.Printf("%s port %d in use, trying %d", service, port, port+1)
            port++
            continue
        }

        log.Printf("✅ %s started on port %d", service, port)
        return server.Serve(listener)
    }

    return fmt.Errorf("failed to start %s after %d attempts", service, maxAttempts)
}

// Usage
go startWithPortCheck("Metrics", 9090, metricsHandler)
go startWithPortCheck("Dashboard", 8080, dashboardHandler)
```

**Alternative - Environment Variables**:
```go
metricsPort := os.Getenv("METRICS_PORT")
if metricsPort == "" {
    metricsPort = "9090"
}

dashboardPort := os.Getenv("DASHBOARD_PORT")
if dashboardPort == "" {
    dashboardPort = "8080"
}
```

## 📋 HIGH PRIORITY FIXES (48-72 Hours)

### Fix 4: Implement Request Caching

**Why**: Reduce RPC calls by 60-80%

**File**: `pkg/arbitrum/pool_cache.go` (new)

```go
type PoolDataCache struct {
    cache *cache.Cache // Using patrickmn/go-cache
    mu    sync.RWMutex
}

type CachedPoolData struct {
    Token0    common.Address
    Token1    common.Address
    Fee       *big.Int
    Liquidity *big.Int
    FetchedAt time.Time
}

func NewPoolDataCache() *PoolDataCache {
    return &PoolDataCache{
        cache: cache.New(5*time.Minute, 10*time.Minute),
    }
}

func (c *PoolDataCache) GetPoolData(ctx context.Context, poolAddr common.Address, fetcher func() (*CachedPoolData, error)) (*CachedPoolData, error) {
    key := poolAddr.Hex()

    // Check the cache first
    if data, found := c.cache.Get(key); found {
        return data.(*CachedPoolData), nil
    }

    // Cache miss - fetch from RPC
    data, err := fetcher()
    if err != nil {
        return nil, err
    }

    // Store in the cache
    c.cache.Set(key, data, cache.DefaultExpiration)
    return data, nil
}
```

**Usage**:
```go
poolData, err := poolCache.GetPoolData(ctx, poolAddress, func() (*CachedPoolData, error) {
    // This only runs on a cache miss
    token0, _ := poolContract.Token0(nil)
    token1, _ := poolContract.Token1(nil)
    fee, _ := poolContract.Fee(nil)
    liquidity, _ := poolContract.Liquidity(nil)

    return &CachedPoolData{
        Token0:    token0,
        Token1:    token1,
        Fee:       fee,
        Liquidity: liquidity,
        FetchedAt: time.Now(),
    }, nil
})
```

### Fix 5: Batch RPC Requests

**File**: `pkg/arbitrum/batch_requests.go` (new)

```go
type BatchRequest struct {
    calls []rpc.BatchElem
    mu    sync.Mutex
}

func (b *BatchRequest) AddPoolDataRequest(poolAddr common.Address) int {
    b.mu.Lock()
    defer b.mu.Unlock()

    idx := len(b.calls)

    // Add all pool data calls in one batch
    b.calls = append(b.calls,
        rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* token0 call */}},
        rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* token1 call */}},
        rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* fee call */}},
        rpc.BatchElem{Method: "eth_call", Args: []interface{}{/* liquidity call */}},
    )

    return idx
}

func (b *BatchRequest) Execute(client *rpc.Client) error {
    b.mu.Lock()
    defer b.mu.Unlock()

    if len(b.calls) == 0 {
        return nil
    }

    err := client.BatchCall(b.calls)
    if err != nil {
        return fmt.Errorf("batch call failed: %w", err)
    }

    // Check individual results
    for i, call := range b.calls {
        if call.Error != nil {
            log.Printf("Batch call %d failed: %v", i, call.Error)
        }
    }

    return nil
}
```

**Impact**: Reduces 4 separate RPC calls per pool to a single batch call.
- **Before**: 100 pools × 4 calls = 400 RPC requests
- **After**: 100 pools in 1 batch = 1 RPC request (with 400 sub-calls)
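The `/* token0 call */` placeholders above still need real `eth_call` parameters. One way to fill them in is shown below as a sketch (not the project's code; `callElem` and `fetchTokens` are illustrative names): each element carries a 4-byte selector derived from the function signature, and a `hexutil.Bytes` result that receives the raw 32-byte return value.

```go
import (
    "fmt"

    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/common/hexutil"
    "github.com/ethereum/go-ethereum/crypto"
    "github.com/ethereum/go-ethereum/rpc"
)

// callElem builds a single eth_call batch element for a no-argument view
// function such as token0(). The selector is the first 4 bytes of the
// keccak256 hash of the signature, so no full ABI is needed for these getters.
func callElem(pool common.Address, signature string, result *hexutil.Bytes) rpc.BatchElem {
    selector := crypto.Keccak256([]byte(signature))[:4]
    return rpc.BatchElem{
        Method: "eth_call",
        Args: []interface{}{
            map[string]interface{}{
                "to":   pool,
                "data": hexutil.Bytes(selector),
            },
            "latest",
        },
        Result: result,
    }
}

// fetchTokens shows how decoded results come back: each Result holds the raw
// 32-byte return value, and the address occupies its last 20 bytes.
func fetchTokens(client *rpc.Client, pool common.Address) (common.Address, common.Address, error) {
    var raw0, raw1 hexutil.Bytes
    batch := []rpc.BatchElem{
        callElem(pool, "token0()", &raw0),
        callElem(pool, "token1()", &raw1),
    }
    if err := client.BatchCall(batch); err != nil {
        return common.Address{}, common.Address{}, fmt.Errorf("batch call failed: %w", err)
    }
    for _, elem := range batch {
        if elem.Error != nil {
            return common.Address{}, common.Address{}, elem.Error
        }
    }
    return common.BytesToAddress(raw0), common.BytesToAddress(raw1), nil
}
```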
### Fix 6: Improve Arbitrage Profitability Calculation

**File**: `pkg/arbitrage/detection_engine.go`

**Issues**:
1. Gas cost estimation too high
2. Slippage tolerance too conservative
3. Zero amounts causing invalid calculations

**Enhanced Calculation**:
```go
type ProfitCalculator struct {
    gasPrice          *big.Int
    priorityFee       *big.Int
    slippageBps       int64 // Basis points (100 = 1%)
    minProfitUSD      float64
    executionGasLimit uint64
}

func (pc *ProfitCalculator) CalculateNetProfit(opp *Opportunity) (*ProfitEstimate, error) {
    // Validate inputs
    if opp.AmountIn.Cmp(big.NewInt(0)) == 0 || opp.AmountOut.Cmp(big.NewInt(0)) == 0 {
        return nil, fmt.Errorf("zero amount detected: amountIn=%s, amountOut=%s",
            opp.AmountIn.String(), opp.AmountOut.String())
    }

    // Calculate gross profit in ETH
    grossProfit := new(big.Int).Sub(opp.AmountOut, opp.AmountIn)
    grossProfitETH := new(big.Float).Quo(
        new(big.Float).SetInt(grossProfit),
        new(big.Float).SetInt(big.NewInt(1e18)),
    )

    // Realistic gas estimation
    gasLimit := pc.executionGasLimit // e.g., 300,000
    if opp.IsMultiHop {
        gasLimit *= 2 // Multi-hop needs more gas
    }

    gasPrice := new(big.Int).Add(pc.gasPrice, pc.priorityFee)
    gasCost := new(big.Int).Mul(gasPrice, big.NewInt(int64(gasLimit)))
    gasCostETH := new(big.Float).Quo(
        new(big.Float).SetInt(gasCost),
        new(big.Float).SetInt(big.NewInt(1e18)),
    )

    // Apply slippage tolerance
    slippageMultiplier := float64(10000-pc.slippageBps) / 10000.0
    grossProfitWithSlippage, _ := new(big.Float).Mul(
        grossProfitETH,
        big.NewFloat(slippageMultiplier),
    ).Float64()

    gasCostFloat, _ := gasCostETH.Float64()
    netProfitETH := grossProfitWithSlippage - gasCostFloat

    // Calculate in USD
    ethPriceUSD := pc.getETHPrice() // From oracle or cache
    netProfitUSD := netProfitETH * ethPriceUSD

    return &ProfitEstimate{
        GrossProfitETH:  grossProfitETH,
        GasCostETH:      gasCostETH,
        NetProfitETH:    big.NewFloat(netProfitETH),
        NetProfitUSD:    netProfitUSD,
        IsExecutable:    netProfitUSD >= pc.minProfitUSD,
        SlippageApplied: pc.slippageBps,
        GasLimitUsed:    gasLimit,
    }, nil
}
```

**Configuration**:
```yaml
# config/arbitrum_production.yaml
arbitrage:
  profit_calculation:
    min_profit_usd: 5.0      # Minimum $5 profit
    slippage_bps: 50         # 0.5% slippage tolerance
    gas_limit: 300000        # Base gas limit
    priority_fee_gwei: 0.1   # Additional priority fee
```

## 🔄 OPERATIONAL IMPROVEMENTS (Week 1)

### Improvement 1: Automated Log Rotation

**File**: `/etc/logrotate.d/mev-bot` (system config)

```
/home/administrator/projects/mev-beta/logs/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0600 administrator administrator
    size 50M
    postrotate
        /usr/bin/systemctl reload mev-bot.service > /dev/null 2>&1 || true
    endscript
}
```

### Improvement 2: Real-Time Alerting

**File**: `pkg/monitoring/alerts.go` (new)

```go
type AlertManager struct {
    slackWebhook string
    emailSMTP    string
    thresholds   AlertThresholds
    alertState   map[string]time.Time
    mu           sync.Mutex
}

type AlertThresholds struct {
    ErrorRatePercent     float64 // Alert if >10%
    RPCFailuresPerMin    int     // Alert if >100/min
    ZeroAddressesPerHour int     // Alert if >10/hour
    NoOpportunitiesHours int     // Alert if no opps for N hours
}

func (am *AlertManager) CheckAndAlert(metrics *SystemMetrics) {
    am.mu.Lock()
    defer am.mu.Unlock()

    // Error rate alert
    if metrics.ErrorRate > am.thresholds.ErrorRatePercent {
        if am.shouldAlert("high_error_rate", 5*time.Minute) {
            am.sendAlert("🚨 HIGH ERROR RATE", fmt.Sprintf(
                "Error rate: %.2f%% (threshold: %.2f%%)\nTotal errors: %d",
                metrics.ErrorRate, am.thresholds.ErrorRatePercent, metrics.TotalErrors,
            ))
        }
    }

    // RPC failure alert (guard against a zero-minute window right after startup)
    elapsedMinutes := int(time.Since(metrics.StartTime).Minutes())
    if elapsedMinutes > 0 {
        rpcFailuresPerMin := metrics.RPCFailures / elapsedMinutes
        if rpcFailuresPerMin > am.thresholds.RPCFailuresPerMin {
            if am.shouldAlert("rpc_failures", 10*time.Minute) {
                am.sendAlert("⚠️ RPC FAILURES", fmt.Sprintf(
                    "RPC failures: %d/min (threshold: %d/min)\nCheck RPC providers and rate limits",
                    rpcFailuresPerMin, am.thresholds.RPCFailuresPerMin,
                ))
            }
        }
    }

    // Zero address alert
    if metrics.ZeroAddressesLastHour > am.thresholds.ZeroAddressesPerHour {
        if am.shouldAlert("zero_addresses", 1*time.Hour) {
            am.sendAlert("❌ ZERO ADDRESS CONTAMINATION", fmt.Sprintf(
                "Zero addresses detected: %d in last hour\nData integrity compromised",
                metrics.ZeroAddressesLastHour,
            ))
        }
    }
}

func (am *AlertManager) shouldAlert(alertType string, cooldown time.Duration) bool {
    lastAlert, exists := am.alertState[alertType]
    if !exists || time.Since(lastAlert) > cooldown {
        am.alertState[alertType] = time.Now()
        return true
    }
    return false
}
```
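`sendAlert` is called above but not shown. The sketch below assumes delivery via a Slack incoming webhook, which accepts a JSON POST of the form `{"text": "..."}`, and falls back to the process log when no webhook is configured; the implementation details are illustrative, not the project's code.

```go
import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

// sendAlert posts a message to the configured Slack incoming webhook, or
// logs it locally if no webhook is set.
func (am *AlertManager) sendAlert(title, body string) {
    if am.slackWebhook == "" {
        log.Printf("ALERT %s: %s", title, body)
        return
    }

    payload, err := json.Marshal(map[string]string{
        "text": fmt.Sprintf("%s\n%s", title, body),
    })
    if err != nil {
        log.Printf("failed to marshal alert payload: %v", err)
        return
    }

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, am.slackWebhook, bytes.NewReader(payload))
    if err != nil {
        log.Printf("failed to build alert request: %v", err)
        return
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Printf("failed to deliver alert: %v", err)
        return
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        log.Printf("alert webhook returned status %d", resp.StatusCode)
    }
}
```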
### Improvement 3: Enhanced Logging with Context

**File**: All files using logging

**Current**:
```go
log.Printf("[ERROR] Failed to get pool data: %v", err)
```

**Enhanced**:
```go
import "log/slog"

logger := slog.With(
    "component", "pool_fetcher",
    "pool", poolAddress.Hex(),
    "block", blockNumber,
)

logger.Error("failed to get pool data",
    "error", err,
    "attempt", attempt,
    "rpc_endpoint", currentEndpoint,
)
```

**Benefits**:
- Structured logging for easy parsing
- Automatic context propagation
- Better filtering and analysis
- JSON output for log aggregation
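For the "JSON output for log aggregation" benefit to materialize, `slog` needs a JSON handler installed once at startup. A minimal sketch, assuming the log file path and function name are up to the caller (both are illustrative here):

```go
import (
    "log/slog"
    "os"
)

// newJSONLogger opens (or creates) the given log file and installs a JSON
// slog handler as the process default, so calls like slog.With(...) above
// emit structured JSON lines.
func newJSONLogger(path string) (*slog.Logger, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
    if err != nil {
        return nil, err
    }

    handler := slog.NewJSONHandler(f, &slog.HandlerOptions{
        Level: slog.LevelDebug,
    })
    logger := slog.New(handler)
    slog.SetDefault(logger) // package-level slog calls now use this handler
    return logger, nil
}
```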
## 📊 MONITORING & VALIDATION

### Validation Checklist

After implementing the fixes, validate each one:

```bash
# 1. WebSocket Connection Fix
✅ No "unsupported protocol scheme wss" errors in logs
✅ Successful WebSocket connection messages
✅ Block subscription working

# 2. Zero Address Fix
✅ No zero addresses in liquidity_events_*.jsonl
✅ Valid token addresses in all events
✅ Factory addresses are non-zero

# 3. Rate Limiting Fix
✅ "Too Many Requests" errors reduced by >90%
✅ Successful RPC calls >95%
✅ Automatic backoff observable in logs

# 4. Connection Manager Fix
✅ Automatic provider failover working
✅ Health checks passing
✅ All providers being utilized

# 5. Health Scoring Fix
✅ Health score reflects actual system state
✅ Score <80 when errors >20%
✅ Alerts triggering at correct thresholds
```

### Performance Metrics to Track

**Before Fixes**:
- Error Rate: 81.1%
- RPC Failures: 100,709
- Zero Addresses: 5,462
- Successful Arbitrages: 0
- Opportunities Rejected: 100%

**Target After Fixes**:
- Error Rate: <5%
- RPC Failures: <100/day
- Zero Addresses: 0
- Successful Arbitrages: >0
- Opportunities Rejected: <80%

### Test Commands

```bash
# Comprehensive system test
./scripts/comprehensive-test.sh

# Individual component tests
go test ./pkg/arbitrum/... -v
go test ./pkg/arbitrage/... -v
go test ./pkg/monitor/... -v

# Integration test with real data
LOG_LEVEL=debug timeout 60 ./mev-bot start 2>&1 | tee test-run.log

# Analyze the test run
./scripts/log-manager.sh analyze
./scripts/log-manager.sh health
```

## 🎯 IMPLEMENTATION ROADMAP

### Day 1 (Hours 0-24)
- [ ] Fix WebSocket connection (2 hours)
- [ ] Fix zero address parsing (3 hours)
- [ ] Implement basic rate limiting (2 hours)
- [ ] Fix health scoring script (1 hour)
- [ ] Test and validate (2 hours)
- [ ] Deploy to staging (1 hour)

### Day 2 (Hours 24-48)
- [ ] Enhanced connection manager (4 hours)
- [ ] Fix port conflicts (1 hour)
- [ ] Add multiple RPC providers (2 hours)
- [ ] Implement request caching (3 hours)
- [ ] Full system testing (2 hours)

### Day 3 (Hours 48-72)
- [ ] Batch RPC requests (3 hours)
- [ ] Improve profit calculation (2 hours)
- [ ] Add real-time alerting (2 hours)
- [ ] Enhanced logging (2 hours)
- [ ] Production deployment (3 hours)

### Week 1 (Days 4-7)
- [ ] Log rotation automation
- [ ] Monitoring dashboard improvements
- [ ] Performance optimization
- [ ] Documentation updates
- [ ] Team training on new systems

## 🔒 RISK MITIGATION

### Deployment Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| WebSocket fix breaks HTTP fallback | Medium | High | Keep HTTP client as fallback |
| Rate limiting too aggressive | Medium | Medium | Make limits configurable |
| Cache serves stale data | Low | Medium | Add cache invalidation on errors |
| New errors introduced | Medium | High | Comprehensive testing + rollback plan |

### Rollback Plan

If issues occur after deployment:

```bash
# Quick rollback
git revert HEAD
make build
systemctl restart mev-bot

# Restore from backup
cp backups/mev-bot-backup-YYYYMMDD ./mev-bot
systemctl restart mev-bot

# Check rollback success
./scripts/log-manager.sh status
tail -f logs/mev_bot.log
```

### Gradual Rollout

1. **Staging** (Day 1): Deploy all fixes, test for 24 hours
2. **Canary** (Day 2): Deploy to 10% of production capacity
3. **Production** (Day 3): Full production deployment
4. **Monitoring** (Week 1): Intensive monitoring and tuning

## 📚 ADDITIONAL RESOURCES

### Documentation to Update
- [ ] CLAUDE.md - Add new configuration requirements
- [ ] README.md - Update deployment instructions
- [ ] TODO_AUDIT_FIX.md - Mark completed items
- [ ] API.md - Document new monitoring endpoints

### Code Reviews Required
- WebSocket connection changes
- Zero address validation logic
- Rate limiting implementation
- Connection manager enhancements

### Testing Requirements
- Unit tests for all new functions
- Integration tests for RPC connections
- Load testing for rate limiting
- End-to-end arbitrage execution test

---

**Document Version**: 1.0
**Last Updated**: 2025-10-30
**Review Required**: After each fix implementation
**Owner**: MEV Bot Development Team