Critical Fix Plan - November 1, 2025

Issues Identified & Solutions

🔴 ISSUE 1: Multi-Hop Scanner Finding 0 Paths

Initial Hypothesis: The DFS search at multihop.go:208 calls GetAdjacentTokens(currentToken). If the trigger token is not in the pre-populated token graph, that call returns an empty map and the search never starts.

Evidence:

[INFO] 📥 Received bridge arbitrage opportunity id=arb_1762011082_0xaf88d065 path_length=4 pools=0
[INFO] Multi-hop arbitrage scan completed in 99.983µs: found 0 profitable paths out of 0 total paths
                                                                                      ^^^^^^^^
                                                                                      The issue!

The Flow:

  1. An opportunity arrives with a start token (e.g., USDC 0xaf88d065...)
  2. ScanForArbitrage is called with this token
  3. updateTokenGraph populates 8 hard-coded pools
  4. The DFS starts by calling GetAdjacentTokens(0xaf88d065...)
  5. The token graph does contain this token, so an empty adjacency map is not the cause
  6. Suspected bug: the DFS looks for cycles but starts at depth=0 with current==target
  7. On the first iteration (depth=0) it skips the "found cycle" check, which requires depth>1, so this start condition is harmless
  8. It then retrieves the adjacent tokens correctly
  9. The traversal itself therefore works, and the failure must sit further down the call chain
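The traversal described in the steps above can be sketched as follows. This is a minimal illustration, not the code in multihop.go: the Graph type and the FindCycles name are hypothetical, and the real scanner also tracks pools and amounts. It shows that starting at depth 0 with current == target is fine as long as the cycle check requires depth > 1, and that a start token missing from the graph yields zero paths.

```go
package main

// Graph maps a token (symbol or address) to the tokens reachable in one swap.
// Hypothetical shape: the real token graph also carries pool data per edge.
type Graph map[string][]string

// FindCycles returns every path that starts and ends at start, takes at
// least 2 and at most maxHops hops, and never revisits an intermediate token.
func FindCycles(g Graph, start string, maxHops int) [][]string {
	var cycles [][]string
	visited := map[string]bool{start: true}

	var dfs func(current string, path []string)
	dfs = func(current string, path []string) {
		depth := len(path) - 1 // hops taken so far
		// Same condition as in the plan: back at the start after 2+ hops.
		if depth > 1 && current == start {
			cycles = append(cycles, append([]string(nil), path...))
			return
		}
		if depth >= maxHops {
			return
		}
		for _, next := range g[current] {
			if next == current {
				continue // ignore self-loops
			}
			if next == start {
				dfs(next, append(path, next)) // try to close the cycle
				continue
			}
			if visited[next] {
				continue
			}
			visited[next] = true
			dfs(next, append(path, next))
			visited[next] = false
		}
	}

	dfs(start, []string{start})
	return cycles
}
```

Calling FindCycles on a three-token graph (USDC, WETH, LINK, all connected) with maxHops=3 yields both two-hop and three-hop cycles through USDC. A token that is absent from the graph yields none, which is the situation the initial diagnosis suspected.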

Actual Root Cause (Deeper): Looking at the logic more carefully:

// Line 199: If we're back at the start token and have made at least 2 hops
if depth > 1 && currentToken == targetToken {
    path := mhs.createArbitragePath(currentTokens, currentPath, amount)
    ...
}

The issue is: The DFS is working, but createArbitragePath is returning nil for all paths!

Looking at createArbitragePath (lines 238-260):

func (mhs *MultiHopScanner) createArbitragePath(...) *ArbitragePath {
    if len(tokens) < 3 || len(pools) != len(tokens)-1 {
        return nil  // ← Validation fail
    }

    // Calculate swap outputs
    for i, pool := range pools {
        outputAmount, err := mhs.calculateSwapOutput(...)
        if err != nil {
            mhs.logger.Debug(...) // ← Silent failure!
            return nil
        }
    }
}

The Real Problem:

  1. The DFS finds paths (e.g., USDC → WETH → LINK → USDC)
  2. createArbitragePath is called for each path
  3. calculateSwapOutput tries to get pool reserves
  4. But the pools carry placeholder liquidity values (line 485: uint256.NewInt(1000000000000000000))
  5. Or calculateSwapOutput fails because SqrtPriceX96 data is missing
  6. Path creation fails silently, because the error is only logged at Debug level
  7. The scanner reports 0 paths
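To illustrate why missing SqrtPriceX96 data makes a swap calculation fail, here is a minimal sketch of the standard Uniswap V3 price conversion, price = (sqrtPriceX96 / 2^96)^2. It assumes math/big rather than the uint256 type the scanner uses, and PriceFromSqrtX96, Q96, and Q192 are hypothetical names. It returns an explicit error for a nil or zero value instead of failing silently.

```go
package main

import (
	"errors"
	"math/big"
)

var (
	// Q96 is 2^96, the fixed-point scale of sqrtPriceX96.
	Q96 = new(big.Int).Lsh(big.NewInt(1), 96)
	// Q192 is 2^192, i.e. Q96 squared.
	Q192 = new(big.Int).Lsh(big.NewInt(1), 192)
)

// PriceFromSqrtX96 converts a sqrtPriceX96 value into the raw token1/token0
// price ratio (no adjustment for token decimals).
func PriceFromSqrtX96(sqrtPriceX96 *big.Int) (*big.Rat, error) {
	if sqrtPriceX96 == nil || sqrtPriceX96.Sign() <= 0 {
		return nil, errors.New("missing or zero sqrtPriceX96")
	}
	num := new(big.Int).Mul(sqrtPriceX96, sqrtPriceX96)
	return new(big.Rat).SetFrac(num, Q192), nil
}
```

With sqrtPriceX96 equal to 2^96 the ratio is exactly 1. A pool whose state was never fetched produces an error that the caller can log with the pool address.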

🔴 ISSUE 2: Security Manager Disabled

Status: CRITICAL - Running without transaction validation

Location: cmd/mev-bot/main.go:141

Fix: Uncomment security manager initialization

🔴 ISSUE 3: Rate Limiting (2,699 errors)

Root Cause: Single RPC endpoint being overwhelmed

Fix: Enable multi-provider failover from providers_runtime.yaml

🔴 ISSUE 4: Port Binding Conflicts (53 errors)

Root Cause: Multiple instances or improper cleanup

Fix: Add SO_REUSEADDR and pre-flight port checks

🔴 ISSUE 5: Context Cancellation (71 errors)

Root Cause: Improper shutdown handling

Fix: Add graceful shutdown with proper context handling


Fix Implementation Plan

Fix 1: Multi-Hop Scanner - Add Real Pool Data Fetching

File: pkg/arbitrage/multihop.go

Changes:

  1. Add DEBUG logging to createArbitragePath to show why paths fail
  2. Fetch real pool data (sqrtPriceX96, liquidity) from RPC in updateTokenGraph
  3. Add fallback: if RPC fetch fails, use DataFetcher or skip pool
  4. Add metrics to track: paths_found, paths_validated, paths_rejected
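One way to combine items 1 and 4 is to have validation return a reason and count the outcomes in one place. The sketch below is an assumption about how this could look, not existing project code: PathMetrics, validatePathShape, and Record are hypothetical names, and only the shape check from createArbitragePath (at least 3 tokens, one pool per hop) is modelled.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// PathMetrics counts scanner outcomes (paths_found, paths_validated,
// paths_rejected). The counters are safe for concurrent use.
type PathMetrics struct {
	Found     atomic.Int64
	Validated atomic.Int64
	Rejected  atomic.Int64
}

// validatePathShape mirrors the shape check in createArbitragePath but
// returns a reason instead of a bare nil.
func validatePathShape(tokens, pools int) error {
	if tokens < 3 {
		return fmt.Errorf("too few tokens: got %d, need at least 3", tokens)
	}
	if pools != tokens-1 {
		return fmt.Errorf("pool count mismatch: got %d pools for %d tokens", pools, tokens)
	}
	return nil
}

// Record counts one candidate path and returns the rejection reason, if any,
// so the caller can log it.
func (m *PathMetrics) Record(tokens, pools int) error {
	m.Found.Add(1)
	if err := validatePathShape(tokens, pools); err != nil {
		m.Rejected.Add(1)
		return err
	}
	m.Validated.Add(1)
	return nil
}
```

If Rejected equals Found after a scan, the returned reasons show directly which check is discarding every path.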

Code Addition:

// In createArbitragePath, add before each return nil.
// reason is a string set by the check that failed:
mhs.logger.Debug(fmt.Sprintf("❌ Path validation failed: tokens=%d pools=%d reason=%s",
    len(tokens), len(pools), reason))

// In updateTokenGraph, fetch real data:
for _, pool := range pools {
    // Fetch real pool state from RPC
    slot0, err := mhs.fetchPoolSlot0(ctx, pool.Address)
    if err != nil {
        mhs.logger.Warn(fmt.Sprintf("Failed to fetch pool state for %s: %v", pool.Address, err))
        continue // Skip this pool
    }
    pool.SqrtPriceX96 = slot0.SqrtPriceX96
    // If these are Uniswap V3 pools, liquidity comes from the pool's
    // liquidity() getter, not slot0(), so fetchPoolSlot0 must make both calls.
    pool.Liquidity = slot0.Liquidity
    mhs.addPoolToGraph(pool)
}

Fix 2: Security Manager

File: cmd/mev-bot/main.go

Change: Uncomment lines 143-180 to re-enable security manager

Fix 3: Multi-Provider RPC

File: cmd/mev-bot/main.go or provider initialization

Change: Enable provider rotation with fallback

// Add after line 132
if providerConfigPath := os.Getenv("PROVIDER_CONFIG_PATH"); providerConfigPath != "" {
    log.Info(fmt.Sprintf("Loading multi-provider configuration from: %s", providerConfigPath))
    // Enable provider manager with failover
}
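The failover behaviour itself could look roughly like the following sketch. It is not the project's provider manager: CallWithFailover is a hypothetical helper, and the call parameter stands in for whatever RPC request is being made. It tries each endpoint in order and returns the first success, or all collected errors if every endpoint fails.

```go
package main

import (
	"errors"
	"fmt"
)

// CallWithFailover invokes call against each endpoint in order and returns
// the first successful result. If every endpoint fails, the returned error
// joins the per-endpoint errors.
func CallWithFailover(endpoints []string, call func(endpoint string) (string, error)) (string, error) {
	if len(endpoints) == 0 {
		return "", errors.New("no RPC endpoints configured")
	}
	var errs []error
	for _, ep := range endpoints {
		res, err := call(ep)
		if err == nil {
			return res, nil
		}
		// Keep the endpoint in the error so rate-limited providers are visible.
		errs = append(errs, fmt.Errorf("%s: %w", ep, err))
	}
	return "", errors.Join(errs...)
}
```

A production version would probably also remember which endpoint last succeeded and rotate the starting point, so that one rate-limited provider is not retried first on every request.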

Fix 4: Port Binding

File: pkg/metrics/server.go (or equivalent)

Change:

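// Before:
// listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
//
// After: set SO_REUSEADDR explicitly and surface any setsockopt error.
// Note: on Unix, Go's net.Listen already sets SO_REUSEADDR on listening sockets.
lc := net.ListenConfig{
    Control: func(network, address string, c syscall.RawConn) error {
        var sockErr error
        if err := c.Control(func(fd uintptr) {
            sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEADDR, 1)
        }); err != nil {
            return err
        }
        return sockErr
    },
}
listener, err := lc.Listen(ctx, "tcp", fmt.Sprintf(":%d", port))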
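The pre-flight port check mentioned under Issue 4 is not shown above. A minimal sketch, assuming hypothetical helper names CheckPortFree and ListenChecked, is to attempt a bind and report a clear error before the rest of startup proceeds:

```go
package main

import (
	"fmt"
	"net"
)

// CheckPortFree reports an error if a TCP listener cannot currently be
// opened on addr, for example because another instance still holds the port.
func CheckPortFree(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("pre-flight check: %s is not available: %w", addr, err)
	}
	return ln.Close()
}

// ListenChecked runs the pre-flight check and then opens the real listener.
func ListenChecked(addr string) (net.Listener, error) {
	if err := CheckPortFree(addr); err != nil {
		return nil, err
	}
	return net.Listen("tcp", addr)
}
```

The check is inherently racy, since another process can take the port between the check and the real bind. Its value is the early, readable error message rather than a guarantee.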

Fix 5: Graceful Shutdown

File: cmd/mev-bot/main.go

Change: Add to shutdown handler (after line 400+):

// Create shutdown context with timeout
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer shutdownCancel()

// Cancel main context
cancel()

// Wait for goroutines to finish with timeout
done := make(chan struct{})
go func() {
    // Wait for all subsystems
    wg.Wait()
    close(done)
}()

select {
case <-done:
    log.Info("Graceful shutdown completed")
case <-shutdownCtx.Done():
    log.Warn("Shutdown timeout exceeded, forcing exit")
}
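The handler above only works if every goroutine is tracked by wg and returns when the main context is cancelled. The sketch below shows both sides together under assumed names (Supervisor, Go, Shutdown are hypothetical, not project types): workers receive the context's Done channel, and Shutdown cancels, waits, and reports whether everything stopped in time.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// Supervisor tracks worker goroutines and stops them via a shared context.
type Supervisor struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// NewSupervisor creates a supervisor with a fresh cancellable context.
func NewSupervisor() *Supervisor {
	ctx, cancel := context.WithCancel(context.Background())
	return &Supervisor{ctx: ctx, cancel: cancel}
}

// Go starts fn in a tracked goroutine. fn receives the context's Done
// channel and must return once that channel is closed.
func (s *Supervisor) Go(fn func(done <-chan struct{})) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		fn(s.ctx.Done())
	}()
}

// Shutdown cancels the context and waits up to timeout for all workers.
// It returns true if every worker exited in time.
func (s *Supervisor) Shutdown(timeout time.Duration) bool {
	s.cancel()
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}
```

A worker that ignores the done channel makes Shutdown return false after the timeout, which corresponds to the "Shutdown timeout exceeded" branch in the handler.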

Implementation Priority

Phase 1: Critical Security (30 minutes)

  1. Re-enable security manager
  2. Add port reuse socket option
  3. Add graceful shutdown

Phase 2: Multi-Hop Scanner Fix (1-2 hours)

  1. Add detailed DEBUG logging to identify failure point
  2. Implement real pool data fetching in updateTokenGraph
  3. Add reserve cache integration
  4. Test with live data

Phase 3: RPC Optimization (1 hour)

  1. Enable multi-provider rotation
  2. Add exponential backoff
  3. Re-enable DataFetcher for batching
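For item 2, a capped exponential backoff can be sketched as below. The helper names BackoffDelay and Retry and the parameter choices are assumptions for illustration. A real implementation would likely add jitter and only retry on rate-limit responses.

```go
package main

import (
	"errors"
	"time"
)

// BackoffDelay returns the wait before retry number attempt (0-based):
// base * 2^attempt, capped at maxDelay.
func BackoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= maxDelay/2 {
			return maxDelay // doubling again would reach or exceed the cap
		}
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

// Retry runs op up to attempts times, sleeping with exponential backoff
// between failures, and returns the last error if all attempts fail.
func Retry(attempts int, base, maxDelay time.Duration, op func() error) error {
	if attempts <= 0 {
		return errors.New("attempts must be positive")
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(BackoffDelay(i, base, maxDelay))
		}
	}
	return err
}
```

With base=100ms and maxDelay=2s the delays are 100ms, 200ms, 400ms, 800ms, 1.6s, and then 2s for every later attempt.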

Phase 4: Testing & Validation (1 hour)

  1. Run bot for 10 minutes
  2. Verify no rate limiting errors
  3. Verify multi-hop scanner finds paths
  4. Verify opportunities are executed
  5. Check all metrics

Expected Outcomes

Before Fixes:

  • 0 profitable paths found
  • 2,699 rate limit errors
  • Security disabled
  • 53 port conflicts
  • 71 context cancellations

After Fixes:

  • 5-20 profitable paths per opportunity
  • < 10 rate limit errors (99.6% reduction)
  • Security enabled
  • 0 port conflicts
  • 0 context cancellations
  • Actual arbitrage executions!

Testing Commands

# Step 1: Build with fixes
make clean && make build

# Step 2: Test startup (should see no errors)
timeout 30 ./mev-bot start 2>&1 | tee test_output.log

# Step 3: Count critical errors
grep -cE "ERROR|FATAL|panic" test_output.log  # Should print 0

# Step 4: Check multi-hop scanner
grep "profitable paths" test_output.log | tail -5  # Should show > 0 paths

# Step 5: Full run (2 minutes)
timeout 120 ./mev-bot start 2>&1 | tee full_test.log

# Step 6: Analyze results
./scripts/log-manager.sh analyze

Rollback Plan

If fixes cause issues:

git stash  # Stash uncommitted changes
git checkout 0b1c7bb  # Return to last known good commit (leaves HEAD detached)
make build && ./mev-bot start

Success Criteria

  • Security manager enabled
  • Multi-hop scanner finds > 0 paths
  • Rate limit errors < 1% of previous
  • No port binding errors
  • No context cancellation errors
  • At least 1 arbitrage execution attempt per minute
  • Health score > 95/100

Next Step: Implement Phase 1 fixes (security critical)