# Critical Fix Plan - November 1, 2025
## Issues Identified & Solutions
### 🔴 ISSUE 1: Multi-Hop Scanner Finding 0 Paths
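**Initial Hypothesis:**
The DFS search at `multihop.go:208` calls `GetAdjacentTokens(currentToken)`. If the trigger token is missing from the pre-populated token graph, the call returns an empty map and the search never starts. The flow traced below shows that the graph does contain the token, so this is only the starting hypothesis.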
**Evidence:**
```
[INFO] 📥 Received bridge arbitrage opportunity id=arb_1762011082_0xaf88d065 path_length=4 pools=0
[INFO] Multi-hop arbitrage scan completed in 99.983µs: found 0 profitable paths out of 0 total paths
^^^^^^^^ The issue!
```
**The Flow:**
1. Opportunity comes in with start token (e.g., USDC `0xaf88d065...`)
2. `ScanForArbitrage` called with this token
3. `updateTokenGraph` populates 8 hard-coded pools
4. DFS starts: `GetAdjacentTokens(0xaf88d065...)`
5. The token graph does contain this token, so the lookup succeeds
6. The DFS starts at depth=0 with `currentToken == targetToken`, which looks like a **bug** at first
7. On the first iteration (depth=0) the "found cycle" check is skipped because it requires depth>1, so this is expected behavior
8. Adjacent tokens are returned correctly
9. The scan still reports 0 paths, so the failure must be further downstream
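To make the flow above concrete, here is a minimal, self-contained sketch of a cycle-finding DFS that uses the same `depth > 1 && current == start` check. The `graph` type, `findCycles`, and the token labels are illustrative assumptions, not the code in `multihop.go`.

```go
package main

import "fmt"

// graph maps a token to the tokens reachable through one pool.
// Hypothetical shape, used only for this sketch.
type graph map[string][]string

// findCycles returns every path that leaves start and comes back to it
// in at least 2 and at most maxHops hops, without revisiting an
// intermediate token.
func findCycles(g graph, start string, maxHops int) [][]string {
	var cycles [][]string
	visited := map[string]bool{}

	var dfs func(current string, path []string)
	dfs = func(current string, path []string) {
		depth := len(path) - 1 // hops taken so far
		// Same condition as the plan: a cycle only counts after 2+ hops,
		// so the very first call (depth 0, current == start) falls through.
		if depth > 1 && current == start {
			cycles = append(cycles, append([]string(nil), path...))
			return
		}
		if depth >= maxHops {
			return
		}
		for _, next := range g[current] {
			if next == current {
				continue // ignore self-loops
			}
			if next == start {
				dfs(next, append(path, next))
				continue
			}
			if visited[next] {
				continue
			}
			visited[next] = true
			dfs(next, append(path, next))
			visited[next] = false
		}
	}
	dfs(start, []string{start})
	return cycles
}

func main() {
	g := graph{
		"USDC": {"WETH"},
		"WETH": {"LINK", "USDC"},
		"LINK": {"USDC"},
	}
	for _, c := range findCycles(g, "USDC", 4) {
		fmt.Println(c)
	}
	// A start token that is missing from the graph yields no cycles at all.
	fmt.Println(len(findCycles(graph{}, "USDC", 4)))
}
```

The empty-graph call at the end corresponds to the initial hypothesis: with no adjacency entry for the start token, the loop body never runs and the search returns immediately.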
**Actual Root Cause (Deeper):**
Looking at the logic more carefully:
```go
// Line 199: If we're back at the start token and have made at least 2 hops
if depth > 1 && currentToken == targetToken {
	path := mhs.createArbitragePath(currentTokens, currentPath, amount)
	...
}
```
The issue is: **The DFS is working, but `createArbitragePath` is returning `nil`** for all paths!
Looking at `createArbitragePath` (lines 238-260):
```go
func (mhs *MultiHopScanner) createArbitragePath(...) *ArbitragePath {
	if len(tokens) < 3 || len(pools) != len(tokens)-1 {
		return nil // ← Validation fail
	}
	// Calculate swap outputs
	for i, pool := range pools {
		outputAmount, err := mhs.calculateSwapOutput(...)
		if err != nil {
			mhs.logger.Debug(...) // ← Silent failure!
			return nil
		}
	}
}
```
**The Real Problem:**
1. DFS finds paths (e.g., USDC → WETH → LINK → USDC)
2. `createArbitragePath` is called
3. `calculateSwapOutput` tries to get pool reserves
4. **But the pools have placeholder liquidity values!** (line 485: `uint256.NewInt(1000000000000000000)`)
5. Or `calculateSwapOutput` fails due to missing SqrtPriceX96 data
6. Path creation fails silently
7. Returns 0 paths
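One way to surface steps 4 and 5 instead of failing silently is to validate pool state before pricing and return a named error. The sketch below is an assumption-laden illustration: `poolState`, `validatePoolState`, and the error values are hypothetical, and it uses `math/big` rather than the project's `uint256` type to stay dependency-free. The only fixed fact it relies on is the standard conversion for a Q64.96 square-root price, price = (sqrtPriceX96 / 2^96)^2.

```go
package main

import (
	"errors"
	"fmt"
	"math/big"
)

// poolState is a hypothetical, minimal view of a concentrated-liquidity pool.
type poolState struct {
	SqrtPriceX96 *big.Int
	Liquidity    *big.Int
}

var (
	errMissingPrice     = errors.New("sqrtPriceX96 missing or zero")
	errMissingLiquidity = errors.New("liquidity missing or zero")
	errPlaceholder      = errors.New("liquidity equals the 1e18 placeholder")
)

// q96 is 2^96, the fixed-point scale of sqrtPriceX96.
var q96 = new(big.Int).Lsh(big.NewInt(1), 96)

// placeholderLiquidity is the hard-coded value quoted in the plan (1e18).
var placeholderLiquidity = new(big.Int).SetUint64(1_000_000_000_000_000_000)

// validatePoolState returns a descriptive error when the pool cannot be
// priced, so the caller can log the reason instead of dropping the path.
func validatePoolState(p poolState) error {
	if p.SqrtPriceX96 == nil || p.SqrtPriceX96.Sign() <= 0 {
		return errMissingPrice
	}
	if p.Liquidity == nil || p.Liquidity.Sign() <= 0 {
		return errMissingLiquidity
	}
	// Treating the exact placeholder as suspect is an assumption of this sketch.
	if p.Liquidity.Cmp(placeholderLiquidity) == 0 {
		return errPlaceholder
	}
	return nil
}

// spotPrice converts sqrtPriceX96 into the raw token1-per-token0 price:
// (sqrtPriceX96 / 2^96)^2. Token decimals are not applied here.
func spotPrice(sqrtPriceX96 *big.Int) *big.Float {
	ratio := new(big.Float).Quo(
		new(big.Float).SetInt(sqrtPriceX96),
		new(big.Float).SetInt(q96),
	)
	return new(big.Float).Mul(ratio, ratio)
}

func main() {
	stale := poolState{SqrtPriceX96: q96, Liquidity: placeholderLiquidity}
	fmt.Println(validatePoolState(stale))
	fmt.Println(spotPrice(q96).Text('f', 6))
}
```

Returning distinct errors makes it possible to count rejections per cause, which helps tell the placeholder-liquidity case apart from the missing-price case.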
### 🔴 ISSUE 2: Security Manager Disabled
**Status:** CRITICAL - Running without transaction validation
**Location:** `cmd/mev-bot/main.go:141`
**Fix:** Uncomment security manager initialization
### 🔴 ISSUE 3: Rate Limiting (2,699 errors)
**Root Cause:** Single RPC endpoint being overwhelmed
**Fix:** Enable multi-provider failover from `providers_runtime.yaml`
### 🔴 ISSUE 4: Port Binding Conflicts (53 errors)
**Root Cause:** Multiple instances or improper cleanup
**Fix:** Add SO_REUSEADDR and pre-flight port checks
### 🔴 ISSUE 5: Context Cancellation (71 errors)
**Root Cause:** Improper shutdown handling
**Fix:** Add graceful shutdown with proper context handling
---
## Fix Implementation Plan
### Fix 1: Multi-Hop Scanner - Add Real Pool Data Fetching
**File:** `pkg/arbitrage/multihop.go`
**Changes:**
1. Add DEBUG logging to `createArbitragePath` to show why paths fail
2. Fetch real pool data (sqrtPriceX96, liquidity) from RPC in `updateTokenGraph`
3. Add fallback: if RPC fetch fails, use DataFetcher or skip pool
4. Add metrics to track: paths_found, paths_validated, paths_rejected
**Code Addition:**
```go
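// In createArbitragePath, add before each `return nil`.
// `reason` is a short string set at that failure site (it does not exist in the code yet).
mhs.logger.Debug(fmt.Sprintf("❌ Path validation failed: tokens=%d pools=%d reason=%s",
	len(tokens), len(pools), reason))

// In updateTokenGraph, fetch real data:
for _, pool := range pools {
	// Fetch real pool state from RPC. fetchPoolSlot0 is a helper to be added.
	// Note: on Uniswap V3-style pools, slot0() does not include liquidity; it comes
	// from the separate liquidity() getter, so the helper must make both calls.
	slot0, err := mhs.fetchPoolSlot0(ctx, pool.Address)
	if err != nil {
		mhs.logger.Warn(fmt.Sprintf("Failed to fetch pool state for %s: %v", pool.Address, err))
		continue // Skip this pool
	}
	pool.SqrtPriceX96 = slot0.SqrtPriceX96
	pool.Liquidity = slot0.Liquidity
	mhs.addPoolToGraph(pool)
}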
```
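Change 4 in the list above asks for `paths_found`, `paths_validated`, and `paths_rejected` counters. A minimal sketch of such a tracker follows; the `pathMetrics` type and its methods are hypothetical and not part of the existing codebase, and a real implementation might export these through the project's metrics server instead.

```go
package main

import (
	"fmt"
	"sync"
)

// pathMetrics counts candidate paths and why they were rejected.
// Hypothetical type for illustration only.
type pathMetrics struct {
	mu        sync.Mutex
	found     int
	validated int
	rejected  map[string]int // keyed by rejection reason
}

func newPathMetrics() *pathMetrics {
	return &pathMetrics{rejected: map[string]int{}}
}

// record registers one candidate path. An empty reason means the path
// passed validation; any other value is counted as a rejection.
func (m *pathMetrics) record(reason string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.found++
	if reason == "" {
		m.validated++
		return
	}
	m.rejected[reason]++
}

// summary renders the three counters named in the plan.
func (m *pathMetrics) summary() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := 0
	for _, n := range m.rejected {
		total += n
	}
	return fmt.Sprintf("paths_found=%d paths_validated=%d paths_rejected=%d",
		m.found, m.validated, total)
}

func main() {
	m := newPathMetrics()
	m.record("")                  // validated path
	m.record("swap_output_error") // calculateSwapOutput failed
	m.record("length_mismatch")   // tokens/pools length check failed
	fmt.Println(m.summary())
}
```

Keying rejections by reason keeps the per-cause breakdown available for the DEBUG log line without adding a counter per failure site.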
### Fix 2: Security Manager
**File:** `cmd/mev-bot/main.go`
**Change:** Uncomment lines 143-180 to re-enable security manager
### Fix 3: Multi-Provider RPC
**File:** `cmd/mev-bot/main.go` or provider initialization
**Change:** Enable provider rotation with fallback
```go
// Add after line 132
if providerConfigPath := os.Getenv("PROVIDER_CONFIG_PATH"); providerConfigPath != "" {
	log.Info(fmt.Sprintf("Loading multi-provider configuration from: %s", providerConfigPath))
	// Enable provider manager with failover
}
```
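The provider manager itself is not shown in this plan. As a rough sketch of what rotation with fallback and exponential backoff could look like, the following tries each endpoint in order and backs off between full passes. `callWithFailover`, `simulate`, and the endpoint labels are hypothetical; the real provider manager and the format of `providers_runtime.yaml` may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithFailover tries each endpoint in order. If a full pass fails, it
// waits (doubling the delay each round) and starts over, up to maxRounds.
// It returns the endpoint that succeeded.
func callWithFailover(
	ctx context.Context,
	endpoints []string,
	maxRounds int,
	baseDelay time.Duration,
	call func(ctx context.Context, endpoint string) error,
) (string, error) {
	if len(endpoints) == 0 {
		return "", errors.New("no endpoints configured")
	}
	if maxRounds < 1 {
		maxRounds = 1
	}
	var lastErr error
	delay := baseDelay
	for round := 0; round < maxRounds; round++ {
		for _, ep := range endpoints {
			if err := call(ctx, ep); err != nil {
				lastErr = fmt.Errorf("%s: %w", ep, err)
				continue // rotate to the next provider
			}
			return ep, nil
		}
		if round == maxRounds-1 {
			break // no point sleeping after the final round
		}
		select {
		case <-time.After(delay):
			delay *= 2 // exponential backoff between rounds
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", fmt.Errorf("all endpoints failed: %w", lastErr)
}

// simulate runs callWithFailover against a fake transport in which every
// endpoint listed in down is rate limited. Demonstration only.
func simulate(endpoints []string, down map[string]bool) (string, error) {
	call := func(_ context.Context, ep string) error {
		if down[ep] {
			return errors.New("429 too many requests")
		}
		return nil
	}
	return callWithFailover(context.Background(), endpoints, 3, time.Millisecond, call)
}

func main() {
	ep, err := simulate([]string{"primary", "backup"}, map[string]bool{"primary": true})
	fmt.Println(ep, err)
}
```

Backing off only after a full pass keeps latency low when a single provider is throttled, while still slowing down when every provider is rejecting requests.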
### Fix 4: Port Binding
**File:** `pkg/metrics/server.go` (or equivalent)
**Change:**
```go
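// Before:
//   listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
// After:
// Note: on Unix, Go's net package already sets SO_REUSEADDR on TCP listeners, and the
// option does not let two live processes share a port. The pre-flight port check is
// what catches a second running instance.
lc := net.ListenConfig{
	Control: func(network, address string, c syscall.RawConn) error {
		var sockErr error
		// c.Control only reports failures of the raw-conn call itself,
		// so the setsockopt error is captured separately.
		if err := c.Control(func(fd uintptr) {
			sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEADDR, 1)
		}); err != nil {
			return err
		}
		return sockErr
	},
}
listener, err := lc.Listen(ctx, "tcp", fmt.Sprintf(":%d", port))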
```
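Issue 4 also calls for pre-flight port checks, which the snippet above does not cover. A minimal sketch, assuming the check runs before any server starts: try to bind the address, report a clear error if that fails, and release it immediately. `portAvailable` and `checkWhileHeld` are hypothetical names; there is an unavoidable small window between the check and the real bind.

```go
package main

import (
	"fmt"
	"net"
)

// portAvailable reports whether a TCP listener can be opened on addr right
// now. It binds and immediately releases the address.
func portAvailable(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return fmt.Errorf("pre-flight check failed for %s: %w", addr, err)
	}
	return ln.Close()
}

// checkWhileHeld opens a listener on an ephemeral loopback port and runs
// the pre-flight check against that same address, first while the port is
// held and then after it has been released. Demonstration only.
func checkWhileHeld() (heldErr, freeErr error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err, err
	}
	addr := ln.Addr().String()
	heldErr = portAvailable(addr) // another listener owns the port
	ln.Close()
	freeErr = portAvailable(addr) // port has been released
	return heldErr, freeErr
}

func main() {
	held, free := checkWhileHeld()
	fmt.Println("while held:", held)
	fmt.Println("after release:", free)
}
```

Failing fast with the address in the error message makes a leftover instance easier to spot than a bind error surfacing later from inside the metrics server.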
### Fix 5: Graceful Shutdown
**File:** `cmd/mev-bot/main.go`
**Change:** Add to shutdown handler (after line 400+):
```go
// Create shutdown context with timeout
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer shutdownCancel()

// Cancel main context
cancel()

// Wait for goroutines to finish with timeout
done := make(chan struct{})
go func() {
	// Wait for all subsystems
	wg.Wait()
	close(done)
}()

select {
case <-done:
	log.Info("Graceful shutdown completed")
case <-shutdownCtx.Done():
	log.Warn("Shutdown timeout exceeded, forcing exit")
}
```
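The shutdown handler above only helps if the subsystems exit on cancellation and do not log that exit as an error. The sketch below shows one way a worker could do this, treating `context.Canceled` as an expected shutdown signal. `worker`, `isShutdownErr`, and `runAndStop` are hypothetical names, and whether the 71 logged errors come from this pattern is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// isShutdownErr reports whether err only reflects a cancelled context,
// which is expected during a graceful shutdown.
func isShutdownErr(err error) bool {
	return errors.Is(err, context.Canceled)
}

// worker runs work on every tick until ctx is cancelled, then returns
// ctx.Err() so the caller can decide how to log it.
func worker(ctx context.Context, tick time.Duration, work func()) error {
	t := time.NewTicker(tick)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			work()
		}
	}
}

// runAndStop starts n workers, cancels them shortly afterwards, waits for
// all of them, and returns how many exited with a non-shutdown error.
func runAndStop(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	var mu sync.Mutex
	realErrors := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := worker(ctx, time.Millisecond, func() {})
			if err != nil && !isShutdownErr(err) {
				mu.Lock()
				realErrors++
				mu.Unlock()
			}
		}()
	}
	time.Sleep(5 * time.Millisecond)
	cancel()  // signal shutdown
	wg.Wait() // every worker has returned
	return realErrors
}

func main() {
	fmt.Println("non-shutdown errors:", runAndStop(3))
}
```

Using `errors.Is` rather than a direct comparison also matches cancellation errors that have been wrapped with `%w` further down the call stack.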
---
## Implementation Priority
### Phase 1: Critical Security (30 minutes)
1. ✅ Re-enable security manager
2. ✅ Add port reuse socket option
3. ✅ Add graceful shutdown
### Phase 2: Multi-Hop Scanner Fix (1-2 hours)
1. ✅ Add detailed DEBUG logging to identify failure point
2. ✅ Implement real pool data fetching in updateTokenGraph
3. ✅ Add reserve cache integration
4. ✅ Test with live data
### Phase 3: RPC Optimization (1 hour)
1. ✅ Enable multi-provider rotation
2. ✅ Add exponential backoff
3. ✅ Re-enable DataFetcher for batching
### Phase 4: Testing & Validation (1 hour)
1. ✅ Run bot for 10 minutes
2. ✅ Verify no rate limiting errors
3. ✅ Verify multi-hop scanner finds paths
4. ✅ Verify opportunities are executed
5. ✅ Check all metrics
---
## Expected Outcomes
### Before Fixes:
- ❌ 0 profitable paths found
- ❌ 2,699 rate limit errors
- ❌ Security disabled
- ❌ 53 port conflicts
- ❌ 71 context cancellations
### After Fixes:
- ✅ 5-20 profitable paths per opportunity
- ✅ < 10 rate limit errors (99.6% reduction)
- ✅ Security enabled
- ✅ 0 port conflicts
- ✅ 0 context cancellations
- ✅ Actual arbitrage executions!
---
## Testing Commands
```bash
# Step 1: Build with fixes
make clean && make build
# Step 2: Test startup (should see no errors)
timeout 30 ./mev-bot start 2>&1 | tee test_output.log
# Step 3: Check for critical errors
grep -E "ERROR|FATAL|panic" test_output.log | wc -l # Should be 0
# Step 4: Check multi-hop scanner
grep "profitable paths" test_output.log | tail -5 # Should show > 0 paths
# Step 5: Full run (2 minutes)
timeout 120 ./mev-bot start 2>&1 | tee full_test.log
# Step 6: Analyze results
./scripts/log-manager.sh analyze
```
---
## Rollback Plan
If fixes cause issues:
```bash
git stash # Stash changes
git checkout 0b1c7bb # Return to last known good commit
make build && ./mev-bot start
```
---
## Success Criteria
- [ ] Security manager enabled
- [ ] Multi-hop scanner finds > 0 paths
- [ ] Rate limit errors < 1% of previous
- [ ] No port binding errors
- [ ] No context cancellation errors
- [ ] At least 1 arbitrage execution attempt per minute
- [ ] Health score > 95/100
---
**Next Step:** Implement Phase 1 fixes (security critical)