Files
mev-beta/docs/planning/04_HIGH-002_Race_Condition_Fixes_Plan.md
Krypto Kajun 850223a953 fix(multicall): resolve critical multicall parsing corruption issues
- Added comprehensive bounds checking to prevent buffer overruns in multicall parsing
- Implemented graduated validation system (Strict/Moderate/Permissive) to reduce false positives
- Added LRU caching system for address validation with 10-minute TTL
- Enhanced ABI decoder with missing Universal Router and Arbitrum-specific DEX signatures
- Fixed duplicate function declarations and import conflicts across multiple files
- Added error recovery mechanisms with multiple fallback strategies
- Updated tests to handle new validation behavior for suspicious addresses
- Fixed parser test expectations for improved validation system
- Applied gofmt formatting fixes to ensure code style compliance
- Fixed mutex copying issues in monitoring package by introducing MetricsSnapshot
- Resolved critical security vulnerabilities in heuristic address extraction
- Progress: Updated TODO audit from 10% to 35% complete

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-17 00:12:55 -05:00

7.6 KiB

HIGH-002: Race Condition Fixes - Detailed Fix Plan

Issue ID: HIGH-002
Category: Security
Priority: High
Status: Not Started
Generated: October 9, 2025
Estimate: 4-5 hours

Overview

This plan addresses multiple race condition vulnerabilities in critical code paths, particularly around shared state access and synchronization. Race conditions can lead to data corruption, inconsistent states, and security vulnerabilities in concurrent environments.

Affected Files and Areas

  • pkg/security/keymanager.go:481,526,531 - Atomic operation consistency
  • pkg/arbitrage/service.go - Shared state protection
  • pkg/scanner/concurrent.go - Worker pool synchronization
  • pkg/transport/provider_manager.go - Connection state management

Implementation Tasks

1. Review All Shared State Access Patterns

Task ID: HIGH-002.1
Time Estimate: 1.5 hours
Dependencies: None

Conduct comprehensive review of shared state access patterns:

  • Identify all shared variables and data structures
  • Map access patterns (read/write operations)
  • Document current synchronization mechanisms
  • Identify potential race conditions
  • Assess risk level for each identified race condition

Focus Areas:

  • In pkg/security/keymanager.go: Review all concurrent access to key data structures
  • In pkg/arbitrage/service.go: Examine order book and pricing data access
  • In pkg/scanner/concurrent.go: Analyze worker state and result handling
  • In pkg/transport/provider_manager.go: Evaluate connection pool management

2. Replace Inconsistent Atomic Usage with Proper Synchronization

Task ID: HIGH-002.2
Time Estimate: 1.5 hours
Dependencies: HIGH-002.1

Fix atomic operation inconsistencies in pkg/security/keymanager.go:481,526,531:

  • Replace inappropriate atomic operations with mutexes where needed
  • Ensure atomic operations are used correctly for simple operations
  • Consolidate access patterns to a consistent approach
  • Add proper synchronization for complex shared state
// Example of fixing atomic usage inconsistency
type KeyManager struct {
    mu       sync.RWMutex
    keys     map[string]*ecdsa.PrivateKey
    counter  *atomic.Int64  // Use atomic for simple counters
}

func (km *KeyManager) GetKey(id string) (*ecdsa.PrivateKey, error) {
    km.mu.RLock()
    defer km.mu.RUnlock()
    
    key, exists := km.keys[id]
    if !exists {
        return nil, fmt.Errorf("key not found: %s", id)
    }
    
    return key, nil
}

func (km *KeyManager) IncrementCounter() {
    km.counter.Add(1)  // Proper atomic operation
}

3. Add Race Detection Tests to CI Pipeline

Task ID: HIGH-002.3
Time Estimate: 0.5 hours
Dependencies: HIGH-002.1, HIGH-002.2

Implement race detection in the CI pipeline:

  • Add -race flag to all Go test commands
  • Configure race detection for integration tests
  • Set up automated race condition testing
  • Monitor for race conditions in pull requests

4. Implement Proper Read-Write Lock Usage

Task ID: HIGH-002.4
Time Estimate: 1 hour
Dependencies: HIGH-002.1, HIGH-002.2

Replace basic mutexes with appropriate read-write locks where the usage pattern is predominantly read-access:

  • In arbitrage service for order book data
  • In scanner for cached results
  • In provider manager for connection state
  • Optimize for read-heavy scenarios
// Example implementation of proper RWLock usage
type ArbitrageService struct {
    mu          sync.RWMutex
    orderBooks  map[string]*OrderBook
    prices      map[string]*big.Float
}

func (as *ArbitrageService) GetPrice(pair string) (*big.Float, error) {
    as.mu.RLock()  // Use read lock for read operations
    defer as.mu.RUnlock()
    
    price, exists := as.prices[pair]
    if !exists {
        return nil, fmt.Errorf("price not found for pair: %s", pair)
    }
    
    return new(big.Float).Set(price), nil
}

func (as *ArbitrageService) UpdatePrice(pair string, price *big.Float) error {
    as.mu.Lock()  // Use write lock for updates
    defer as.mu.Unlock()
    
    as.prices[pair] = new(big.Float).Set(price)
    return nil
}

5. Conduct Comprehensive Race Condition Testing

Task ID: HIGH-002.5
Time Estimate: 0.5 hours
Dependencies: HIGH-002.2, HIGH-002.3, HIGH-002.4

Perform stress testing for race conditions:

  • High-concurrency unit tests
  • Load testing with concurrent access patterns
  • Long-running integration tests
  • Manual verification of synchronized access

Detailed Implementation Steps

In pkg/security/keymanager.go:

import (
    "sync"
    "sync/atomic"
)

type KeyManager struct {
    mu       sync.RWMutex
    keys     map[string]*ecdsa.PrivateKey
    counter  int64  // Changed from inconsistent usage
    // ... other fields
}

func (km *KeyManager) GetKey(id string) (*ecdsa.PrivateKey, error) {
    km.mu.RLock()
    defer km.mu.RUnlock()
    
    key, exists := km.keys[id]
    if !exists {
        return nil, fmt.Errorf("key not found")
    }
    return key, nil
}

func (km *KeyManager) UpdateCounter() {
    atomic.AddInt64(&km.counter, 1)  // Proper atomic usage
}

// Fix lines 481, 526, 531 to use appropriate synchronization
func (km *KeyManager) ProcessKey(id string) error {
    km.mu.Lock()
    defer km.mu.Unlock()
    
    key, exists := km.keys[id]
    if !exists {
        return fmt.Errorf("key does not exist")
    }
    
    // Process key operations with mutex held
    // ...
    return nil
}

In pkg/arbitrage/service.go:

type ArbitrageService struct {
    mu           sync.RWMutex
    orderBooks   map[string]*OrderBook
    strategies   sync.Map  // Use sync.Map for concurrent access
    // ... other fields
}

func (as *ArbitrageService) UpdateOrderBook(pair string, book *OrderBook) error {
    as.mu.Lock()
    defer as.mu.Unlock()
    
    as.orderBooks[pair] = book
    return nil
}

func (as *ArbitrageService) GetOrderBook(pair string) (*OrderBook, error) {
    as.mu.RLock()
    defer as.mu.RUnlock()
    
    book, exists := as.orderBooks[pair]
    if !exists {
        return nil, fmt.Errorf("order book not found for pair %s", pair)
    }
    return book, nil
}

In pkg/scanner/concurrent.go:

type ScannerWorkerPool struct {
    mu        sync.Mutex
    workers   []*Worker
    results   chan *ScanResult
    isActive  atomic.Bool
    // ... other fields
}

func (swp *ScannerWorkerPool) AddWorker(w *Worker) {
    swp.mu.Lock()
    defer swp.mu.Unlock()
    swp.workers = append(swp.workers, w)
}

func (swp *ScannerWorkerPool) SubmitResult(result *ScanResult) {
    // Use non-blocking send or handle channel full
    select {
    case swp.results <- result:
    default:
        // Handle full channel - log error or implement backup strategy
    }
}

Testing Strategy

  • Unit tests with high concurrency
  • Integration tests with race detection enabled
  • Stress testing with concurrent access patterns
  • Manual code review for synchronization logic

Code Review Checklist

  • All shared state is properly synchronized
  • Atomic operations used appropriately for simple values
  • Read-write locks used for read-heavy scenarios
  • Mutexes used for complex state changes
  • Race condition tests pass with -race flag
  • No deadlocks introduced
  • Performance impact is acceptable

Rollback Strategy

If issues arise after deployment:

  1. Revert synchronization changes
  2. Temporarily run in single-threaded mode for critical operations
  3. Monitor performance and stability metrics

Success Metrics

  • No race conditions detected with -race flag enabled
  • All concurrent tests pass consistently
  • No performance degradation beyond acceptable thresholds
  • No deadlocks or lock contention issues
  • Consistent state across all shared resources