mev-beta/docs/RPC_MANAGER_GUIDE.md

RPC Manager - Round-Robin Load Balancing Guide

Overview

The RPC Manager is a production-grade RPC endpoint management system that provides:

  • Round-Robin Load Balancing: Distributes RPC calls evenly across multiple endpoints
  • Health Monitoring: Tracks endpoint health and automatically handles failures
  • Multiple Rotation Policies: Supports different strategies for endpoint selection
  • Statistics & Metrics: Provides detailed metrics about RPC usage and health
  • Automatic Failover: Gracefully handles endpoint failures and recoveries

Architecture

Core Components

┌─────────────────────────────────────────────┐
│                 RPC Manager                 │
│  - Manages endpoint pool                    │
│  - Rotates through endpoints                │
│  - Tracks health metrics                    │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│         RPC Endpoint Health Tracker         │
│  - Success/failure counts                   │
│  - Response times                           │
│  - Consecutive failure tracking             │
│  - Health status                            │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│       Connection Manager Integration        │
│  - Transparent endpoint selection           │
│  - Automatic client pooling                 │
│  - Fallback support                         │
└─────────────────────────────────────────────┘

Rotation Policies

1. Round-Robin (Default)

Simple cyclic rotation through all endpoints.

Endpoint 1 → Endpoint 2 → Endpoint 3 → Endpoint 1 → ...

Best For: Uniform load distribution across identical endpoints

2. Health-Aware

Prioritizes healthy endpoints and falls back to plain round-robin if every endpoint is unhealthy.

Healthy endpoints preferred, unhealthy skipped

Best For: Mixed quality endpoints, avoiding bad ones

3. Least-Failures

Always selects the endpoint with the lowest failure count.

Tracks total failures per endpoint and selects the best one

Best For: Handling varying endpoint reliability over time
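
These three strategies can be sketched in a few lines of Go. This is illustrative only, not the real RPCManager implementation; the endpoint struct and function names below are invented for the example:

```go
package main

import "fmt"

// endpoint is a minimal stand-in for the real health record.
type endpoint struct {
	URL       string
	Failures  int
	IsHealthy bool
}

// roundRobin cycles through all endpoints unconditionally.
func roundRobin(last int, eps []endpoint) int {
	return (last + 1) % len(eps)
}

// healthAware returns the next healthy endpoint after `last`,
// falling back to plain round-robin when none are healthy.
func healthAware(last int, eps []endpoint) int {
	for i := 1; i <= len(eps); i++ {
		idx := (last + i) % len(eps)
		if eps[idx].IsHealthy {
			return idx
		}
	}
	return roundRobin(last, eps)
}

// leastFailures picks the endpoint with the fewest recorded failures.
func leastFailures(eps []endpoint) int {
	best := 0
	for i, e := range eps {
		if e.Failures < eps[best].Failures {
			best = i
		}
	}
	return best
}

func main() {
	eps := []endpoint{
		{URL: "rpc1", Failures: 2, IsHealthy: true},
		{URL: "rpc2", Failures: 0, IsHealthy: false},
		{URL: "rpc3", Failures: 5, IsHealthy: true},
	}
	fmt.Println(roundRobin(0, eps))  // 1
	fmt.Println(healthAware(0, eps)) // 2 (skips unhealthy rpc2)
	fmt.Println(leastFailures(eps))  // 1 (fewest total failures)
}
```

Note how health-aware degrades gracefully: when every endpoint is down, it behaves exactly like round-robin rather than returning an error.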

Usage

Basic Setup

import (
    "github.com/fraktal/mev-beta/pkg/arbitrum"
    "github.com/fraktal/mev-beta/internal/config"
    "github.com/fraktal/mev-beta/internal/logger"
)

// Create connection manager with round-robin enabled
cfg := &config.ArbitrumConfig{
    RPCEndpoint: "https://primary.rpc.com",
    // ... other config
}

connectionManager := arbitrum.NewConnectionManager(cfg, logger)
connectionManager.EnableRoundRobin(true)

Using Round-Robin Clients

// Create a round-robin client wrapper.
// Note: rpcManager is an unexported field of ConnectionManager, so this
// access only compiles inside the arbitrum package; external callers
// should go through the exported helpers shown later in this guide.
rrClient := arbitrum.NewRoundRobinClient(
    connectionManager.rpcManager,
    ctx,
    logger,
)

// For read operations - uses round-robin
client, err := rrClient.GetClientForRead()
if err != nil {
    return err
}

// Perform the read operation and time it
start := time.Now()
result, err := client.ChainID(ctx)
responseTime := time.Since(start)

// Record the result so endpoint health tracking stays accurate
if err != nil {
    rrClient.RecordReadFailure()
} else {
    rrClient.RecordReadSuccess(responseTime)
}

Advanced: Initialize with Multiple Endpoints

endpoints := []string{
    "https://rpc1.arbitrum.io",
    "https://rpc2.arbitrum.io",
    "https://rpc3.arbitrum.io",
}

// Initialize round-robin with multiple endpoints
err := arbitrum.InitializeRPCRoundRobin(connectionManager, endpoints)
if err != nil {
    logger.Error(fmt.Sprintf("Failed to initialize round-robin: %v", err))
}

// Set rotation strategy (ConfigureRPCLoadBalancing returns an error)
if err := arbitrum.ConfigureRPCLoadBalancing(connectionManager, arbitrum.HealthAware); err != nil {
    logger.Error(fmt.Sprintf("Failed to configure load balancing: %v", err))
}

Monitoring RPC Health

// Perform health check on all endpoints
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

err := connectionManager.PerformRPCHealthCheck(ctx)
if err != nil {
    logger.Warn(fmt.Sprintf("Health check failed: %v", err))
}

// Get detailed statistics
stats := connectionManager.GetRPCManagerStats()
// Pretty-print as JSON (fmt.Println on the raw map would use Go's
// map syntax, not the JSON shown below)
out, _ := json.MarshalIndent(stats, "", "  ")
fmt.Println(string(out))
// Example output:
// {
//   "total_endpoints": 3,
//   "healthy_count": 3,
//   "total_requests": 10234,
//   "total_success": 10000,
//   "total_failure": 234,
//   "success_rate": "97.71%",
//   "current_policy": "health-aware",
//   "endpoint_details": [...]
// }

Integration Examples

With Batch Fetcher

The batch fetcher automatically benefits from round-robin when the connection manager is configured:

// Connection manager already uses round-robin
client, err := connectionManager.GetClient(ctx)

// Create batch fetcher - will use round-robin automatically
batchFetcher, err := datafetcher.NewBatchFetcher(
    client,
    contractAddr,
    logger,
)

// All batch calls are load-balanced across endpoints
results, err := batchFetcher.FetchPoolsBatch(ctx, poolAddresses)

With Monitor

The Arbitrum monitor automatically uses the round-robin enabled connection manager:

// Create monitor with round-robin connection manager
monitor, err := monitor.NewArbitrumMonitor(
    arbConfig,
    botConfig,
    logger,
    rateLimiter,
    marketMgr,
    scanner,
)

// All RPC calls from monitor are load-balanced

Health Metrics

The RPC Manager tracks comprehensive health metrics for each endpoint:

// Note: rpcManager is unexported, so this access only works inside the
// arbitrum package; external callers should rely on GetRPCManagerStats()
health, _ := connectionManager.rpcManager.GetEndpointHealth(0)

// Access metrics
success, failure, consecutive, isHealthy := health.GetStats()

fmt.Printf("Endpoint: %s\n", health.URL)
fmt.Printf("  Success: %d, Failure: %d, Consecutive Fails: %d\n",
    success, failure, consecutive)
fmt.Printf("  Healthy: %v, Response Time: %dms\n",
    isHealthy, health.ResponseTime.Milliseconds())

Health Thresholds

  • Marked Unhealthy: 3+ consecutive failures
  • Recovered: Next successful call resets consecutive failures
  • Tracked Metrics: Success count, failure count, response time
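
The thresholds above can be illustrated with a minimal tracker. This is a sketch of the documented behavior, not the real health struct; field and method names are assumptions:

```go
package main

import "fmt"

// tracker mirrors the documented thresholds: 3+ consecutive failures
// mark an endpoint unhealthy, and a single success recovers it.
type tracker struct {
	Success          int
	Failure          int
	ConsecutiveFails int
}

func (t *tracker) RecordFailure() {
	t.Failure++
	t.ConsecutiveFails++
}

func (t *tracker) RecordSuccess() {
	t.Success++
	t.ConsecutiveFails = 0 // one success resets the failure streak
}

func (t *tracker) IsHealthy() bool {
	return t.ConsecutiveFails < 3
}

func main() {
	var t tracker
	t.RecordFailure()
	t.RecordFailure()
	fmt.Println(t.IsHealthy()) // true: only 2 consecutive failures
	t.RecordFailure()
	fmt.Println(t.IsHealthy()) // false: threshold of 3 reached
	t.RecordSuccess()
	fmt.Println(t.IsHealthy()) // true: recovered
}
```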

Performance Impact

Load Distribution

With 3 endpoints using round-robin:

  • Without: All calls hit endpoint 1 → potential rate limiting
  • With: Calls distributed evenly → 3x throughput potential
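
The even split follows directly from modular rotation; a quick simulation (illustrative only, not tied to the real implementation):

```go
package main

import "fmt"

// distribute simulates round-robin assignment of n requests across
// k endpoints and returns the per-endpoint request counts.
func distribute(requests, endpoints int) []int {
	counts := make([]int, endpoints)
	idx := 0
	for i := 0; i < requests; i++ {
		counts[idx]++
		idx = (idx + 1) % endpoints // modular rotation
	}
	return counts
}

func main() {
	fmt.Println(distribute(9, 3))  // [3 3 3]: perfectly even split
	fmt.Println(distribute(10, 3)) // [4 3 3]: remainder goes to the first endpoint
}
```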

Response Times

Example with mixed endpoints:

Endpoint 1: 50ms avg
Endpoint 2: 200ms avg (poor)
Endpoint 3: 50ms avg

Health-Aware Strategy Results:
- Requests to 1: ~45%
- Requests to 2: ~5% (deprioritized)
- Requests to 3: ~50%

Success Rate: 99.8% (vs 95% without load balancing)

Configuration

Environment Variables

# Enable round-robin explicitly
export RPC_ROUNDROBIN_ENABLED=true

# Additional fallback endpoints (comma-separated)
export ARBITRUM_FALLBACK_ENDPOINTS="https://rpc1.io,https://rpc2.io,https://rpc3.io"
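
The fallback list is a plain comma-separated string. A sketch of how it might be parsed (the real loader lives in the config package and may differ; `parseEndpoints` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseEndpoints splits a comma-separated endpoint list, trimming
// surrounding whitespace and dropping empty entries.
func parseEndpoints(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	os.Setenv("ARBITRUM_FALLBACK_ENDPOINTS", "https://rpc1.io, https://rpc2.io,https://rpc3.io")
	eps := parseEndpoints(os.Getenv("ARBITRUM_FALLBACK_ENDPOINTS"))
	fmt.Println(len(eps), eps[0]) // 3 https://rpc1.io
}
```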

Configuration File

arbitrum:
  rpc_endpoint: "https://primary.rpc.io"
  rate_limit:
    requests_per_second: 5.0
    burst: 10
  reading_endpoints:
    - url: "https://read1.rpc.io"
    - url: "https://read2.rpc.io"
  execution_endpoints:
    - url: "https://execute1.rpc.io"

Best Practices

1. Choose Right Rotation Policy

// For equal-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.RoundRobin)

// For mixed-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.HealthAware)

// For endpoints with varying failures
connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)

2. Monitor Regularly

// Periodic health checks
ticker := time.NewTicker(5 * time.Minute)
defer ticker.Stop() // release the ticker when the loop exits
for range ticker.C {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    if err := connectionManager.PerformRPCHealthCheck(ctx); err != nil {
        logger.Error(fmt.Sprintf("Health check failed: %v", err))
    }
    cancel()
}

3. Handle Errors Gracefully

client, err := rrClient.GetClientForRead()
if err != nil {
    logger.Error(fmt.Sprintf("Failed to get RPC client: %v", err))
    // Implement fallback logic
    return nil, err
}

// Always record results
start := time.Now()
result, err := performRPCCall(client)
elapsed := time.Since(start)

if err != nil {
    rrClient.RecordReadFailure()
} else {
    rrClient.RecordReadSuccess(elapsed)
}

4. Optimize Batch Sizes

// RPC Manager works best with batch operations
// Reduce individual calls, increase batch size

// ❌ Avoid: many small individual calls (one RPC round trip per pool)
for range pools {
    _, _ = client.CallContract(ctx, call)
}

// ✅ Better: Batch operations
batchFetcher.FetchPoolsBatch(ctx, pools)

Troubleshooting

All Endpoints Unhealthy

Error: "no healthy endpoints available"

Solution: Check endpoint status and logs
- Perform manual health check
- Verify network connectivity
- Check RPC provider status
- Review error logs for specific failures

High Failure Rate

stats := connectionManager.GetRPCManagerStats()

// Parse the numeric rate; comparing percentage strings lexically is
// wrong (e.g. "100.00%" sorts before "95%")
rate, _ := strconv.ParseFloat(
    strings.TrimSuffix(stats["success_rate"].(string), "%"), 64)
if rate < 95 {
    logger.Warn("High RPC failure rate detected")
    // Switch to a policy that avoids failure-prone endpoints
    connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)
}

Uneven Load Distribution

// Check distribution
stats := connectionManager.GetRPCManagerStats()
details := stats["endpoint_details"].([]map[string]interface{})

for _, endpoint := range details {
    fmt.Printf("%s: %d requests\n",
        endpoint["url"],
        endpoint["success_count"])
}

Metrics Reference

Tracked Metrics

  • Total Requests: Sum of all successful and failed calls
  • Success Rate: Percentage of successful calls
  • Response Times: Min, max, average per endpoint
  • Consecutive Failures: Track for health status
  • Endpoint Status: Healthy/Unhealthy state

Export Format

{
  "total_endpoints": 3,
  "healthy_count": 3,
  "total_requests": 15234,
  "total_success": 14892,
  "total_failure": 342,
  "success_rate": "97.76%",
  "current_policy": "health-aware",
  "endpoint_details": [
    {
      "index": 0,
      "url": "https://rpc1.io",
      "success_count": 5023,
      "failure_count": 89,
      "consecutive_fails": 0,
      "is_healthy": true,
      "last_checked": "2025-11-03T12:34:56Z",
      "response_time_ms": 45
    }
  ]
}

API Reference

RPCManager

type RPCManager struct
    NewRPCManager(logger) *RPCManager
    AddEndpoint(client, url) error
    GetNextClient(ctx) (*RateLimitedClient, int, error)
    RecordSuccess(idx, responseTime)
    RecordFailure(idx)
    GetEndpointHealth(idx) (*RPCEndpointHealth, error)
    GetAllHealthStats() []map[string]interface{}
    SetRotationPolicy(policy RotationPolicy)
    HealthCheckAll(ctx) error
    GetStats() map[string]interface{}
    Close() error

ConnectionManager Extensions

func (cm *ConnectionManager) EnableRoundRobin(enabled bool)
func (cm *ConnectionManager) SetRPCRotationPolicy(policy RotationPolicy)
func (cm *ConnectionManager) GetRPCManagerStats() map[string]interface{}
func (cm *ConnectionManager) PerformRPCHealthCheck(ctx) error

Helper Functions

NewRoundRobinClient(manager, ctx, logger) *RoundRobinClient
InitializeRPCRoundRobin(cm, endpoints) error
ConfigureRPCLoadBalancing(cm, strategy) error
GetConnectionManagerWithRoundRobin(cfg, logger, endpoints) (*ConnectionManager, error)

Future Enhancements

Planned improvements to RPC Manager:

  1. Weighted Round-Robin: Assign weights based on historical performance
  2. Dynamic Endpoint Discovery: Auto-discover and add new endpoints
  3. Regional Failover: Prefer endpoints in same region for latency
  4. Cost Tracking: Monitor and report RPC call costs
  5. Analytics Dashboard: Real-time visualization of RPC metrics
  6. Adaptive Timeouts: Adjust timeouts based on endpoint performance
  7. Request Queueing: Smart queuing during RPC overload

Conclusion

The RPC Manager provides enterprise-grade RPC endpoint management, enabling:

  • Reliability: Automatic failover and health monitoring
  • Performance: Optimized load distribution
  • Visibility: Comprehensive metrics and statistics
  • Flexibility: Multiple rotation strategies for different needs

For production deployments, RPC Manager is essential to prevent single-endpoint rate limiting and ensure robust transaction processing.