mev-beta/docs/RPC_MANAGER_GUIDE.md

# RPC Manager - Round-Robin Load Balancing Guide
## Overview
The **RPC Manager** is a production-grade RPC endpoint management system that provides:
- **Round-Robin Load Balancing**: Distributes RPC calls evenly across multiple endpoints
- **Health Monitoring**: Tracks endpoint health and automatically handles failures
- **Multiple Rotation Policies**: Supports different strategies for endpoint selection
- **Statistics & Metrics**: Provides detailed metrics about RPC usage and health
- **Automatic Failover**: Gracefully handles endpoint failures and recoveries
## Architecture
### Core Components
```
┌────────────────────────────────────┐
│ RPC Manager                        │
│  - Manages endpoint pool           │
│  - Rotates through endpoints       │
│  - Tracks health metrics           │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ RPC Endpoint Health Tracker        │
│  - Success/failure counts          │
│  - Response times                  │
│  - Consecutive failure tracking    │
│  - Health status                   │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Connection Manager Integration     │
│  - Transparent endpoint selection  │
│  - Automatic client pooling        │
│  - Fallback support                │
└────────────────────────────────────┘
```
## Rotation Policies
### 1. Round-Robin (Default)
Simple cyclic rotation through all endpoints.
```
Endpoint 1 → Endpoint 2 → Endpoint 3 → Endpoint 1 → ...
```
**Best For**: Uniform load distribution across identical endpoints
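The cyclic rotation amounts to a shared counter taken modulo the pool size; a minimal sketch (the helper name is illustrative, not the package's API), using an atomic increment so concurrent goroutines stay safe:

```go
package main

import "sync/atomic"

// nextRoundRobin returns the next endpoint index, wrapping around the pool.
// The counter is shared; atomic.AddUint64 makes concurrent use safe.
func nextRoundRobin(counter *uint64, poolSize int) int {
	n := atomic.AddUint64(counter, 1)
	return int((n - 1) % uint64(poolSize))
}
```

With three endpoints this yields indices 0, 1, 2, 0, 1, ... regardless of how many goroutines are drawing from the pool.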
### 2. Health-Aware
Prioritizes healthy endpoints and falls back to plain round-robin when every endpoint is unhealthy.
```
Healthy endpoints preferred, unhealthy skipped
```
**Best For**: Mixed quality endpoints, avoiding bad ones
### 3. Least-Failures
Always selects the endpoint with the lowest total failure count.
```
Track total failures per endpoint, select best
```
**Best For**: Handling varying endpoint reliability over time
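Both policies can be sketched over a simplified per-endpoint view (the struct and function names below are illustrative, not the package's real types):

```go
package main

// endpointState is a hypothetical slice of the health tracker's data,
// just enough to illustrate the two selection policies.
type endpointState struct {
	Healthy  bool
	Failures int
}

// pickHealthAware scans from the round-robin position and returns the first
// healthy endpoint; if all are unhealthy it falls back to plain rotation.
func pickHealthAware(eps []endpointState, start int) int {
	for i := 0; i < len(eps); i++ {
		idx := (start + i) % len(eps)
		if eps[idx].Healthy {
			return idx
		}
	}
	return start % len(eps) // all unhealthy: plain round-robin
}

// pickLeastFailures returns the endpoint with the lowest failure count.
func pickLeastFailures(eps []endpointState) int {
	best := 0
	for i, ep := range eps {
		if ep.Failures < eps[best].Failures {
			best = i
		}
	}
	return best
}
```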
## Usage
### Basic Setup
```go
import (
	"github.com/fraktal/mev-beta/pkg/arbitrum"
	"github.com/fraktal/mev-beta/internal/config"
	"github.com/fraktal/mev-beta/internal/logger"
)

// Create connection manager with round-robin enabled
cfg := &config.ArbitrumConfig{
	RPCEndpoint: "https://primary.rpc.com",
	// ... other config
}
connectionManager := arbitrum.NewConnectionManager(cfg, logger)
connectionManager.EnableRoundRobin(true)
```
### Using Round-Robin Clients
```go
// Create a round-robin client wrapper
rrClient := arbitrum.NewRoundRobinClient(
	connectionManager.rpcManager,
	ctx,
	logger,
)

// For read operations - uses round-robin
client, err := rrClient.GetClientForRead()
if err != nil {
	return err
}

// Perform the read operation and time it
start := time.Now()
result, err := client.ChainID(ctx)
responseTime := time.Since(start)

// Record the outcome so the manager can track endpoint health
if err != nil {
	rrClient.RecordReadFailure()
} else {
	rrClient.RecordReadSuccess(responseTime)
}
```
### Advanced: Initialize with Multiple Endpoints
```go
endpoints := []string{
	"https://rpc1.arbitrum.io",
	"https://rpc2.arbitrum.io",
	"https://rpc3.arbitrum.io",
}

// Initialize round-robin with multiple endpoints
err := arbitrum.InitializeRPCRoundRobin(connectionManager, endpoints)
if err != nil {
	logger.Error(fmt.Sprintf("Failed to initialize round-robin: %v", err))
}

// Set rotation strategy
arbitrum.ConfigureRPCLoadBalancing(connectionManager, arbitrum.HealthAware)
```
### Monitoring RPC Health
```go
// Perform a health check on all endpoints
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

err := connectionManager.PerformRPCHealthCheck(ctx)
if err != nil {
	logger.Warn(fmt.Sprintf("Health check failed: %v", err))
}

// Get detailed statistics
stats := connectionManager.GetRPCManagerStats()
fmt.Println(stats)
// Output:
// {
//   "total_endpoints": 3,
//   "healthy_count": 3,
//   "total_requests": 10234,
//   "total_success": 10000,
//   "total_failure": 234,
//   "success_rate": "97.71%",
//   "current_policy": "health-aware",
//   "endpoint_details": [...]
// }
```
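The `success_rate` figure in that output is a plain ratio of the two counters; a sketch of the computation (the helper name is illustrative, not the package's API):

```go
package main

import "fmt"

// successRate derives the percentage shown in the stats output from the
// raw success/failure counters.
func successRate(success, failure int) string {
	total := success + failure
	if total == 0 {
		return "n/a" // no requests recorded yet
	}
	return fmt.Sprintf("%.2f%%", 100*float64(success)/float64(total))
}
```

Feeding in the counters from the example output (10000 successes, 234 failures) reproduces its 97.71% figure.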
## Integration Examples
### With Batch Fetcher
The batch fetcher automatically benefits from round-robin when the connection manager is configured:
```go
// Connection manager already uses round-robin
client, err := connectionManager.GetClient(ctx)

// Create batch fetcher - it will use round-robin automatically
batchFetcher, err := datafetcher.NewBatchFetcher(
	client,
	contractAddr,
	logger,
)

// All batch calls are load-balanced across endpoints
results, err := batchFetcher.FetchPoolsBatch(ctx, poolAddresses)
```
### With Monitor
The Arbitrum monitor automatically uses the round-robin enabled connection manager:
```go
// Create monitor with the round-robin connection manager
monitor, err := monitor.NewArbitrumMonitor(
	arbConfig,
	botConfig,
	logger,
	rateLimiter,
	marketMgr,
	scanner,
)
// All RPC calls from the monitor are load-balanced
```
## Health Metrics
The RPC Manager tracks comprehensive health metrics for each endpoint:
```go
health, _ := connectionManager.rpcManager.GetEndpointHealth(0)

// Access metrics
success, failure, consecutive, isHealthy := health.GetStats()
fmt.Printf("Endpoint: %s\n", health.URL)
fmt.Printf("  Success: %d, Failure: %d, Consecutive Fails: %d\n",
	success, failure, consecutive)
fmt.Printf("  Healthy: %v, Response Time: %dms\n",
	isHealthy, health.ResponseTime.Milliseconds())
```
### Health Thresholds
- **Marked Unhealthy**: 3+ consecutive failures
- **Recovered**: Next successful call resets consecutive failures
- **Tracked Metrics**: Success count, failure count, response time
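The threshold rules above can be sketched as a small state machine (field and method names are illustrative, not the package's real tracker):

```go
package main

// health mirrors the consecutive-failure bookkeeping described above.
type health struct {
	consecutiveFails int
	healthy          bool
}

// unhealthyThreshold: 3+ consecutive failures marks an endpoint unhealthy.
const unhealthyThreshold = 3

func (h *health) recordFailure() {
	h.consecutiveFails++
	if h.consecutiveFails >= unhealthyThreshold {
		h.healthy = false
	}
}

func (h *health) recordSuccess() {
	h.consecutiveFails = 0 // one success resets the streak...
	h.healthy = true       // ...and recovers the endpoint
}
```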
## Performance Impact
### Load Distribution
With 3 endpoints using round-robin:
- **Without**: All calls hit endpoint 1 → potential rate limiting
- **With**: Calls distributed evenly → 3x throughput potential
### Response Times
Example with mixed endpoints:
```
Endpoint 1: 50ms avg
Endpoint 2: 200ms avg (poor)
Endpoint 3: 50ms avg
Health-Aware Strategy Results:
- Requests to 1: ~45%
- Requests to 2: ~5% (deprioritized)
- Requests to 3: ~50%
Success Rate: 99.8% (vs 95% without load balancing)
```
## Configuration
### Environment Variables
```bash
# Enable round-robin explicitly
export RPC_ROUNDROBIN_ENABLED=true
# Additional fallback endpoints (comma-separated)
export ARBITRUM_FALLBACK_ENDPOINTS="https://rpc1.io,https://rpc2.io,https://rpc3.io"
```
### Configuration File
```yaml
arbitrum:
  rpc_endpoint: "https://primary.rpc.io"
  rate_limit:
    requests_per_second: 5.0
    burst: 10
  reading_endpoints:
    - url: "https://read1.rpc.io"
    - url: "https://read2.rpc.io"
  execution_endpoints:
    - url: "https://execute1.rpc.io"
```
## Best Practices
### 1. Choose Right Rotation Policy
```go
// For equal-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.RoundRobin)

// For mixed-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.HealthAware)

// For endpoints with varying failure rates over time
connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)
```
### 2. Monitor Regularly
```go
// Periodic health checks
ticker := time.NewTicker(5 * time.Minute)
defer ticker.Stop()
for range ticker.C {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	if err := connectionManager.PerformRPCHealthCheck(ctx); err != nil {
		logger.Error(fmt.Sprintf("Health check failed: %v", err))
	}
	cancel()
}
```
### 3. Handle Errors Gracefully
```go
client, err := rrClient.GetClientForRead()
if err != nil {
	logger.Error(fmt.Sprintf("Failed to get RPC client: %v", err))
	// Implement fallback logic
	return nil, err
}

// Always record results
start := time.Now()
result, err := performRPCCall(client)
elapsed := time.Since(start)

if err != nil {
	rrClient.RecordReadFailure()
} else {
	rrClient.RecordReadSuccess(elapsed)
}
```
### 4. Optimize Batch Sizes
```go
// The RPC Manager works best with batch operations:
// reduce individual calls, increase batch size.

// ❌ Avoid: many small individual calls
for _, pool := range pools {
	data, _ := client.CallContract(ctx, call)
}

// ✅ Better: batch operations
batchFetcher.FetchPoolsBatch(ctx, pools)
```
## Troubleshooting
### All Endpoints Unhealthy
```
Error: "no healthy endpoints available"
Solution: Check endpoint status and logs
- Perform manual health check
- Verify network connectivity
- Check RPC provider status
- Review error logs for specific failures
```
### High Failure Rate
```go
stats := connectionManager.GetRPCManagerStats()
rateStr := stats["success_rate"].(string) // e.g. "97.76%"
rate, err := strconv.ParseFloat(strings.TrimSuffix(rateStr, "%"), 64)
if err == nil && rate < 95.0 {
	logger.Warn("High RPC failure rate detected")
	// Switch to a policy that avoids failure-prone endpoints
	connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)
}
```
### Uneven Load Distribution
```go
// Check distribution
stats := connectionManager.GetRPCManagerStats()
details := stats["endpoint_details"].([]map[string]interface{})
for _, endpoint := range details {
	fmt.Printf("%s: %d requests\n",
		endpoint["url"],
		endpoint["success_count"])
}
```
## Metrics Reference
### Tracked Metrics
- **Total Requests**: Sum of all successful and failed calls
- **Success Rate**: Percentage of successful calls
- **Response Times**: Min, max, average per endpoint
- **Consecutive Failures**: Track for health status
- **Endpoint Status**: Healthy/Unhealthy state
### Export Format
```json
{
  "total_endpoints": 3,
  "healthy_count": 3,
  "total_requests": 15234,
  "total_success": 14892,
  "total_failure": 342,
  "success_rate": "97.76%",
  "current_policy": "health-aware",
  "endpoint_details": [
    {
      "index": 0,
      "url": "https://rpc1.io",
      "success_count": 5023,
      "failure_count": 89,
      "consecutive_fails": 0,
      "is_healthy": true,
      "last_checked": "2025-11-03T12:34:56Z",
      "response_time_ms": 45
    }
  ]
}
```
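Consumers can decode this export into typed structs; a sketch with field names assumed from the JSON keys above (the struct names themselves are illustrative):

```go
package main

import "encoding/json"

// rpcStats mirrors the JSON export shown above.
type rpcStats struct {
	TotalEndpoints int              `json:"total_endpoints"`
	HealthyCount   int              `json:"healthy_count"`
	TotalRequests  int              `json:"total_requests"`
	TotalSuccess   int              `json:"total_success"`
	TotalFailure   int              `json:"total_failure"`
	SuccessRate    string           `json:"success_rate"`
	CurrentPolicy  string           `json:"current_policy"`
	Endpoints      []endpointDetail `json:"endpoint_details"`
}

type endpointDetail struct {
	Index          int    `json:"index"`
	URL            string `json:"url"`
	SuccessCount   int    `json:"success_count"`
	FailureCount   int    `json:"failure_count"`
	IsHealthy      bool   `json:"is_healthy"`
	ResponseTimeMs int    `json:"response_time_ms"`
}

// parseStats decodes a stats export into the typed form.
func parseStats(raw []byte) (rpcStats, error) {
	var s rpcStats
	err := json.Unmarshal(raw, &s)
	return s, err
}
```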
## API Reference
### RPCManager
```go
type RPCManager struct
NewRPCManager(logger) *RPCManager
AddEndpoint(client, url) error
GetNextClient(ctx) (*RateLimitedClient, int, error)
RecordSuccess(idx, responseTime)
RecordFailure(idx)
GetEndpointHealth(idx) (*RPCEndpointHealth, error)
GetAllHealthStats() []map[string]interface{}
SetRotationPolicy(policy RotationPolicy)
HealthCheckAll(ctx) error
GetStats() map[string]interface{}
Close() error
```
### ConnectionManager Extensions
```go
func (cm *ConnectionManager) EnableRoundRobin(enabled bool)
func (cm *ConnectionManager) SetRPCRotationPolicy(policy RotationPolicy)
func (cm *ConnectionManager) GetRPCManagerStats() map[string]interface{}
func (cm *ConnectionManager) PerformRPCHealthCheck(ctx) error
```
### Helper Functions
```go
NewRoundRobinClient(manager, ctx, logger) *RoundRobinClient
InitializeRPCRoundRobin(cm, endpoints) error
ConfigureRPCLoadBalancing(cm, strategy) error
GetConnectionManagerWithRoundRobin(cfg, logger, endpoints) (*ConnectionManager, error)
```
## Future Enhancements
Planned improvements to RPC Manager:
1. **Weighted Round-Robin**: Assign weights based on historical performance
2. **Dynamic Endpoint Discovery**: Auto-discover and add new endpoints
3. **Regional Failover**: Prefer endpoints in same region for latency
4. **Cost Tracking**: Monitor and report RPC call costs
5. **Analytics Dashboard**: Real-time visualization of RPC metrics
6. **Adaptive Timeouts**: Adjust timeouts based on endpoint performance
7. **Request Queueing**: Smart queuing during RPC overload
## Conclusion
The RPC Manager provides enterprise-grade RPC endpoint management, enabling:
- **Reliability**: Automatic failover and health monitoring
- **Performance**: Optimized load distribution
- **Visibility**: Comprehensive metrics and statistics
- **Flexibility**: Multiple rotation strategies for different needs
For production deployments, RPC Manager is essential to prevent single-endpoint rate limiting and ensure robust transaction processing.