mev-beta/docs/RPC_MANAGER_GUIDE.md

# RPC Manager - Round-Robin Load Balancing Guide
## Overview
The **RPC Manager** is a production-grade RPC endpoint management system that provides:
- **Round-Robin Load Balancing**: Distributes RPC calls evenly across multiple endpoints
- **Health Monitoring**: Tracks endpoint health and automatically handles failures
- **Multiple Rotation Policies**: Supports different strategies for endpoint selection
- **Statistics & Metrics**: Provides detailed metrics about RPC usage and health
- **Automatic Failover**: Gracefully handles endpoint failures and recoveries
## Architecture
### Core Components
```
┌────────────────────────────────────┐
│ RPC Manager                        │
│  - Manages endpoint pool           │
│  - Rotates through endpoints       │
│  - Tracks health metrics           │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ RPC Endpoint Health Tracker        │
│  - Success/failure counts          │
│  - Response times                  │
│  - Consecutive failure tracking    │
│  - Health status                   │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│ Connection Manager Integration     │
│  - Transparent endpoint selection  │
│  - Automatic client pooling        │
│  - Fallback support                │
└────────────────────────────────────┘
```
## Rotation Policies
### 1. Round-Robin (Default)
Simple cyclic rotation through all endpoints.
```
Endpoint 1 → Endpoint 2 → Endpoint 3 → Endpoint 1 → ...
```
**Best For**: Uniform load distribution across identical endpoints
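The cyclic rotation amounts to a shared counter taken modulo the pool size; a minimal sketch (the helper name is illustrative, not the package's API), using an atomic increment so concurrent goroutines stay safe:

```go
package main

import "sync/atomic"

// nextRoundRobin returns the next endpoint index, wrapping around the pool.
// The counter is shared; atomic.AddUint64 makes concurrent use safe.
func nextRoundRobin(counter *uint64, poolSize int) int {
	n := atomic.AddUint64(counter, 1)
	return int((n - 1) % uint64(poolSize))
}
```

With three endpoints this yields indices 0, 1, 2, 0, 1, ... regardless of how many goroutines are drawing from the pool.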
### 2. Health-Aware
Prioritizes healthy endpoints and falls back to plain round-robin when every endpoint is unhealthy.
```
Healthy endpoints preferred, unhealthy skipped
```
**Best For**: Mixed quality endpoints, avoiding bad ones
### 3. Least-Failures
Always selects the endpoint with the lowest total failure count.
```
Track total failures per endpoint, select best
```
**Best For**: Handling varying endpoint reliability over time
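Both policies can be sketched over a simplified per-endpoint view (the struct and function names below are illustrative, not the package's real types):

```go
package main

// endpointState is a hypothetical slice of the health tracker's data,
// just enough to illustrate the two selection policies.
type endpointState struct {
	Healthy  bool
	Failures int
}

// pickHealthAware scans from the round-robin position and returns the first
// healthy endpoint; if all are unhealthy it falls back to plain rotation.
func pickHealthAware(eps []endpointState, start int) int {
	for i := 0; i < len(eps); i++ {
		idx := (start + i) % len(eps)
		if eps[idx].Healthy {
			return idx
		}
	}
	return start % len(eps) // all unhealthy: plain round-robin
}

// pickLeastFailures returns the endpoint with the lowest failure count.
func pickLeastFailures(eps []endpointState) int {
	best := 0
	for i, ep := range eps {
		if ep.Failures < eps[best].Failures {
			best = i
		}
	}
	return best
}
```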
## Usage
### Basic Setup
```go
import (
	"github.com/fraktal/mev-beta/pkg/arbitrum"
	"github.com/fraktal/mev-beta/internal/config"
	"github.com/fraktal/mev-beta/internal/logger"
)

// Create connection manager with round-robin enabled
cfg := &config.ArbitrumConfig{
	RPCEndpoint: "https://primary.rpc.com",
	// ... other config
}
connectionManager := arbitrum.NewConnectionManager(cfg, logger)
connectionManager.EnableRoundRobin(true)
```
### Using Round-Robin Clients
```go
// Create a round-robin client wrapper
rrClient := arbitrum.NewRoundRobinClient(
	connectionManager.rpcManager,
	ctx,
	logger,
)

// For read operations - uses round-robin
client, err := rrClient.GetClientForRead()
if err != nil {
	return err
}

// Perform the read operation and time it
start := time.Now()
result, err := client.ChainID(ctx)
responseTime := time.Since(start)

// Record the outcome so the manager can track endpoint health
if err != nil {
	rrClient.RecordReadFailure()
} else {
	rrClient.RecordReadSuccess(responseTime)
}
```
### Advanced: Initialize with Multiple Endpoints
```go
endpoints := []string{
	"https://rpc1.arbitrum.io",
	"https://rpc2.arbitrum.io",
	"https://rpc3.arbitrum.io",
}

// Initialize round-robin with multiple endpoints
err := arbitrum.InitializeRPCRoundRobin(connectionManager, endpoints)
if err != nil {
	logger.Error(fmt.Sprintf("Failed to initialize round-robin: %v", err))
}

// Set rotation strategy
arbitrum.ConfigureRPCLoadBalancing(connectionManager, arbitrum.HealthAware)
```
### Monitoring RPC Health
```go
// Perform a health check on all endpoints
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

err := connectionManager.PerformRPCHealthCheck(ctx)
if err != nil {
	logger.Warn(fmt.Sprintf("Health check failed: %v", err))
}

// Get detailed statistics
stats := connectionManager.GetRPCManagerStats()
fmt.Println(stats)
// Output:
// {
//   "total_endpoints": 3,
//   "healthy_count": 3,
//   "total_requests": 10234,
//   "total_success": 10000,
//   "total_failure": 234,
//   "success_rate": "97.71%",
//   "current_policy": "health-aware",
//   "endpoint_details": [...]
// }
```
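The `success_rate` figure in that output is a plain ratio of the two counters; a sketch of the computation (the helper name is illustrative, not the package's API):

```go
package main

import "fmt"

// successRate derives the percentage shown in the stats output from the
// raw success/failure counters.
func successRate(success, failure int) string {
	total := success + failure
	if total == 0 {
		return "n/a" // no requests recorded yet
	}
	return fmt.Sprintf("%.2f%%", 100*float64(success)/float64(total))
}
```

Feeding in the counters from the example output (10000 successes, 234 failures) reproduces its 97.71% figure.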
## Integration Examples
### With Batch Fetcher
The batch fetcher automatically benefits from round-robin when the connection manager is configured:
```go
// Connection manager already uses round-robin
client, err := connectionManager.GetClient(ctx)

// Create batch fetcher - it will use round-robin automatically
batchFetcher, err := datafetcher.NewBatchFetcher(
	client,
	contractAddr,
	logger,
)

// All batch calls are load-balanced across endpoints
results, err := batchFetcher.FetchPoolsBatch(ctx, poolAddresses)
```
### With Monitor
The Arbitrum monitor automatically uses the round-robin enabled connection manager:
```go
// Create monitor with the round-robin connection manager
monitor, err := monitor.NewArbitrumMonitor(
	arbConfig,
	botConfig,
	logger,
	rateLimiter,
	marketMgr,
	scanner,
)
// All RPC calls from the monitor are load-balanced
```
## Health Metrics
The RPC Manager tracks comprehensive health metrics for each endpoint:
```go
health, _ := connectionManager.rpcManager.GetEndpointHealth(0)

// Access metrics
success, failure, consecutive, isHealthy := health.GetStats()
fmt.Printf("Endpoint: %s\n", health.URL)
fmt.Printf("  Success: %d, Failure: %d, Consecutive Fails: %d\n",
	success, failure, consecutive)
fmt.Printf("  Healthy: %v, Response Time: %dms\n",
	isHealthy, health.ResponseTime.Milliseconds())
```
### Health Thresholds
- **Marked Unhealthy**: 3+ consecutive failures
- **Recovered**: Next successful call resets consecutive failures
- **Tracked Metrics**: Success count, failure count, response time
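The threshold rules above can be sketched as a small state machine (field and method names are illustrative, not the package's real tracker):

```go
package main

// health mirrors the consecutive-failure bookkeeping described above.
type health struct {
	consecutiveFails int
	healthy          bool
}

// unhealthyThreshold: 3+ consecutive failures marks an endpoint unhealthy.
const unhealthyThreshold = 3

func (h *health) recordFailure() {
	h.consecutiveFails++
	if h.consecutiveFails >= unhealthyThreshold {
		h.healthy = false
	}
}

func (h *health) recordSuccess() {
	h.consecutiveFails = 0 // one success resets the streak...
	h.healthy = true       // ...and recovers the endpoint
}
```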
## Performance Impact
### Load Distribution
With 3 endpoints using round-robin:
- **Without**: All calls hit endpoint 1 → potential rate limiting
- **With**: Calls distributed evenly → 3x throughput potential
### Response Times
Example with mixed endpoints:
```
Endpoint 1: 50ms avg
Endpoint 2: 200ms avg (poor)
Endpoint 3: 50ms avg
Health-Aware Strategy Results:
- Requests to 1: ~45%
- Requests to 2: ~5% (deprioritized)
- Requests to 3: ~50%
Success Rate: 99.8% (vs 95% without load balancing)
```
## Configuration
### Environment Variables
```bash
# Enable round-robin explicitly
export RPC_ROUNDROBIN_ENABLED=true
# Additional fallback endpoints (comma-separated)
export ARBITRUM_FALLBACK_ENDPOINTS="https://rpc1.io,https://rpc2.io,https://rpc3.io"
```
### Configuration File
```yaml
arbitrum:
  rpc_endpoint: "https://primary.rpc.io"
  rate_limit:
    requests_per_second: 5.0
    burst: 10
  reading_endpoints:
    - url: "https://read1.rpc.io"
    - url: "https://read2.rpc.io"
  execution_endpoints:
    - url: "https://execute1.rpc.io"
```
## Best Practices
### 1. Choose Right Rotation Policy
```go
// For equal-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.RoundRobin)

// For mixed-quality endpoints
connectionManager.SetRPCRotationPolicy(arbitrum.HealthAware)

// For endpoints with varying failure rates over time
connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)
```
### 2. Monitor Regularly
```go
// Periodic health checks
ticker := time.NewTicker(5 * time.Minute)
defer ticker.Stop()
for range ticker.C {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	if err := connectionManager.PerformRPCHealthCheck(ctx); err != nil {
		logger.Error(fmt.Sprintf("Health check failed: %v", err))
	}
	cancel()
}
```
### 3. Handle Errors Gracefully
```go
client, err := rrClient.GetClientForRead()
if err != nil {
	logger.Error(fmt.Sprintf("Failed to get RPC client: %v", err))
	// Implement fallback logic
	return nil, err
}

// Always record results
start := time.Now()
result, err := performRPCCall(client)
elapsed := time.Since(start)

if err != nil {
	rrClient.RecordReadFailure()
} else {
	rrClient.RecordReadSuccess(elapsed)
}
```
### 4. Optimize Batch Sizes
```go
// The RPC Manager works best with batch operations:
// reduce individual calls, increase batch size.

// ❌ Avoid: many small individual calls
for _, pool := range pools {
	data, _ := client.CallContract(ctx, call)
}

// ✅ Better: batch operations
batchFetcher.FetchPoolsBatch(ctx, pools)
```
## Troubleshooting
### All Endpoints Unhealthy
```
Error: "no healthy endpoints available"
Solution: Check endpoint status and logs
- Perform manual health check
- Verify network connectivity
- Check RPC provider status
- Review error logs for specific failures
```
### High Failure Rate
```go
stats := connectionManager.GetRPCManagerStats()
rateStr := stats["success_rate"].(string) // e.g. "97.76%"
rate, err := strconv.ParseFloat(strings.TrimSuffix(rateStr, "%"), 64)
if err == nil && rate < 95.0 {
	logger.Warn("High RPC failure rate detected")
	// Switch to a policy that avoids failure-prone endpoints
	connectionManager.SetRPCRotationPolicy(arbitrum.LeastFailures)
}
```
### Uneven Load Distribution
```go
// Check distribution
stats := connectionManager.GetRPCManagerStats()
details := stats["endpoint_details"].([]map[string]interface{})
for _, endpoint := range details {
	fmt.Printf("%s: %d requests\n",
		endpoint["url"],
		endpoint["success_count"])
}
```
## Metrics Reference
### Tracked Metrics
- **Total Requests**: Sum of all successful and failed calls
- **Success Rate**: Percentage of successful calls
- **Response Times**: Min, max, average per endpoint
- **Consecutive Failures**: Track for health status
- **Endpoint Status**: Healthy/Unhealthy state
### Export Format
```json
{
  "total_endpoints": 3,
  "healthy_count": 3,
  "total_requests": 15234,
  "total_success": 14892,
  "total_failure": 342,
  "success_rate": "97.76%",
  "current_policy": "health-aware",
  "endpoint_details": [
    {
      "index": 0,
      "url": "https://rpc1.io",
      "success_count": 5023,
      "failure_count": 89,
      "consecutive_fails": 0,
      "is_healthy": true,
      "last_checked": "2025-11-03T12:34:56Z",
      "response_time_ms": 45
    }
  ]
}
```
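Consumers can decode this export into typed structs; a sketch with field names assumed from the JSON keys above (the struct names themselves are illustrative):

```go
package main

import "encoding/json"

// rpcStats mirrors the JSON export shown above.
type rpcStats struct {
	TotalEndpoints int              `json:"total_endpoints"`
	HealthyCount   int              `json:"healthy_count"`
	TotalRequests  int              `json:"total_requests"`
	TotalSuccess   int              `json:"total_success"`
	TotalFailure   int              `json:"total_failure"`
	SuccessRate    string           `json:"success_rate"`
	CurrentPolicy  string           `json:"current_policy"`
	Endpoints      []endpointDetail `json:"endpoint_details"`
}

type endpointDetail struct {
	Index          int    `json:"index"`
	URL            string `json:"url"`
	SuccessCount   int    `json:"success_count"`
	FailureCount   int    `json:"failure_count"`
	IsHealthy      bool   `json:"is_healthy"`
	ResponseTimeMs int    `json:"response_time_ms"`
}

// parseStats decodes a stats export into the typed form.
func parseStats(raw []byte) (rpcStats, error) {
	var s rpcStats
	err := json.Unmarshal(raw, &s)
	return s, err
}
```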
## API Reference
### RPCManager
```go
type RPCManager struct
NewRPCManager(logger) *RPCManager
AddEndpoint(client, url) error
GetNextClient(ctx) (*RateLimitedClient, int, error)
RecordSuccess(idx, responseTime)
RecordFailure(idx)
GetEndpointHealth(idx) (*RPCEndpointHealth, error)
GetAllHealthStats() []map[string]interface{}
SetRotationPolicy(policy RotationPolicy)
HealthCheckAll(ctx) error
GetStats() map[string]interface{}
Close() error
```
### ConnectionManager Extensions
```go
func (cm *ConnectionManager) EnableRoundRobin(enabled bool)
func (cm *ConnectionManager) SetRPCRotationPolicy(policy RotationPolicy)
func (cm *ConnectionManager) GetRPCManagerStats() map[string]interface{}
func (cm *ConnectionManager) PerformRPCHealthCheck(ctx) error
```
### Helper Functions
```go
NewRoundRobinClient(manager, ctx, logger) *RoundRobinClient
InitializeRPCRoundRobin(cm, endpoints) error
ConfigureRPCLoadBalancing(cm, strategy) error
GetConnectionManagerWithRoundRobin(cfg, logger, endpoints) (*ConnectionManager, error)
```
## Future Enhancements
Planned improvements to RPC Manager:
1. **Weighted Round-Robin**: Assign weights based on historical performance
2. **Dynamic Endpoint Discovery**: Auto-discover and add new endpoints
3. **Regional Failover**: Prefer endpoints in same region for latency
4. **Cost Tracking**: Monitor and report RPC call costs
5. **Analytics Dashboard**: Real-time visualization of RPC metrics
6. **Adaptive Timeouts**: Adjust timeouts based on endpoint performance
7. **Request Queueing**: Smart queuing during RPC overload
## Conclusion
The RPC Manager provides enterprise-grade RPC endpoint management, enabling:
- **Reliability**: Automatic failover and health monitoring
- **Performance**: Optimized load distribution
- **Visibility**: Comprehensive metrics and statistics
- **Flexibility**: Multiple rotation strategies for different needs
For production deployments, RPC Manager is essential to prevent single-endpoint rate limiting and ensure robust transaction processing.