mev-beta/docs/RPC_MULTI_PROVIDER_SETUP_20251031.md

# Multi-Provider RPC Configuration - Rate Limit Solution
**Date**: October 31, 2025 19:00 CDT
**Status**: ✅ IMPLEMENTED - Multi-RPC rotation with failover

## Problem Statement
The MEV bot was hitting critical rate limits:
- **878 rate limit errors in 4 minutes** (~220/minute)
- 44.79% of pool fetches failed with HTTP 429 errors
- A single free RPC endpoint (https://arb1.arbitrum.io/rpc) served all traffic
- The configured rate limit was too low (5 req/s)
- No failover or rotation was configured

## Solution Implemented

### 1. Multiple Diverse RPC Providers (7 Total)

Added 6 free/public Arbitrum RPC endpoints alongside the original Arbitrum HTTP endpoint, for 7 in total:

| Provider | Type | Endpoint | Rate Limit | Priority |
|----------|------|----------|------------|----------|
| Arbitrum Public HTTP | HTTP | https://arb1.arbitrum.io/rpc | 10 req/s | 10 |
| Arbitrum Public WS | WSS | wss://arb1.arbitrum.io/ws | 15 req/s | 10 |
| Chainlist RPC 1 | HTTP/WSS | arbitrum-one.publicnode.com | 12 req/s | 9 |
| Chainlist RPC 2 | HTTP/WSS | rpc.ankr.com/arbitrum | 12 req/s | 8 |
| Chainlist RPC 3 | HTTP/WSS | arbitrum.blockpi.network | 10 req/s | 7 |
| LlamaNodes | HTTP/WSS | arbitrum.llamarpc.com | 10 req/s | 6 |
| Alchemy Free | HTTP/WSS | arb-mainnet.g.alchemy.com | 15 req/s | 5 |

**Total Capacity**: ~84 requests/second across all providers (the sum of the configured per-provider limits, so a ceiling rather than a guarantee)
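
The total can be checked by summing the table's Rate Limit column. The sketch below does that in Go and compares the result with the previous 5 req/s setting. The `provider` struct and function names are illustrative and do not come from the provider manager.

```go
package main

import "fmt"

// provider mirrors one row of the table above (illustrative type).
type provider struct {
	name     string
	reqPerS  int
	priority int
}

// tableProviders returns the seven rows from the table above.
func tableProviders() []provider {
	return []provider{
		{"Arbitrum Public HTTP", 10, 10},
		{"Arbitrum Public WS", 15, 10},
		{"Chainlist RPC 1", 12, 9},
		{"Chainlist RPC 2", 12, 8},
		{"Chainlist RPC 3", 10, 7},
		{"LlamaNodes", 10, 6},
		{"Alchemy Free", 15, 5},
	}
}

// totalCapacity sums the configured per-provider limits.
func totalCapacity(ps []provider) int {
	total := 0
	for _, p := range ps {
		total += p.reqPerS
	}
	return total
}

// capacityRatio compares the new total with the old single-provider limit.
func capacityRatio(total, baseline int) float64 {
	return float64(total) / float64(baseline)
}

func main() {
	total := totalCapacity(tableProviders())
	fmt.Printf("total: %d req/s, %.1fx the old 5 req/s limit\n",
		total, capacityRatio(total, 5))
}
```

These are client-side limits. The throughput each provider actually grants may be lower.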

### 2. Round-Robin Rotation Strategy

Changed from `priority_based` to `round_robin` distribution:

```yaml
rotation:
  strategy: round_robin  # Distribute load evenly
  fallback_enabled: true
  health_check_required: true
  retry_failed_after: 2m
  auto_rotate_interval: 30s  # Rotate every 30 seconds
  failover_on_rate_limit: true  # Immediate switch on 429
```
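
As a rough illustration of what `round_robin` with `failover_on_rate_limit` implies, here is a minimal Go sketch. The `rotator` type and its methods are hypothetical and are not the API of `pkg/transport/provider_manager.go`. Re-enabling a provider after `retry_failed_after` and the 30s timed rotation are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// rotator hands out providers in round-robin order, skipping unhealthy ones.
type rotator struct {
	mu        sync.Mutex
	providers []string
	healthy   map[string]bool
	next      int
}

func newRotator(providers []string) *rotator {
	h := make(map[string]bool, len(providers))
	for _, p := range providers {
		h[p] = true
	}
	return &rotator{providers: providers, healthy: h}
}

// pick returns the next healthy provider, or false if none is healthy.
func (r *rotator) pick() (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i := 0; i < len(r.providers); i++ {
		p := r.providers[r.next]
		r.next = (r.next + 1) % len(r.providers)
		if r.healthy[p] {
			return p, true
		}
	}
	return "", false
}

// markRateLimited takes a provider out of rotation after a 429.
func (r *rotator) markRateLimited(p string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.healthy[p] = false
}

func main() {
	r := newRotator([]string{"a", "b", "c"})
	r.markRateLimited("b") // simulate a 429 from provider "b"
	for i := 0; i < 4; i++ {
		p, _ := r.pick()
		fmt.Println(p) // cycles through "a" and "c" only
	}
}
```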

### 3. Intelligent Failover

**Circuit Breaker Configuration**:
```yaml
circuit_breaker:
  enabled: true
  failure_threshold: 5  # Switch after 5 failures
  success_threshold: 2  # Re-enable after 2 successes
  timeout: 60s
  half_open_requests: 3
```
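
The state machine these settings describe can be sketched as follows. This is an assumption-laden illustration, not the bot's implementation: the `breaker` type is hypothetical, it counts consecutive failures, and it does not model the `half_open_requests` cap on concurrent probes.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker is a single-goroutine sketch; shared use would need a mutex.
type breaker struct {
	failureThreshold int           // failure_threshold
	successThreshold int           // success_threshold
	timeout          time.Duration // timeout

	st        state
	failures  int
	successes int
	openedAt  time.Time
}

// allow reports whether a request may be sent at time now.
func (b *breaker) allow(now time.Time) bool {
	if b.st == open && now.Sub(b.openedAt) >= b.timeout {
		b.st = halfOpen // timeout elapsed: let probe requests through
		b.successes = 0
	}
	return b.st != open
}

// record updates the breaker with the outcome of one request.
func (b *breaker) record(ok bool, now time.Time) {
	switch {
	case !ok && b.st == halfOpen:
		b.trip(now) // a failed probe re-opens the circuit
	case !ok:
		b.failures++
		if b.failures >= b.failureThreshold {
			b.trip(now)
		}
	case b.st == halfOpen:
		b.successes++
		if b.successes >= b.successThreshold {
			b.st = closed
			b.failures = 0
		}
	default:
		b.failures = 0 // a success while closed resets the failure streak
	}
}

func (b *breaker) trip(now time.Time) {
	b.st = open
	b.openedAt = now
	b.failures = 0
}

// runScenario replays 5 failures, a request at +30s, and two successful probes at +60s.
func runScenario() (blockedAt30s, allowedAt60s, closedAgain bool) {
	b := &breaker{failureThreshold: 5, successThreshold: 2, timeout: 60 * time.Second}
	t0 := time.Unix(0, 0)
	for i := 0; i < 5; i++ {
		b.record(false, t0)
	}
	blockedAt30s = !b.allow(t0.Add(30 * time.Second))
	t1 := t0.Add(60 * time.Second)
	allowedAt60s = b.allow(t1)
	b.record(true, t1)
	b.record(true, t1)
	closedAgain = b.st == closed
	return
}

func main() {
	fmt.Println(runScenario())
}
```

Passing `now` explicitly keeps the sketch deterministic, so it can be tested without sleeping.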

**Retry Logic**:
- Max attempts: 5 (across different providers)
- Exponential backoff: 500ms → 1s → 2s → 4s → 5s (doubling each attempt, capped at 5s)
- Jitter enabled to prevent thundering herd
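
The schedule above is consistent with a doubling delay capped at 5s. A minimal Go sketch, assuming that reading, follows. The source does not say which form of jitter is used. The "equal jitter" variant here (half fixed, half random) is an assumption, and all names are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	initialDelay = 500 * time.Millisecond
	maxDelay     = 5 * time.Second
	maxAttempts  = 5
)

// backoff returns the base delay for a 0-based attempt: initialDelay * 2^attempt, capped at maxDelay.
func backoff(attempt int) time.Duration {
	d := initialDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter keeps half of d fixed and randomizes the other half (assumed jitter scheme).
func withJitter(d time.Duration, rng *rand.Rand) time.Duration {
	half := d / 2
	return half + time.Duration(rng.Int63n(int64(half)+1))
}

// baseScheduleMs returns the un-jittered delays in milliseconds.
func baseScheduleMs() []int64 {
	out := make([]int64, 0, maxAttempts)
	for a := 0; a < maxAttempts; a++ {
		out = append(out, backoff(a).Milliseconds())
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	for a := 0; a < maxAttempts; a++ {
		base := backoff(a)
		fmt.Printf("attempt %d: base %v, jittered %v\n", a+1, base, withJitter(base, rng))
	}
}
```

Jitter spreads retries from concurrent goroutines over time, so they do not all hit the next provider at the same instant.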

### 4. Provider Pools

**Execution Pool** (for transactions):
- 4 providers with highest reliability
- Strategy: round_robin
- Max connections: 15

**Read-Only Pool** (for DataFetcher):
- 6 providers for maximum capacity
- Strategy: round_robin
- Max connections: 20

### 5. Enhanced Rate Limits

Increased per-provider limits:
- HTTP endpoints: 10-12 req/s (was 5)
- WSS endpoints: 15 req/s (was 5)
- Burst capacity: 25-40 (was 10)
- Timeout: 45-60s (was 30s)
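
To show how a per-second limit and a burst capacity interact, here is a small token-bucket sketch in Go using values from the ranges above (10 req/s, burst 25). It is a hypothetical illustration and not the code in `pkg/arbitrum/rate_limited_rpc.go`. Production code would more likely rely on an existing limiter such as `golang.org/x/time/rate`.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bucket is a minimal token bucket: it refills at rate tokens per second up to burst.
type bucket struct {
	rate   float64
	burst  float64
	tokens float64
	last   time.Time
}

func newBucket(rate, burst float64, now time.Time) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow consumes one token if available and reports whether the request may proceed.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// simulate fires 30 requests at once, then 15 more one second later,
// and returns how many were allowed in each batch.
func simulate() (first, second int) {
	t0 := time.Unix(0, 0)
	b := newBucket(10, 25, t0)
	for i := 0; i < 30; i++ {
		if b.allow(t0) {
			first++
		}
	}
	t1 := t0.Add(time.Second)
	for i := 0; i < 15; i++ {
		if b.allow(t1) {
			second++
		}
	}
	return first, second
}

func main() {
	first, second := simulate()
	fmt.Printf("allowed at t=0s: %d, allowed at t=1s: %d\n", first, second)
}
```

The first batch is bounded by the burst size, and the second by the tokens refilled during the one-second gap.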

## Expected Impact

### Before (Single Provider)
```
Provider: Arbitrum Public HTTP only
Rate Limit: 5 req/s
429 Errors: 878 in 4 minutes
Success Rate: ~55% (44.79% rate limit failures)
```

### After (Multi-Provider)
```
Providers: 7 diverse endpoints
Total Capacity: ~84 req/s (16.8x increase)
Expected 429 Errors: <10 per hour (>99% reduction from ~13,170/hour)
Expected Success Rate: >98%
```

## Configuration Files

### Primary Config
- **File**: `config/providers_runtime.yaml`
- **Backup**: `config/providers_runtime.yaml.backup_single_rpc_*`

### Key Changes
1. Added 6 new providers
2. Increased rate limits by 2-3x per provider
3. Changed strategy from priority_based → round_robin
4. Added circuit breaker and retry configuration
5. Enabled automatic rotation every 30s

## Testing Results

### Build Status
✅ Build successful with new configuration

### Startup Test
✅ Bot started successfully
✅ All providers initialized
✅ Round-robin rotation active

### Next Steps
1. Run 2-hour test to verify 429 error reduction
2. Monitor provider health and rotation
3. Fine-tune rate limits based on actual usage
4. Consider upgrading to a paid tier if the free-tier limits are still exceeded

## Provider Selection Criteria

All providers were selected based on:
1. **Reliability**: Listed on Chainlist or official docs
2. **Free Tier**: No API key required for basic use
3. **Rate Limits**: Reasonable limits for testing
4. **Infrastructure Diversity**: Different infrastructure providers
5. **Both HTTP & WSS**: Support for both protocols

## Fallback Strategy

If all free providers hit rate limits:
1. The circuit breaker opens
2. Requests are retried with exponential backoff
3. Once the 60s circuit breaker timeout expires, the providers are tried again
4. An alert is logged for manual intervention

**Recommended**: Upgrade to a paid tier ($100-300/month) if load stays high
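
The "all providers exhausted" case can be sketched as a loop that tries each provider once and returns a sentinel error when every call fails. The caller can then wait and raise the alert. The names `tryAll`, `callFunc`, `errRateLimited`, and `errAllExhausted` are hypothetical and are not taken from `pkg/transport/failover.go`.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRateLimited  = errors.New("429 too many requests")
	errAllExhausted = errors.New("all providers failed")
)

// callFunc stands in for a single RPC call against one provider.
type callFunc func(provider string) error

// tryAll attempts the call once per provider, in order, and stops at the first success.
func tryAll(providers []string, call callFunc) (string, error) {
	for _, p := range providers {
		if err := call(p); err == nil {
			return p, nil
		}
	}
	return "", errAllExhausted
}

func main() {
	providers := []string{"a", "b", "c"}

	// Every provider is rate limited: the caller should back off and alert.
	_, err := tryAll(providers, func(string) error { return errRateLimited })
	if errors.Is(err, errAllExhausted) {
		fmt.Println("ALERT: all providers rate limited, waiting before retry")
	}

	// Only provider "c" answers.
	used, _ := tryAll(providers, func(p string) error {
		if p == "c" {
			return nil
		}
		return errRateLimited
	})
	fmt.Println("served by:", used)
}
```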

## Monitoring

Monitor these metrics:
```bash
# Check provider health
tail -f logs/mev-bot.log | grep -i "provider\|rotation\|429"

# Count rate limit errors
grep -c "429 Too Many Requests" logs/mev-bot_errors.log

# Check which providers are being used
grep "Using provider" logs/mev-bot.log | tail -20
```

## Cost Analysis

### Current (Free Tier)
- Cost: $0/month
- Capacity: ~84 req/s total
- Limitations: Subject to rate limits

### Recommended (Paid Tier)
- Provider: Alchemy/Infura/QuickNode
- Cost: $100-300/month
- Capacity: 300-1000+ req/s
- Benefits:
  - Dedicated capacity
  - Higher reliability
  - Better SLAs
  - Archive node access

## Implementation Notes

### Code Changes Required
None - existing provider manager already supports:
- ✅ Round-robin rotation
- ✅ Health checks
- ✅ Failover
- ✅ Circuit breaker
- ✅ Retry logic

### Configuration Changes Only
All improvements were achieved through YAML configuration updates alone.

## Rollback Procedure

If issues occur:
```bash
# Restore previous config (uses the newest backup if several match the pattern)
cp "$(ls -t config/providers_runtime.yaml.backup_single_rpc_* | head -n 1)" config/providers_runtime.yaml

# Rebuild
./scripts/build.sh

# Restart bot
./bin/mev-beta start
```

## Success Metrics

Track these KPIs:
1. **Rate Limit Errors**: Should drop from 878/4min to <10/hour
2. **Pool Fetch Success Rate**: Should increase from 55% to >98%
3. **Provider Health**: All 7 providers should maintain >90% uptime
4. **Rotation**: Should rotate between providers every 30s
5. **Response Time**: Should stay below 150ms on average (single-provider baseline: ~105ms)
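
The numeric targets above can be turned into a simple check. The Go sketch below covers KPIs 1, 2, and 5 and uses the single-provider figures quoted in this document as its sample input. The `snapshot` type and its field names are illustrative and are not part of the bot's metrics code. Provider uptime and rotation are not modelled.

```go
package main

import (
	"fmt"
	"time"
)

// snapshot holds counters sampled over one observation window (illustrative type).
type snapshot struct {
	window          time.Duration
	rateLimitErrors int
	fetchOK         int
	fetchTotal      int
	avgLatency      time.Duration
}

// errorsPerHour scales the error count to an hourly rate.
func (s snapshot) errorsPerHour() float64 {
	return float64(s.rateLimitErrors) / s.window.Hours()
}

// successRate returns the pool fetch success rate in percent.
func (s snapshot) successRate() float64 {
	if s.fetchTotal == 0 {
		return 0
	}
	return 100 * float64(s.fetchOK) / float64(s.fetchTotal)
}

// failedTargets lists which of the numeric KPI targets the snapshot misses.
func (s snapshot) failedTargets() []string {
	var failed []string
	if s.errorsPerHour() >= 10 {
		failed = append(failed, "rate limit errors")
	}
	if s.successRate() <= 98 {
		failed = append(failed, "pool fetch success rate")
	}
	if s.avgLatency >= 150*time.Millisecond {
		failed = append(failed, "response time")
	}
	return failed
}

// baseline reproduces the single-provider figures quoted in this document.
func baseline() snapshot {
	return snapshot{
		window:          4 * time.Minute,
		rateLimitErrors: 878,
		fetchOK:         55, // ~55% success rate, expressed per 100 fetches
		fetchTotal:      100,
		avgLatency:      105 * time.Millisecond,
	}
}

func main() {
	b := baseline()
	fmt.Printf("%.0f errors/hour, %.0f%% success, failing: %v\n",
		b.errorsPerHour(), b.successRate(), b.failedTargets())
}
```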

## Documentation References

- Provider Manager: `pkg/transport/provider_manager.go`
- Failover Logic: `pkg/transport/failover.go`
- Provider Pools: `pkg/transport/provider_pools.go`
- Rate Limiting: `pkg/arbitrum/rate_limited_rpc.go`

---

## Appendix: Provider Details

### Free Public Arbitrum RPC Endpoints

1. **Official Arbitrum**
   - HTTP: https://arb1.arbitrum.io/rpc
   - WSS: wss://arb1.arbitrum.io/ws
   - Limit: ~10 req/s
   - Reliability: High (official)

2. **PublicNode (Chainlist)**
   - HTTP/WSS: arbitrum-one.publicnode.com
   - Limit: ~12 req/s
   - Reliability: High
   - Features: Both HTTP and WSS

3. **Ankr (Chainlist)**
   - HTTP/WSS: rpc.ankr.com/arbitrum
   - Limit: ~12 req/s
   - Reliability: High
   - Features: Professional infrastructure

4. **BlockPI (Chainlist)**
   - HTTP/WSS: arbitrum.blockpi.network
   - Limit: ~10 req/s
   - Reliability: Medium-High
   - Features: Public access

5. **LlamaNodes**
   - HTTP/WSS: arbitrum.llamarpc.com
   - Limit: ~10 req/s
   - Reliability: Medium
   - Features: Community-maintained

6. **Alchemy Free Tier**
   - HTTP/WSS: arb-mainnet.g.alchemy.com/v2/demo
   - Limit: ~15 req/s
   - Reliability: High
   - Features: Demo key (upgrade available)

### Recommended Paid Providers

For production use:

1. **Alchemy** ($49-299/month)
   - 300M+ compute units
   - Archive node access
   - Enhanced APIs

2. **Infura** ($50-225/month)
   - 100K-1M+ requests/day
   - Reliable infrastructure
   - Good documentation

3. **QuickNode** ($49-299/month)
   - Dedicated nodes
   - Global coverage
   - Premium support

---

**Status**: Ready for production testing with 7-provider rotation
**Expected Result**: >99% reduction in rate limit errors
**Recommendation**: Monitor for 24-48 hours, then decide on paid upgrade