# Multi-Provider RPC Configuration - Rate Limit Solution
**Date**: October 31, 2025 19:00 CDT
**Status**: ✅ IMPLEMENTED - Multi-RPC rotation with failover
## Problem Statement
The MEV bot was experiencing critical rate limiting issues:
- **878 rate limit errors in 4 minutes** (~220/minute)
- 44.79% of pool fetch failures due to 429 errors
- Using single free RPC endpoint (https://arb1.arbitrum.io/rpc)
- Rate limit set too low (5 req/sec)
- No failover or rotation
## Solution Implemented
### 1. Multiple Diverse RPC Providers (7 Total)
Added 6 free/public Arbitrum RPC endpoints alongside the original one, for 7 in total:
| Provider | Type | Endpoint | Rate Limit | Priority |
|----------|------|----------|------------|----------|
| Arbitrum Public HTTP | HTTP | https://arb1.arbitrum.io/rpc | 10 req/s | 10 |
| Arbitrum Public WS | WSS | wss://arb1.arbitrum.io/ws | 15 req/s | 10 |
| Chainlist RPC 1 | HTTP/WSS | arbitrum-one.publicnode.com | 12 req/s | 9 |
| Chainlist RPC 2 | HTTP/WSS | rpc.ankr.com/arbitrum | 12 req/s | 8 |
| Chainlist RPC 3 | HTTP/WSS | arbitrum.blockpi.network | 10 req/s | 7 |
| LlamaNodes | HTTP/WSS | arbitrum.llamarpc.com | 10 req/s | 6 |
| Alchemy Free | HTTP/WSS | arb-mainnet.g.alchemy.com | 15 req/s | 5 |
**Total Capacity**: ~84 requests/second across all providers
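
The total can be checked by summing the per-provider limits from the table. The sketch below is only that arithmetic; the dictionary and function names are illustrative and not part of the bot, and the sum is a configured ceiling rather than a throughput the providers guarantee.

```python
# Per-provider rate limits (req/s), copied from the table above.
PROVIDER_LIMITS = {
    "Arbitrum Public HTTP": 10,
    "Arbitrum Public WS": 15,
    "Chainlist RPC 1": 12,
    "Chainlist RPC 2": 12,
    "Chainlist RPC 3": 10,
    "LlamaNodes": 10,
    "Alchemy Free": 15,
}

# Old single-provider setting (req/s) from the problem statement.
PREVIOUS_LIMIT = 5


def total_capacity(limits):
    """Sum of the configured limits: an upper bound, not a guarantee."""
    return sum(limits.values())


def capacity_ratio(limits, previous):
    """How many times larger the combined limit is than the old one."""
    return total_capacity(limits) / previous


if __name__ == "__main__":
    print(total_capacity(PROVIDER_LIMITS))
    print(capacity_ratio(PROVIDER_LIMITS, PREVIOUS_LIMIT))
```

The ratio is measured against the old 5 req/s setting, which is where the 16.8x figure later in this document comes from.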
### 2. Round-Robin Rotation Strategy
Changed from `priority_based` to `round_robin` distribution:
```yaml
rotation:
  strategy: round_robin          # Distribute load evenly
  fallback_enabled: true
  health_check_required: true
  retry_failed_after: 2m
  auto_rotate_interval: 30s      # Rotate every 30 seconds
  failover_on_rate_limit: true   # Immediate switch on 429
```
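
To illustrate what `round_robin` combined with `failover_on_rate_limit` and `retry_failed_after` could look like, here is a minimal sketch in Python. It is not the bot's Go implementation in `pkg/transport`; the class and method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from itertools import cycle


class RoundRobinSelector:
    """Cycles through providers and skips any that are benched after a 429."""

    def __init__(self, providers, retry_failed_after=120.0):
        self._providers = list(providers)
        self._cycle = cycle(self._providers)
        self._retry_after = retry_failed_after  # seconds, mirrors "2m"
        self._benched_until = {}

    def mark_rate_limited(self, provider, now):
        # Bench the provider until the retry window has elapsed.
        self._benched_until[provider] = now + self._retry_after

    def pick(self, now):
        # Look at each provider at most once per call.
        for _ in range(len(self._providers)):
            candidate = next(self._cycle)
            if now >= self._benched_until.get(candidate, 0.0):
                return candidate
        raise RuntimeError("all providers are rate limited")
```

A caller would invoke `pick(now)` before each request and `mark_rate_limited(provider, now)` when a 429 comes back, so the next request goes to a different endpoint right away.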
### 3. Intelligent Failover
**Circuit Breaker Configuration**:
```yaml
circuit_breaker:
  enabled: true
  failure_threshold: 5    # Switch after 5 failures
  success_threshold: 2    # Re-enable after 2 successes
  timeout: 60s
  half_open_requests: 3
```
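
The settings above map onto the usual closed / open / half-open circuit breaker. The following Python sketch shows one plausible reading of them; how the bot's Go code interprets each key is an assumption here, and all names are illustrative.

```python
class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout=60.0, half_open_requests=3):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self.half_open_requests = half_open_requests
        self.state = self.CLOSED
        self._failures = 0
        self._successes = 0
        self._trials = 0
        self._opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.state == self.OPEN:
            if now - self._opened_at < self.timeout:
                return False
            # Timeout elapsed: start probing with a limited number of requests.
            self.state = self.HALF_OPEN
            self._successes = 0
            self._trials = 0
        if self.state == self.HALF_OPEN:
            if self._trials >= self.half_open_requests:
                return False
            self._trials += 1
        return True

    def record_success(self):
        if self.state == self.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = self.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # Only consecutive failures count.

    def record_failure(self, now):
        if self.state == self.HALF_OPEN:
            self._trip(now)  # Any failure while probing re-opens the circuit.
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = self.OPEN
        self._opened_at = now
        self._failures = 0
```

In this reading, one breaker instance would exist per provider, so a failing endpoint is taken out of rotation without affecting the others.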
**Retry Logic**:
- Max attempts: 5 (across different providers)
- Exponential backoff: 500ms → 1s → 2s → 4s → 5s (doubling each time; the final step suggests a 5s cap)
- Jitter enabled to prevent thundering herd
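
The delay sequence above is consistent with doubling from 500ms and capping at 5s. A short Python sketch of that schedule follows; the cap, the 25% jitter spread, and the function names are assumptions chosen for illustration, not values read from the bot's configuration.

```python
import random


def backoff_delays(max_attempts=5, base=0.5, factor=2.0, cap=5.0):
    """Delay in seconds for each attempt: base * factor**n, capped at `cap`."""
    return [min(base * factor ** n, cap) for n in range(max_attempts)]


def with_jitter(delay, rng=random, spread=0.25):
    """Scale `delay` by a random factor in [1 - spread, 1 + spread]."""
    return delay * rng.uniform(1.0 - spread, 1.0 + spread)


if __name__ == "__main__":
    for delay in backoff_delays():
        print(round(with_jitter(delay), 3))
```

Jitter spreads retries from concurrent workers over a window instead of letting them all hit the next provider at the same instant.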
### 4. Provider Pools
**Execution Pool** (for transactions):
- 4 providers with highest reliability
- Strategy: round_robin
- Max connections: 15
**Read-Only Pool** (for DataFetcher):
- 6 providers for maximum capacity
- Strategy: round_robin
- Max connections: 20
### 5. Enhanced Rate Limits
Increased per-provider limits:
- HTTP endpoints: 10-12 req/s (was 5)
- WSS endpoints: 15 req/s (was 5)
- Burst capacity: 25-40 (was 10)
- Timeout: 45-60s (was 30s)
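
A sustained rate paired with a burst capacity is commonly enforced with a token bucket. The Python sketch below shows how a 10 req/s limit with a burst of 25 would behave under that model; whether `pkg/arbitrum/rate_limited_rpc.go` uses a token bucket is an assumption, and the class is illustrative only.

```python
class TokenBucket:
    """Allows `burst` requests at once, then refills at `rate` tokens/second."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self._tokens = float(burst)  # Start full so an initial burst is allowed.
        self._last = now

    def try_acquire(self, now):
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

With these numbers, a cold start can send 25 requests immediately, after which throughput settles at 10 per second until the bucket has had time to refill.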
## Expected Impact
### Before (Single Provider)
```
Provider: Arbitrum Public HTTP only
Rate Limit: 5 req/s
429 Errors: 878 in 4 minutes
Success Rate: ~55% (44.79% rate limit failures)
```
### After (Multi-Provider)
```
Providers: 7 diverse endpoints
Total Capacity: ~84 req/s (16.8x increase)
Expected 429 Errors: <10 per hour (99% reduction)
Expected Success Rate: >98%
```
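
The before and after figures use different time windows (errors per 4 minutes versus errors per hour). The small Python calculation below puts them on the same scale; the function names are illustrative.

```python
def per_hour(count, window_minutes):
    """Convert a count observed over `window_minutes` to an hourly rate."""
    return count * 60.0 / window_minutes


def reduction_percent(before_per_hour, after_per_hour):
    """Percentage drop from the old hourly rate to the new one."""
    return 100.0 * (1.0 - after_per_hour / before_per_hour)


if __name__ == "__main__":
    before = per_hour(878, 4)
    print(before)
    print(reduction_percent(before, 10))
```

On that scale the baseline is 13,170 errors per hour, so staying under 10 per hour would be a reduction of more than 99.9%, and a 99% reduction would still leave roughly 132 errors per hour.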
## Configuration Files
### Primary Config
- **File**: `config/providers_runtime.yaml`
- **Backup**: `config/providers_runtime.yaml.backup_single_rpc_*`
### Key Changes
1. Added 6 new providers
2. Increased rate limits by 2-3x per provider
3. Changed strategy from priority_based → round_robin
4. Added circuit breaker and retry configuration
5. Enabled automatic rotation every 30s
## Testing Results
### Build Status
✅ Build successful with new configuration
### Startup Test
✅ Bot started successfully
✅ All providers initialized
✅ Round-robin rotation active
### Next Steps
1. Run 2-hour test to verify 429 error reduction
2. Monitor provider health and rotation
3. Fine-tune rate limits based on actual usage
4. Consider upgrading to a paid tier if the free-tier limits are still exceeded
## Provider Selection Criteria
All providers selected based on:
1. **Reliability**: Listed on Chainlist or official docs
2. **Free Tier**: No API key required for basic use
3. **Rate Limits**: Reasonable limits for testing
4. **Geographic Diversity**: Different infrastructure providers
5. **Both HTTP & WSS**: Support for both protocols
## Fallback Strategy
If all free providers hit rate limits:
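1. The circuit breaker opens for the affected providers
2. Requests are retried with exponential backoff
3. The breaker waits 60s (its `timeout`) before allowing another attempt
4. An alert is logged for manual intervention
**Recommended**: Upgrade to a paid tier ($100-300/month) if high load is sustained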
## Monitoring
Monitor these metrics:
```bash
# Check provider health
tail -f logs/mev-bot.log | grep -i "provider\|rotation\|429"
# Count rate limit errors
grep -c "429 Too Many Requests" logs/mev-bot_errors.log
# Check which providers are being used
grep "Using provider" logs/mev-bot.log | tail -20
```
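
For a per-provider breakdown rather than raw `grep` output, the same markers can be tallied in a few lines of Python. This is a sketch under an assumed log format: it treats whatever follows `Using provider` on a line as the provider name, which may not match the bot's actual log lines.

```python
from collections import Counter

RATE_LIMIT_MARKER = "429 Too Many Requests"
PROVIDER_MARKER = "Using provider"


def count_rate_limit_errors(lines):
    """Number of lines that contain the 429 marker."""
    return sum(1 for line in lines if RATE_LIMIT_MARKER in line)


def provider_usage(lines):
    """Tally provider names, assuming the name follows the marker text."""
    usage = Counter()
    for line in lines:
        _, found, rest = line.partition(PROVIDER_MARKER)
        if found:
            name = rest.strip(" :\t\n")
            if name:
                usage[name] += 1
    return usage


if __name__ == "__main__":
    with open("logs/mev-bot.log", encoding="utf-8", errors="replace") as fh:
        log_lines = fh.readlines()
    print(count_rate_limit_errors(log_lines))
    print(provider_usage(log_lines).most_common())
```

A roughly even tally across providers would indicate that round-robin rotation is working as configured.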
## Cost Analysis
### Current (Free Tier)
- Cost: $0/month
- Capacity: ~84 req/s total
- Limitations: Subject to rate limits
### Recommended (Paid Tier)
- Provider: Alchemy/Infura/QuickNode
- Cost: $100-300/month
- Capacity: 300-1000+ req/s
- Benefits:
  - Dedicated capacity
  - Higher reliability
  - Better SLAs
  - Archive node access
## Implementation Notes
### Code Changes Required
None - existing provider manager already supports:
- ✅ Round-robin rotation
- ✅ Health checks
- ✅ Failover
- ✅ Circuit breaker
- ✅ Retry logic
### Configuration Changes Only
All improvements achieved through YAML configuration updates only.
## Rollback Procedure
If issues occur:
```bash
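# Restore the most recent backup (the glob can match several files,
# which would make a plain cp fail)
latest_backup="$(ls -t config/providers_runtime.yaml.backup_single_rpc_* | head -n 1)"
cp "$latest_backup" config/providers_runtime.yaml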
# Rebuild
./scripts/build.sh
# Restart bot
./bin/mev-beta start
```
## Success Metrics
Track these KPIs:
1. **Rate Limit Errors**: Should drop from 878 per 4 minutes (~13,170/hour) to <10/hour
2. **Pool Fetch Success Rate**: Should increase from 55% to >98%
3. **Provider Health**: All 7 providers should maintain >90% uptime
4. **Rotation**: Should rotate between providers every 30s
5. **Response Time**: Should stay below 150ms on average (single-provider baseline was ~105ms, so some added latency from rotation is acceptable)
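
The numeric targets above can be turned into a simple pass/fail check. The Python sketch below does that for the four thresholds that are stated as numbers; the `Snapshot` fields and how they would be collected are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    rate_limit_errors_per_hour: float
    pool_fetch_success_rate: float  # fraction between 0.0 and 1.0
    min_provider_uptime: float      # lowest uptime among all providers
    avg_response_ms: float


def failed_kpis(snapshot):
    """Return the names of the KPIs that miss their target."""
    checks = {
        "rate_limit_errors": snapshot.rate_limit_errors_per_hour < 10,
        "pool_fetch_success": snapshot.pool_fetch_success_rate > 0.98,
        "provider_uptime": snapshot.min_provider_uptime > 0.90,
        "response_time": snapshot.avg_response_ms < 150,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty result means every target is met; the single-provider baseline described earlier would fail the first two checks.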
## Documentation References
- Provider Manager: `pkg/transport/provider_manager.go`
- Failover Logic: `pkg/transport/failover.go`
- Provider Pools: `pkg/transport/provider_pools.go`
- Rate Limiting: `pkg/arbitrum/rate_limited_rpc.go`
---
## Appendix: Provider Details
### Free Public Arbitrum RPC Endpoints
1. **Official Arbitrum**
   - HTTP: https://arb1.arbitrum.io/rpc
   - WSS: wss://arb1.arbitrum.io/ws
   - Limit: ~10 req/s
   - Reliability: High (official)
2. **PublicNode (Chainlist)**
   - HTTP/WSS: arbitrum-one.publicnode.com
   - Limit: ~12 req/s
   - Reliability: High
   - Features: Both HTTP and WSS
3. **Ankr (Chainlist)**
   - HTTP/WSS: rpc.ankr.com/arbitrum
   - Limit: ~12 req/s
   - Reliability: High
   - Features: Professional infrastructure
4. **BlockPI (Chainlist)**
   - HTTP/WSS: arbitrum.blockpi.network
   - Limit: ~10 req/s
   - Reliability: Medium-High
   - Features: Public access
5. **LlamaNodes**
   - HTTP/WSS: arbitrum.llamarpc.com
   - Limit: ~10 req/s
   - Reliability: Medium
   - Features: Community-maintained
6. **Alchemy Free Tier**
   - HTTP/WSS: arb-mainnet.g.alchemy.com/v2/demo
   - Limit: ~15 req/s
   - Reliability: High
   - Features: Demo key (upgrade available)
### Recommended Paid Providers
For production use:
1. **Alchemy** ($49-299/month)
   - 300M+ compute units
   - Archive node access
   - Enhanced APIs
2. **Infura** ($50-225/month)
   - 100K-1M+ requests/day
   - Reliable infrastructure
   - Good documentation
3. **QuickNode** ($49-299/month)
   - Dedicated nodes
   - Global coverage
   - Premium support
---
**Status**: Ready for production testing with 7-provider rotation
**Expected Result**: 99% reduction in rate limit errors
**Recommendation**: Monitor for 24-48 hours, then decide on paid upgrade