# Prometheus Monitoring Setup

Complete guide for production monitoring with Prometheus and Grafana.

## Table of Contents

1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Metrics Exposed](#metrics-exposed)
4. [Prometheus Configuration](#prometheus-configuration)
5. [Grafana Dashboards](#grafana-dashboards)
6. [Alert Rules](#alert-rules)
7. [Production Deployment](#production-deployment)
8. [Query Examples](#query-examples)
9. [Troubleshooting](#troubleshooting)

---

## Overview

The MEV Bot V2 exposes comprehensive Prometheus metrics for production monitoring and observability. All metrics follow Prometheus best practices with proper naming, labeling, and types.

**Metrics Endpoint**: `http://localhost:8080/metrics`

**Metric Categories**:
- **Sequencer**: Message reception, parsing, validation
- **Arbitrage**: Opportunity detection and execution
- **Performance**: Latency histograms for critical operations
- **Cache**: Pool cache hits/misses and size
- **RPC**: Connection pool metrics
- **Mempool**: Transaction monitoring

---

## Quick Start

### 1. Start the MEV Bot

The bot automatically exposes metrics on port 8080:

```bash
# Using Docker Compose (recommended)
docker-compose up -d mev-bot

# Or standalone container
podman run -d \
  --name mev-bot \
  -p 8080:8080 \
  -e RPC_URL=https://arb1.arbitrum.io/rpc \
  -e WS_URL=wss://arb1.arbitrum.io/ws \
  mev-bot-v2:latest
```

### 2. Verify Metrics Endpoint

```bash
curl http://localhost:8080/metrics
```

You should see output like:

```
# HELP mev_sequencer_messages_received_total Total number of messages received from Arbitrum sequencer feed
# TYPE mev_sequencer_messages_received_total counter
mev_sequencer_messages_received_total 1234

# HELP mev_parse_latency_seconds Time taken to parse a transaction
# TYPE mev_parse_latency_seconds histogram
mev_parse_latency_seconds_bucket{le="0.001"} 450
mev_parse_latency_seconds_bucket{le="0.005"} 890
...
```

### 3. Start Prometheus

```bash
# Using provided configuration
docker-compose up -d prometheus
```
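Optionally, before moving on to Grafana, confirm Prometheus is actually scraping the bot. A minimal sanity check, assuming the ports and job name used throughout this guide (`8080` for the bot, `9090` for Prometheus):

```bash
# Count how many mev_* metric families the bot currently exposes
curl -s http://localhost:8080/metrics | grep -c '^mev_'

# Ask Prometheus itself about target health; every "health" field
# should read "up" once the first scrape has succeeded
curl -s 'http://localhost:9090/api/v1/targets' | grep -o '"health":"[a-z]*"'
```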
### 4. Start Grafana

```bash
# Access at http://localhost:3000
docker-compose up -d grafana
```

**Default Credentials**: `admin` / `admin` (change on first login)

---

## Metrics Exposed

### Sequencer Metrics

#### Counters

```
mev_sequencer_messages_received_total
  Total number of messages received from Arbitrum sequencer feed

mev_sequencer_transactions_processed_total
  Total number of transactions processed from sequencer

mev_sequencer_parse_errors_total
  Total number of parsing errors

mev_sequencer_validation_errors_total
  Total number of validation errors

mev_sequencer_swaps_detected_total
  Total number of swap events detected
  Labels: protocol, version, type
```

#### Histograms

```
mev_parse_latency_seconds
  Time taken to parse a transaction
  Buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s

mev_detection_latency_seconds
  Time taken to detect arbitrage opportunities
  Buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s

mev_execution_latency_seconds
  Time taken to execute an arbitrage transaction
  Buckets: 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s
```

### Arbitrage Metrics

```
mev_opportunities_total
  Total number of arbitrage opportunities detected
  Labels: type (arbitrage, frontrun, backrun)

mev_executions_attempted_total
  Total number of execution attempts

mev_executions_successful_total
  Total number of successful executions

mev_executions_failed_total
  Total number of failed executions
  Labels: reason (gas_price, slippage, revert, timeout)

mev_profit_eth_total
  Total profit in ETH across all successful executions

mev_gas_cost_eth_total
  Total gas cost in ETH across all executions
```

### Pool Cache Metrics

```
mev_pool_cache_hits_total
  Total number of cache hits

mev_pool_cache_misses_total
  Total number of cache misses

mev_pool_cache_size
  Current number of pools in cache (gauge)

mev_pool_cache_updates_total
  Total number of cache updates

mev_pool_cache_evictions_total
  Total number of cache evictions
```

### RPC Metrics

```
mev_rpc_requests_total
  Total number of RPC requests
  Labels: method (eth_call, eth_getBalance, etc.)

mev_rpc_errors_total
  Total number of RPC errors
  Labels: method, error_type

mev_rpc_latency_seconds
  RPC request latency histogram
  Labels: method
```
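Each histogram above also exports `_sum` and `_count` series (standard Prometheus client behavior), which makes mean latency easy to derive alongside the percentile queries shown later. A quick sketch, assuming the metric names documented above:

```promql
# Mean parse latency over the last 5 minutes
# (total observed seconds divided by number of observations)
rate(mev_parse_latency_seconds_sum[5m])
/
rate(mev_parse_latency_seconds_count[5m])

# Per-protocol swap detection rate, using the labels documented above
sum by (protocol) (rate(mev_sequencer_swaps_detected_total[5m]))
```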
---

## Prometheus Configuration

### prometheus.yml

Create `config/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

  # Attach labels to all time series
  external_labels:
    monitor: 'mev-bot-prod'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load and evaluate rules
rule_files:
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # MEV Bot metrics
  - job_name: 'mev-bot'
    static_configs:
      - targets: ['mev-bot:8080']
        labels:
          service: 'mev-bot'
          component: 'main'
    # Scrape interval for high-frequency metrics
    scrape_interval: 5s
    scrape_timeout: 4s
    # Relabeling
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'mev-bot-v2'

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          service: 'system'
```

### Docker Compose Integration

Add to your `docker-compose.yml`:

```yaml
version: '3.8'

services:
  mev-bot:
    image: mev-bot-v2:latest
    container_name: mev-bot
    ports:
      - "8080:8080"  # Metrics endpoint
    environment:
      - RPC_URL=https://arb1.arbitrum.io/rpc
      - WS_URL=wss://arb1.arbitrum.io/ws
      - METRICS_PORT=8080
    networks:
      - monitoring
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./config/prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
      - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    networks:
      - monitoring
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'  # Use the host root mounted below
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    networks:
      - monitoring
    restart: unless-stopped

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus-data:
  grafana-data:
```
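Before (re)starting the stack, it is worth validating the configuration offline. A minimal sketch using the `promtool` binary bundled in the `prom/prometheus` image; paths assume the directory layout from this guide:

```bash
# Validate prometheus.yml (and the rule files it references)
# without starting the server; exits non-zero on errors
docker run --rm \
  -v "$(pwd)/config/prometheus:/etc/prometheus:ro" \
  --entrypoint promtool \
  prom/prometheus:latest \
  check config /etc/prometheus/prometheus.yml
```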
---

## Grafana Dashboards

### Automatic Dashboard Provisioning

Create `config/grafana/provisioning/dashboards/dashboard.yml`:

```yaml
apiVersion: 1

providers:
  - name: 'MEV Bot Dashboards'
    orgId: 1
    folder: 'MEV Bot'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
```

Create `config/grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "5s"
```

### Dashboard JSON

Create `config/grafana/dashboards/mev-bot-overview.json`:

```json
{
  "dashboard": {
    "title": "MEV Bot V2 - Overview",
    "tags": ["mev", "arbitrage", "production"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Messages Received Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(mev_sequencer_messages_received_total[1m])",
            "legendFormat": "Messages/sec"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "id": 2,
        "title": "Parse Latency (P95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(mev_parse_latency_seconds_bucket[5m]))",
            "legendFormat": "P95 Parse Latency"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "id": 3,
        "title": "Opportunities by Type",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(mev_opportunities_total[5m])",
            "legendFormat": "{{type}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 12, "h": 8}
      },
      {
        "id": 4,
        "title": "Execution Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(mev_executions_successful_total[5m]) / rate(mev_executions_attempted_total[5m]) * 100",
            "legendFormat": "Success %"
          }
        ],
        "gridPos": {"x": 12, "y": 8, "w": 6, "h": 8}
      },
      {
        "id": 5,
        "title": "Total Profit (ETH)",
        "type": "stat",
        "targets": [
          {
            "expr": "mev_profit_eth_total",
            "legendFormat": "Total Profit"
          }
        ],
        "gridPos": {"x": 18, "y": 8, "w": 6, "h": 8}
      }
    ],
    "refresh": "5s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
```

---

## Alert Rules

Create `config/prometheus/alerts/mev-bot-alerts.yml`:

```yaml
groups:
  - name: mev_bot_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighParseErrorRate
        expr: rate(mev_sequencer_parse_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          component: parser
        annotations:
          summary: "High parse error rate detected"
          description: "Parse error rate is {{ $value }} errors/sec (threshold: 10)"

      # Sequencer disconnection
      - alert: SequencerDisconnected
        expr: rate(mev_sequencer_messages_received_total[2m]) == 0
        for: 1m
        labels:
          severity: critical
          component: sequencer
        annotations:
          summary: "Sequencer feed disconnected"
          description: "No messages received from sequencer for 1 minute"

      # Slow parsing
      - alert: SlowParsing
        expr: histogram_quantile(0.95, rate(mev_parse_latency_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
          component: parser
        annotations:
          summary: "Parse latency high"
          description: "P95 parse latency is {{ $value }}s (threshold: 0.1s)"

      # Low execution success rate
      - alert: LowExecutionSuccessRate
        expr: |
          (
            rate(mev_executions_successful_total[10m])
            /
            rate(mev_executions_attempted_total[10m])
          ) < 0.1
        for: 5m
        labels:
          severity: warning
          component: execution
        annotations:
          summary: "Low execution success rate"
          description: "Success rate is {{ $value | humanizePercentage }} (threshold: 10%)"

      # Cache miss rate too high
      - alert: HighCacheMissRate
        expr: |
          (
            rate(mev_pool_cache_misses_total[5m])
            /
            (rate(mev_pool_cache_hits_total[5m]) + rate(mev_pool_cache_misses_total[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: info
          component: cache
        annotations:
          summary: "High cache miss rate"
          description: "Cache miss rate is {{ $value | humanizePercentage }} (threshold: 50%)"

      # No opportunities detected
      - alert: NoOpportunitiesDetected
        expr: rate(mev_opportunities_total[15m]) == 0
        for: 15m
        labels:
          severity: warning
          component: detection
        annotations:
          summary: "No arbitrage opportunities detected"
          description: "No opportunities found in the last 15 minutes"

      # RPC errors
      - alert: HighRPCErrorRate
        expr: rate(mev_rpc_errors_total[5m]) > 5
        for: 3m
        labels:
          severity: warning
          component: rpc
        annotations:
          summary: "High RPC error rate"
          description: "RPC error rate is {{ $value }} errors/sec for method {{ $labels.method }}"
```
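Rule files are easy to break with indentation mistakes, so consider linting them the same way as the main config. A sketch, again assuming the file layout above:

```bash
# Validate alert rule syntax and PromQL expressions
docker run --rm \
  -v "$(pwd)/config/prometheus/alerts:/alerts:ro" \
  --entrypoint promtool \
  prom/prometheus:latest \
  check rules /alerts/mev-bot-alerts.yml
```

Because the Compose file enables `--web.enable-lifecycle`, edited rules can be reloaded without a restart: `curl -X POST http://localhost:9090/-/reload`.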
---

## Production Deployment

### 1. Deploy Full Stack

```bash
# Clone repository (URL omitted here; substitute your own)
git clone <repository-url>
cd mev-bot

# Create directories
mkdir -p config/prometheus/alerts
mkdir -p config/grafana/provisioning/{datasources,dashboards}
mkdir -p config/grafana/dashboards

# Copy configuration files (from this guide above)
# ... copy prometheus.yml, alerts, grafana configs ...

# Start all services
docker-compose up -d

# Verify services
docker-compose ps
```

### 2. Access Dashboards

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3000 (admin/admin)
- **Metrics**: http://localhost:8080/metrics

### 3. Import Dashboards

1. Open Grafana at http://localhost:3000
2. Login with admin/admin
3. Navigate to Dashboards → Import
4. Upload `mev-bot-overview.json`
5. Select "Prometheus" as data source

### 4. Configure Alerts

1. In Grafana: Alerting → Notification channels
2. Add Slack/PagerDuty/Email integration
3. Test alert routing

---

## Query Examples

### PromQL Queries

**Message throughput**:
```promql
rate(mev_sequencer_messages_received_total[1m])
```

**Parse success rate**:
```promql
(
  rate(mev_sequencer_transactions_processed_total[5m])
  /
  rate(mev_sequencer_messages_received_total[5m])
) * 100
```

**P50, P95, P99 parse latency**:
```promql
histogram_quantile(0.50, rate(mev_parse_latency_seconds_bucket[5m]))
histogram_quantile(0.95, rate(mev_parse_latency_seconds_bucket[5m]))
histogram_quantile(0.99, rate(mev_parse_latency_seconds_bucket[5m]))
```

**Top protocols by swap count**:
```promql
topk(5, rate(mev_sequencer_swaps_detected_total[5m]))
```

**Execution success vs failure**:
```promql
sum(rate(mev_executions_successful_total[5m]))
sum(rate(mev_executions_failed_total[5m])) by (reason)
```

**Profit per hour**:
```promql
increase(mev_profit_eth_total[1h])
```

**ROI (profit / gas cost)**:
```promql
(
  increase(mev_profit_eth_total[1h])
  /
  increase(mev_gas_cost_eth_total[1h])
) * 100
```

**Cache hit rate**:
```promql
(
  rate(mev_pool_cache_hits_total[5m])
  /
  (rate(mev_pool_cache_hits_total[5m]) + rate(mev_pool_cache_misses_total[5m]))
) * 100
```

---

## Troubleshooting

### Metrics Not Appearing

**Symptom**: `/metrics` endpoint returns empty or no data

**Solutions**:
1. Verify the MEV bot is running: `docker ps | grep mev-bot`
2. Check logs: `docker logs mev-bot`
3. Test the endpoint: `curl http://localhost:8080/metrics`
4. Verify the port mapping in docker-compose.yml

### Prometheus Not Scraping

**Symptom**: Prometheus shows the target as "down"

**Solutions**:
1. Check Prometheus targets: http://localhost:9090/targets
2. Verify network connectivity: `docker exec prometheus ping mev-bot`
3. Check Prometheus logs: `docker logs prometheus`
4. Verify the scrape configuration in prometheus.yml

### High Memory Usage

**Symptom**: Prometheus consuming excessive memory

**Solutions**:
1. Reduce retention time: `--storage.tsdb.retention.time=15d`
2. Reduce scrape frequency: `scrape_interval: 30s`
3. Limit series cardinality (reduce label combinations)

### Missing Histograms

**Symptom**: Histogram percentiles return no data

**Solutions**:
1. Verify histogram buckets match the query range
2. Use `rate()` before `histogram_quantile()`:
   ```promql
   histogram_quantile(0.95, rate(mev_parse_latency_seconds_bucket[5m]))
   ```
3. Ensure sufficient data points (increase the time range)
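If a panel still comes up empty, it can help to bypass Grafana and query Prometheus directly: an instant query that returns an empty `result` array means the series simply does not exist yet. A quick sketch using the standard HTTP API, assuming the metric names from this guide:

```bash
# Does Prometheus have any parse-latency buckets at all?
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=mev_parse_latency_seconds_bucket' | head -c 400

# Which mev_* metric names are currently known?
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | tr ',' '\n' | grep mev_
```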
### Grafana Dashboard Not Loading

**Symptom**: Dashboard shows "No data" or errors

**Solutions**:
1. Verify the Prometheus data source: Settings → Data Sources
2. Test the connection: "Save & Test" button
3. Check the query syntax in the panel editor
4. Verify the time range matches data availability

---

## Performance Tuning

### For High Throughput

```yaml
# prometheus.yml
global:
  scrape_interval: 5s  # More frequent scraping
  scrape_timeout: 4s

scrape_configs:
  - job_name: 'mev-bot'
    scrape_interval: 2s  # Even more frequent for critical metrics
    metric_relabel_configs:
      # Drop unnecessary metrics to reduce cardinality
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
```

### For Long-Term Storage

Local retention can be tuned via flags; for true long-term storage, add a `remote_write` section to prometheus.yml pointing at your remote store.

```bash
# Tune local retention (time- and size-based) with WAL compression
docker run -d \
  --name prometheus \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.wal-compression
```

---

## Next Steps

1. **Custom Dashboards**: Create dashboards for specific use cases
2. **Advanced Alerts**: Configure multi-condition alerts
3. **Log Aggregation**: Integrate with Loki for log correlation
4. **Distributed Tracing**: Add Jaeger/Tempo for request tracing
5. **SLO Monitoring**: Define and track Service Level Objectives

---

## References

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [PromQL Guide](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Best Practices](https://prometheus.io/docs/practices/naming/)

**Prometheus Integration**: 100% Complete ✅