mev-beta/docs/MONITORING_PRODUCTION_READINESS.md

# MEV Bot Monitoring & Metrics Infrastructure Survey
## Production Deployment Gap Analysis

**Date:** October 23, 2025
**Status:** Medium Thoroughness Assessment
**Overall Readiness:** 65% (Moderate Gaps Identified)

---

## Executive Summary

The MEV bot has **substantial monitoring infrastructure** with custom health checking, data integrity monitoring, and basic metrics exposure. However, **critical production gaps exist** in:
- Prometheus-standard metrics export
- Distributed tracing/observability
- Kubernetes-native probes (readiness/liveness/startup)
- SLO/SLA frameworks
- Production-grade performance profiling

---

## 1. Metrics Collection & Export

### CURRENT IMPLEMENTATION: 65% Complete

#### What's Working:
- **Custom Metrics Server** (`pkg/metrics/metrics.go`)
  - JSON metrics endpoint at `/metrics`
  - Manual Prometheus format at `/metrics/prometheus`
  - Business metrics: L2 messages, arbitrage opportunities, trades, profits
  - Auth-protected endpoints (127.0.0.1 only)
  - Port configurable via `METRICS_PORT` env var (default: 9090)

- **Metrics Collected:**
  - L2 message processing (rate, lag)
  - DEX interactions & swap opportunities
  - Trade success/failure rates & profits
  - Gas costs and profit factors
  - System uptime in seconds

#### Critical GAPS:
- **NO Prometheus client library integration** - Manual text formatting instead of `prometheus/client_golang`
- **NO histogram/distribution metrics** - Only point-in-time values
- **NO custom metric registration** - Cannot add new metrics without modifying core code
- **NO metric cardinality control** - Risk of metric explosion
- **NO scrape-friendly format validation** - Manual string concatenation prone to syntax errors
- **NO metrics retention** - Snapshots lost on restart
- **NO dimensional/labeled metrics** - Cannot slice data by operation, module, or error type

#### Recommendation:
```go
import "github.com/prometheus/client_golang/prometheus"
import "github.com/prometheus/client_golang/prometheus/promhttp"

// Replace manual metrics with Prometheus client
var (
    l2MessagesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "mev_bot_l2_messages_total"},
        []string{"status"},
    )
    processingLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "mev_bot_processing_latency_seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"stage"},
    )
)

// Serve at /metrics using promhttp.Handler()
```

---

## 2. Performance Monitoring & Profiling Hooks

### CURRENT IMPLEMENTATION: 50% Complete

#### What's Working:
- **Performance Profiler** (`pkg/security/performance_profiler.go`)
  - Operation-level performance tracking
  - Resource usage monitoring (heap, goroutines, GC)
  - Performance classification (excellent/good/average/poor/critical)
  - Alert generation for threshold violations
  - Comprehensive report generation with bottleneck analysis
  - **2,000+ lines of detailed profiling infrastructure**

- **Profiling Features:**
  - Min/max/avg response times per operation
  - Error rate tracking and trend analysis
  - Memory efficiency scoring
  - CPU efficiency calculation
  - GC efficiency metrics
  - Recommendations for optimization

#### Critical GAPS:
- **NO pprof integration** - Cannot attach to `net/http/pprof` for live profiling
- **NO CPU profiling endpoint** - No `/debug/pprof/profile` available
- **NO memory profiling endpoint** - No heap dump capability
- **NO goroutine profiling** - Cannot inspect goroutine stacks
- **NO flamegraph support** - No integration with go-torch or pprof web UI
- **NO continuous profiling** - Manual operation tracking only
- **NO profile persistence** - Reports generated in-memory, not saved

#### Implementation Status:
```go
// Currently exists:
pp := NewPerformanceProfiler(logger, config)
tracker := pp.StartOperation("my_operation")
// ... do work ...
tracker.End()
report, _ := pp.GenerateReport()  // Custom report

// MISSING:
import _ "net/http/pprof"
// http://localhost:6060/debug/pprof/profile?seconds=30
// http://localhost:6060/debug/pprof/heap
// http://localhost:6060/debug/pprof/goroutine
```

#### Recommendation:
```go
import _ "net/http/pprof"

// In main startup
go func() {
    log.Info("pprof server starting", "addr", ":6060")
    log.Error("pprof error", "err", http.ListenAndServe(":6060", nil))
}()

// Now supports:
// curl http://localhost:6060/debug/pprof/ - profile index
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30 - CPU
// go tool pprof http://localhost:6060/debug/pprof/heap - Memory
// go tool pprof http://localhost:6060/debug/pprof/goroutine - Goroutines
```

---

## 3. Real-Time Alerting & Dashboard Systems

### CURRENT IMPLEMENTATION: 75% Complete

#### What's Working:
- **Alert Handlers** (`internal/monitoring/alert_handlers.go`)
  - Log-based alerts (structured logging)
  - File-based alerts (JSON JSONL format)
  - HTTP webhook support (Slack, Discord, generic)
  - Metrics-based counters
  - Composite handler pattern (multiple handlers)
  - Automatic webhook type detection (Slack vs Discord)
  - Retry logic with exponential backoff (3 retries)

- **Monitoring Dashboard** (`internal/monitoring/dashboard.go`)
  - HTML5 responsive dashboard at port 8080 (configurable)
  - Real-time health metrics display
  - Auto-refresh every 30 seconds
  - JSON API endpoints:
    - `/api/health` - current health status
    - `/api/metrics` - system metrics
    - `/api/history?count=N` - historical snapshots
    - `/api/alerts?limit=N` - alert records
  - Integrity monitoring metrics display
  - Performance classification cards
  - Recovery action tracking

#### Critical GAPS:
- **NO email alerts** - Only webhooks/logging
- **NO SMS/Slack bot integration** - Only generic webhooks
- **NO alert aggregation/deduplication** - Every alert fires independently
- **NO alert silencing/suppression** - Cannot silence known issues
- **NO correlation/grouping** - Similar alerts not grouped
- **NO escalation policies** - No severity-based notification routing
- **NO PagerDuty integration** - Cannot create incidents
- **NO dashboard persistence** - Metrics reset on restart
- **NO multi-user access control** - No RBAC for dashboard
- **NO alert acknowledgment/tracking** - No alert lifecycle management
- **NO custom dashboard widgets** - Fixed layout

#### Recommendation (Priority: HIGH):
```go
// Add alert correlation and deduplication
type AlertManager struct {
    alerts      map[string]*Alert  // deduplicated by fingerprint
    suppressions map[string]time.Time
    escalations  map[AlertSeverity][]Handler
}

// Add PagerDuty integration
import "github.com/PagerDuty/go-pagerduty"

// Add email support
import "net/smtp"

// Implement alert lifecycle
type Alert struct {
    ID          string
    Status      AlertStatus  // TRIGGERED, ACKNOWLEDGED, RESOLVED
    AckTime     time.Time
    Resolution  string
}
```

---

## 4. Health Check & Readiness Probe Implementations

### CURRENT IMPLEMENTATION: 55% Complete

#### What's Working:
- **Metrics Health Endpoint** (`pkg/metrics/metrics.go`)
  - `/health` endpoint returns JSON status
  - Updates last_health_check timestamp
  - Returns HTTP 200 when healthy
  - Simple liveness indicator

- **Lifecycle Health Monitor** (`pkg/lifecycle/health_monitor.go`)
  - Comprehensive module health tracking
  - Parallel/sequential health checks
  - Check timeout enforcement
  - Health status aggregation
  - Health trends (improving/stable/degrading/critical)
  - Notification on status changes
  - Configurable failure/recovery thresholds
  - 1,000+ lines of health management

- **Integrity Health Runner** (`internal/monitoring/health_checker.go`)
  - Periodic health checks (30s default)
  - Health history tracking (last 100 snapshots)
  - Corruption rate monitoring
  - Validation success tracking
  - Contract call success rate
  - Health trend calculation
  - Warm-up suppression (prevents early false alerts)

#### Critical GAPS:
- **NO Kubernetes liveness probe format** - Only JSON, not Kubernetes-compatible
- **NO startup probe** - Cannot detect initialization delays
- **NO readiness probe** - Cannot detect degraded-but-running state
- **NO individual service probes** - Cannot probe individual modules
- **NO external health check integration** - Only self-checks
- **NO health check history export** - Cannot retrieve past health data
- **NO SLO-based health thresholds** - Thresholds hardcoded
- **NO health events/timestamps** - Only current status
- **NO health check dependencies** - Cannot define "Module A healthy only if Module B healthy"

#### Kubernetes Probes Implementation (CRITICAL MISSING):

The application is **NOT Kubernetes-ready** without these probes:

```yaml
# MISSING CONFIG - Must be added to deployment
livenessProbe:
  httpGet:
    path: /health/live      # NOT IMPLEMENTED
    port: 9090
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready     # NOT IMPLEMENTED
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 5

startupProbe:
  httpGet:
    path: /health/startup   # NOT IMPLEMENTED
    port: 9090
  failureThreshold: 30
  periodSeconds: 10
```

#### Recommendation (Priority: CRITICAL):
```go
// Add Kubernetes-compatible probes
func handleLivenessProbe(w http.ResponseWriter, r *http.Request) {
    // Return 200 if process is still running
    // Return 500 if deadlock detected (watchdog timeout)
    status := "ok"
    if time.Since(lastHeartbeat) > 2*time.Minute {
        w.WriteHeader(http.StatusInternalServerError)
        w.Write([]byte(`{"status": "deadlock_detected"}`))
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte(`{"status": "` + status + `"}`))
}

func handleReadinessProbe(w http.ResponseWriter, r *http.Request) {
    // Return 200 only if:
    // 1. RPC connection healthy
    // 2. Database healthy
    // 3. All critical services initialized
    // 4. No degraded conditions
    if !isReady() {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte(`{"status": "not_ready", "reason": "..."`))
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte(`{"status": "ready"}`))
}

func handleStartupProbe(w http.ResponseWriter, r *http.Request) {
    // Return 200 only after initialization complete
    // Can take up to 5 minutes for complex startup
    if !isInitialized {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
```

---

## 5. Production Logging & Audit Trail Systems

### CURRENT IMPLEMENTATION: 80% Complete

#### What's Working:
- **Structured Logging** (`internal/logger/logger.go`)
  - Slog integration with JSON/text formats
  - Multiple log levels (debug, info, warn, error)
  - File output with configurable paths
  - Configurable via environment
  - Key-value pair support

- **Secure Filter** (`internal/logger/secure_filter.go`)
  - Filters sensitive data (keys, passwords, secrets)
  - Prevents credential leaks in logs
  - Pattern-based filtering
  - Audit-safe output

- **Audit Logging** (`internal/logger/secure_audit.go`)
  - Security event logging
  - Transaction tracking
  - Key operation audit trail
  - Configurable audit log file

- **Log Rotation (External)**
  - `scripts/log-manager.sh` - comprehensive log management
  - Archiving with compression
  - Health monitoring
  - Real-time analysis
  - Performance tracking
  - Daemon monitoring

#### Critical GAPS:
- **NO audit log integrity verification** - Cannot verify logs haven't been tampered
- **NO log aggregation client** - No ELK/Splunk/Datadog shipping
- **NO log correlation IDs** - Cannot trace requests across services
- **NO log rate limiting** - Could be DoS'd with logs
- **NO structured event schema validation** - Logs could have inconsistent structure
- **NO log retention policies** - Manual cleanup only
- **NO encrypted log storage** - Audit logs unencrypted at rest
- **NO log signing/verification** - No cryptographic integrity proof
- **NO compliance logging** (GDPR/SOC2)** - No data residency controls

#### Recommendation (Priority: MEDIUM):
```go
// Add correlation IDs
type RequestContext struct {
    CorrelationID string  // Unique per request
    UserID        string
    Timestamp     time.Time
}

// Add structured audit events
type AuditEvent struct {
    EventType   string                 `json:"event_type"`
    Timestamp   time.Time              `json:"timestamp"`
    Actor       string                 `json:"actor"`
    Resource    string                 `json:"resource"`
    Action      string                 `json:"action"`
    Result      string                 `json:"result"`  // success/failure
    Reason      string                 `json:"reason"`  // if failure
    CorrelationID string                `json:"correlation_id"`
}

// Add log forwarding
import "github.com/fluent/fluent-logger-golang/fluent"
```

---

## Summary Table: Monitoring Readiness

| Component | Implemented | Gaps | Priority | Grade |
|-----------|--------------|------|----------|-------|
| **Metrics Export** | JSON + basic Prometheus | No prometheus/client_golang, no cardinality control | HIGH | C+ |
| **Performance Profiling** | Custom profiler, no pprof | No /debug/pprof endpoints, no live profiling | HIGH | C |
| **Alerting** | Webhooks + logging | No dedup, no escalation, no PagerDuty | MEDIUM | B- |
| **Dashboards** | HTML5 real-time | No persistence, no RBAC, no widgets | MEDIUM | B |
| **Health Checks** | Lifecycle + integrity | No K8s probes, no readiness/liveness/startup | CRITICAL | D+ |
| **Logging** | Structured + secure | No correlation IDs, no aggregation, no integrity | MEDIUM | B |
| **Overall** | **65% coverage** | **Critical K8s gaps, incomplete observability** | **HIGH** | **C+** |

---

## Implementation Priority Matrix

### PHASE 1: Critical (Must have for production K8s)
- [ ] **Add Kubernetes probe handlers** (readiness/liveness/startup) - 3 hours
- [ ] **Integrate prometheus/client_golang** - 4 hours
- [ ] **Add pprof endpoints** - 1 hour
- [ ] **Implement alert deduplication** - 2 hours
- **Total: 10 hours** - Enables Kubernetes deployment

### PHASE 2: High (Production monitoring)
- [ ] **Add correlation IDs** to logging - 3 hours
- [ ] **Implement log aggregation** (Fluent/Datadog) - 4 hours
- [ ] **Add PagerDuty integration** - 2 hours
- [ ] **Implement alert silencing** - 2 hours
- [ ] **Add metrics retention/export** - 3 hours
- **Total: 14 hours** - Production-grade observability

### PHASE 3: Medium (Hardening)
- [ ] **Add audit log integrity verification** - 3 hours
- [ ] **Implement log encryption at rest** - 2 hours
- [ ] **Add SLO/SLA framework** - 4 hours
- [ ] **Implement health check dependencies** - 2 hours
- [ ] **Add dashboard persistence** - 2 hours
- **Total: 13 hours** - Enterprise-grade logging

---

## Critical Files to Modify

```
HIGHEST PRIORITY:
├── cmd/mev-bot/main.go              (Add K8s probes, pprof, Prometheus)
├── pkg/metrics/metrics.go            (Replace with prometheus/client_golang)
└── internal/monitoring/alert_handlers.go  (Add deduplication)

HIGH PRIORITY:
├── internal/monitoring/integrity_monitor.go  (Add correlation IDs)
├── internal/logger/logger.go         (Add aggregation)
└── pkg/lifecycle/health_monitor.go   (Add probe handling)

MEDIUM PRIORITY:
├── pkg/security/performance_profiler.go  (Integrate pprof)
└── internal/monitoring/dashboard.go  (Add persistence)
```

---

## Configuration Examples

```yaml
# Missing environment variables for production
METRICS_ENABLED: "true"
METRICS_PORT: "9090"
HEALTH_CHECK_INTERVAL: "30s"
PROMETHEUS_SCRAPE_INTERVAL: "15s"
LOG_AGGREGATION_ENABLED: "true"
LOG_AGGREGATION_ENDPOINT: "https://logs.datadog.com"
PAGERDUTY_API_KEY: "${PAGERDUTY_API_KEY}"
PAGERDUTY_SERVICE_ID: "${PAGERDUTY_SERVICE_ID}"
AUDIT_LOG_ENCRYPTION_KEY: "${AUDIT_LOG_ENCRYPTION_KEY}"
```

---

## Kubernetes Deployment Readiness Checklist

- [ ] Liveness probe implemented
- [ ] Readiness probe implemented
- [ ] Startup probe implemented
- [ ] Prometheus metrics at /metrics
- [ ] Health checks in separate port (9090)
- [ ] Graceful shutdown (SIGTERM handling)
- [ ] Resource requests/limits configured
- [ ] Pod disruption budgets defined
- [ ] Log aggregation configured
- [ ] Alert routing configured
- [ ] SLOs defined and monitored
- [ ] Disaster recovery tested

**Current Status: 3/12 (25% K8s ready)**

---

## Files Analyzed

### Core Monitoring (550+ lines)
- `internal/monitoring/dashboard.go` (550 lines) - HTML dashboard
- `internal/monitoring/alert_handlers.go` (400 lines) - Alert system
- `internal/monitoring/health_checker.go` (448 lines) - Health checks
- `internal/monitoring/integrity_monitor.go` (500+ lines) - Data integrity

### Performance & Lifecycle (2000+ lines)
- `pkg/security/performance_profiler.go` (1300 lines) - Comprehensive profiler
- `pkg/lifecycle/health_monitor.go` (1000+ lines) - Lifecycle management
- `pkg/metrics/metrics.go` (415 lines) - Basic metrics collection

### Conclusion
The MEV bot has **solid foundational monitoring** but requires **significant enhancements** for production Kubernetes deployment, particularly around Kubernetes-native probes and Prometheus integration.