# Metrics & Telemetry Enhancement Plan

## Goal

Expand metrics coverage for profitability, latency, and error conditions and ensure dashboards/alerts align with SRE expectations.

## Tasks

1. **Metric Inventory**
   - [ ] Catalogue existing metrics in `pkg/metrics/metrics.go` and identify gaps (profit factor, queue depth, RPC errors).
   - [ ] Ensure every critical subsystem records Prometheus metrics.
2. **Alerting & Dashboards**
   - [ ] Update Grafana dashboards to include new metrics; document recommended alert thresholds.
   - [ ] Integrate simulation outputs to set baseline expectations for hit rate and profit.
3. **Endpoint Hardening**
   - [ ] Validate authentication/IP allowlist for `/metrics` endpoint; support TLS/ingress integration.
   - [ ] Add health checks for metrics server (liveness/readiness signals).
4. **Documentation**
   - [ ] Extend `docs/6_operations/DEPLOYMENT_GUIDE.md` with monitoring instructions and alert-response runbooks.

## References

- `pkg/metrics/metrics.go`
- `monitoring/prometheus.yml`, Grafana configs