# Metrics & Telemetry Enhancement Plan

## Goal
Expand metrics coverage for profitability, latency, and error conditions, and ensure that dashboards and alerts align with SRE expectations.

## Tasks

1. **Metric Inventory**
   - [ ] Catalogue existing metrics in `pkg/metrics/metrics.go` and identify gaps (profit factor, queue depth, RPC errors).
   - [ ] Ensure every critical subsystem exports Prometheus metrics covering at least the Goal's three areas: profitability, latency, and error conditions.

2. **Alerting & Dashboards**
   - [ ] Update Grafana dashboards to include new metrics; document recommended alert thresholds.
   - [ ] Use simulation outputs to set baseline expectations for hit rate and profit, so that alert thresholds can be compared against those baselines.

3. **Endpoint Hardening**
   - [ ] Validate authentication and the IP allowlist for the `/metrics` endpoint; support TLS and ingress integration.
   - [ ] Add health checks for the metrics server (liveness and readiness signals).

4. **Documentation**
   - [ ] Extend `docs/6_operations/DEPLOYMENT_GUIDE.md` with monitoring instructions and alert-response runbooks.

## References
- `pkg/metrics/metrics.go`
- `monitoring/prometheus.yml`, Grafana configs