# LOW-003: Monitoring & Observability - Detailed Fix Plan

**Issue ID:** LOW-003
**Category:** Observability
**Priority:** Low
**Status:** Not Started
**Generated:** October 9, 2025
**Estimate:** 6-8 hours

## Overview

This plan implements comprehensive monitoring and observability features, including security event metrics, anomaly detection for unusual transaction patterns, security audit log analysis tools, and performance monitoring for security operations. The goal is to enhance visibility into system security and performance.

## Current Implementation Issues

- Lack of security event metrics and dashboards
- No anomaly detection for unusual transaction patterns
- Missing security audit log analysis tools
- Absence of performance monitoring for security operations

## Implementation Tasks

### 1. Add Security Event Metrics and Dashboards

**Task ID:** LOW-003.1
**Time Estimate:** 1.5 hours
**Dependencies:** None

Implement comprehensive security event metrics and visualization:

- Track security-relevant events (failed authentications, blocked transactions, etc.)
- Create Prometheus metrics for security events (an exposure sketch follows the code block below)
- Design Grafana dashboards for security monitoring
- Implement alerting for security metric thresholds

```go
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Security-related metrics
	securityEvents = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "security_events_total",
			Help: "Total number of security events by type",
		},
		[]string{"event_type", "severity"},
	)

	rateLimitExceeded = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "rate_limit_exceeded_total",
			Help: "Total number of rate limit exceeded events by endpoint",
		},
		[]string{"endpoint"},
	)

	failedAuthentications = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "failed_authentications_total",
			Help: "Total number of failed authentication attempts by source",
		},
		[]string{"source", "reason"},
	)

	blockedTransactions = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "blocked_transactions_total",
			Help: "Total number of blocked transactions by reason",
		},
		[]string{"reason", "chain_id"},
	)

	securityOperationDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "security_operation_duration_seconds",
			Help:    "Duration of security operations",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"operation", "status"},
	)
)

// Example usage in security functions
func ValidateTransaction(tx *Transaction) error {
	start := time.Now()
	defer func() {
		duration := time.Since(start)
		securityOperationDuration.WithLabelValues("transaction_validation", "completed").Observe(duration.Seconds())
	}()

	// Validation logic here
	if err := validateNonce(tx); err != nil {
		blockedTransactions.WithLabelValues("invalid_nonce", tx.ChainId().String()).Inc()
		return err
	}

	if err := validateGasLimit(tx); err != nil {
		blockedTransactions.WithLabelValues("gas_limit_exceeded", tx.ChainId().String()).Inc()
		return err
	}

	return nil
}

// Example for rate limiting
func (rl *RateLimiter) Allow(key string) bool {
	start := time.Now()
	defer func() {
		duration := time.Since(start)
		securityOperationDuration.WithLabelValues("rate_limit_check", "completed").Observe(duration.Seconds())
	}()

	allowed := rl.impl.Allow(key)
	if !allowed {
		rateLimitExceeded.WithLabelValues(extractEndpoint(key)).Inc()
	}

	return allowed
}
```
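These metrics only become useful once Prometheus can scrape them. Below is a minimal sketch of exposing the default registry over HTTP using the standard `promhttp` handler; the function name, listen address, and path are assumptions, and an existing metrics endpoint should be reused if the node already has one.

```go
import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// StartMetricsServer exposes all promauto-registered metrics (including the
// security metrics above) on /metrics so Prometheus can scrape them.
// The listen address is an assumption, e.g. ":9090".
func StartMetricsServer(addr string) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	go func() {
		if err := http.ListenAndServe(addr, mux); err != nil {
			log.Printf("metrics server stopped: %v", err)
		}
	}()
}
```

Alerting thresholds (for example, a sustained spike in `failed_authentications_total`) would then be expressed as Prometheus alerting rules over these series rather than in application code.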
### 2. Implement Anomaly Detection for Unusual Transaction Patterns

**Task ID:** LOW-003.2
**Time Estimate:** 2 hours
**Dependencies:** LOW-003.1

Create an anomaly detection system for identifying unusual transaction patterns:

- Analyze transaction frequency, amounts, and patterns
- Implement statistical models for baseline behavior
- Detect potential MEV attacks or unusual activity
- Generate alerts for detected anomalies

```go
import (
	"fmt"
	"math"
	"sync"
	"time"

	log "github.com/sirupsen/logrus" // assumed structured logger; swap for the project's logger
)

type AnomalyDetector struct {
	metrics     *MetricsClient
	alertSystem *AlertSystem
	baselines   map[string]*BaselineProfile
	mu          sync.RWMutex
	windowSize  time.Duration // Time window for pattern analysis
}

type BaselineProfile struct {
	avgTransactions float64   // Average transactions per window
	stdDev          float64   // Standard deviation
	recentValues    []float64 // Recent values for trend analysis
	lastUpdated     time.Time
}

type AnomalyEvent struct {
	Timestamp   time.Time   `json:"timestamp"`
	Type        string      `json:"type"`
	Severity    string      `json:"severity"`
	Description string      `json:"description"`
	Context     interface{} `json:"context"`
	Score       float64     `json:"score"` // 0.0-1.0 anomaly score
}

func NewAnomalyDetector(metrics *MetricsClient, alertSystem *AlertSystem) *AnomalyDetector {
	return &AnomalyDetector{
		metrics:     metrics,
		alertSystem: alertSystem,
		baselines:   make(map[string]*BaselineProfile),
		windowSize:  1 * time.Hour,
	}
}

func (ad *AnomalyDetector) AnalyzeTransactionPattern(tx *Transaction, accountAddress string) {
	// Gather transaction statistics
	currentRate := ad.getCurrentTransactionRate(accountAddress)

	baseline, exists := ad.baselines[accountAddress]
	if !exists || baseline.stdDev == 0 {
		// Not enough history to score against yet; just record the observation
		ad.updateBaseline(accountAddress, currentRate)
		return
	}

	// Calculate z-score to determine anomaly level
	zScore := math.Abs(currentRate-baseline.avgTransactions) / baseline.stdDev

	// If z-score exceeds threshold, consider it an anomaly
	if zScore > 3.0 { // Using 3 standard deviations as threshold
		ad.reportAnomaly(&AnomalyEvent{
			Timestamp: time.Now(),
			Type:      "transaction_rate_anomaly",
			Severity:  ad.getSeverity(zScore),
			Description: fmt.Sprintf("Unusual transaction rate detected: %.2fx baseline for account %s",
				currentRate/baseline.avgTransactions, accountAddress),
			Context: map[string]interface{}{
				"account_address": accountAddress,
				"current_rate":    currentRate,
				"baseline_rate":   baseline.avgTransactions,
				"z_score":         zScore,
			},
			Score: zScore / 10.0, // Normalize to 0-1 scale
		})
	}

	// Update baseline for next analysis
	ad.updateBaseline(accountAddress, currentRate)
}

func (ad *AnomalyDetector) getCurrentTransactionRate(accountAddress string) float64 {
	// Query metrics to get transaction count in recent window.
	// This would typically come from a metrics backend like Prometheus.
	query := fmt.Sprintf(
		`increase(transaction_count_total{account="%s"}[1h])`,
		accountAddress,
	)

	result, err := ad.metrics.Query(query)
	if err != nil {
		// Log error but don't fail the detection
		return 0
	}

	if len(result) > 0 {
		return result[0].Value
	}
	return 0
}

func (ad *AnomalyDetector) updateBaseline(accountAddress string, newValue float64) {
	ad.mu.Lock()
	defer ad.mu.Unlock()

	baseline, exists := ad.baselines[accountAddress]
	if !exists {
		// Initialize new baseline
		ad.baselines[accountAddress] = &BaselineProfile{
			avgTransactions: newValue,
			stdDev:          0,
			recentValues:    []float64{newValue},
			lastUpdated:     time.Now(),
		}
		return
	}

	// Update rolling average and standard deviation
	baseline.recentValues = append(baseline.recentValues, newValue)

	// Keep only last N values for rolling calculation
	maxHistory := 24 // last 24 hours worth of data
	if len(baseline.recentValues) > maxHistory {
		baseline.recentValues = baseline.recentValues[len(baseline.recentValues)-maxHistory:]
	}

	// Recalculate baseline statistics
	baseline.avgTransactions = ad.calculateMean(baseline.recentValues)
	baseline.stdDev = ad.calculateStdDev(baseline.recentValues, baseline.avgTransactions)
	baseline.lastUpdated = time.Now()
}

func (ad *AnomalyDetector) calculateMean(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func (ad *AnomalyDetector) calculateStdDev(values []float64, mean float64) float64 {
	if len(values) <= 1 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		diff := v - mean
		sum += diff * diff
	}
	variance := sum / float64(len(values)-1)
	return math.Sqrt(variance)
}

func (ad *AnomalyDetector) reportAnomaly(event *AnomalyEvent) {
	// Log the anomaly
	log.WithFields(log.Fields{
		"timestamp": event.Timestamp,
		"type":      event.Type,
		"severity":  event.Severity,
		"score":     event.Score,
	}).Warn("Anomaly detected: " + event.Description)

	// Send to metrics system.
	// anomalyScore is assumed to be a GaugeVec registered alongside the Section 1 metrics.
	anomalyScore.WithLabelValues(event.Type, event.Severity).Set(event.Score)

	// Trigger alert if severity is high enough
	if ad.shouldAlert(event.Severity) {
		ad.alertSystem.SendAlert("Security Anomaly Detected", map[string]interface{}{
			"event": event,
		})
	}
}

func (ad *AnomalyDetector) getSeverity(score float64) string {
	switch {
	case score > 5.0:
		return "critical"
	case score > 3.0:
		return "high"
	case score > 2.0:
		return "medium"
	default:
		return "low"
	}
}

func (ad *AnomalyDetector) shouldAlert(severity string) bool {
	return severity == "critical" || severity == "high"
}
```
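For orientation, here is a minimal wiring sketch showing how the detector might be invoked from the transaction-processing path. `NewAnomalyDetector` and `AnalyzeTransactionPattern` come from the code above; the `TxPool` type, its `OnTransactionAccepted` callback, and the pre-recovered sender address are hypothetical integration points, not existing APIs.

```go
// wireAnomalyDetection attaches the detector to a (hypothetical) transaction
// acceptance hook so every accepted transaction contributes to the per-account baseline.
func wireAnomalyDetection(pool *TxPool, metrics *MetricsClient, alerts *AlertSystem) {
	detector := NewAnomalyDetector(metrics, alerts)

	pool.OnTransactionAccepted(func(tx *Transaction, sender string) {
		// Run detection asynchronously so it never adds latency to transaction processing.
		go detector.AnalyzeTransactionPattern(tx, sender)
	})
}
```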
### 3. Create Security Audit Log Analysis Tools

**Task ID:** LOW-003.3
**Time Estimate:** 1.5 hours
**Dependencies:** LOW-003.1

Develop tools for analyzing security audit logs:

- Create parsers for security-relevant log entries
- Implement aggregation and analysis functions
- Build summary reports for security events
- Create search and filtering capabilities (see the filtering sketch after the code block below)

```go
import (
	"fmt"
	"regexp"
	"sort"
	"strings"
	"time"
)

type SecurityAuditLogAnalyzer struct {
	logParser *LogParser
	storage   StorageBackend
}

type SecurityEvent struct {
	Timestamp time.Time              `json:"timestamp"`
	Level     string                 `json:"level"`
	Message   string                 `json:"message"`
	Fields    map[string]interface{} `json:"fields"`
	Source    string                 `json:"source"`
	Category  string                 `json:"category"`
}

type SecurityReport struct {
	PeriodStart    time.Time        `json:"period_start"`
	PeriodEnd      time.Time        `json:"period_end"`
	TotalEvents    int              `json:"total_events"`
	CriticalEvents int              `json:"critical_events"`
	ByCategory     map[string]int   `json:"by_category"`
	BySeverity     map[string]int   `json:"by_severity"`
	Anomalies      []AnomalySummary `json:"anomalies"`
}

type AnomalySummary struct {
	Type        string    `json:"type"`
	Count       int       `json:"count"`
	FirstSeen   time.Time `json:"first_seen"`
	LastSeen    time.Time `json:"last_seen"`
	Description string    `json:"description"`
}

func NewSecurityAuditLogAnalyzer(storage StorageBackend) *SecurityAuditLogAnalyzer {
	return &SecurityAuditLogAnalyzer{
		logParser: NewLogParser(),
		storage:   storage,
	}
}

func (sala *SecurityAuditLogAnalyzer) ParseSecurityEvents(logFile string, startDate, endDate time.Time) ([]*SecurityEvent, error) {
	var events []*SecurityEvent

	logEntries, err := sala.logParser.ParseLogFile(logFile)
	if err != nil {
		return nil, fmt.Errorf("failed to parse log file: %w", err)
	}

	for _, entry := range logEntries {
		// Filter by date range
		if entry.Timestamp.Before(startDate) || entry.Timestamp.After(endDate) {
			continue
		}

		// Check if this is a security-relevant event
		if sala.isSecurityEvent(entry) {
			event := sala.createSecurityEvent(entry)
			events = append(events, event)
		}
	}

	// Sort by timestamp
	sort.Slice(events, func(i, j int) bool {
		return events[i].Timestamp.Before(events[j].Timestamp)
	})

	return events, nil
}

func (sala *SecurityAuditLogAnalyzer) isSecurityEvent(logEntry *LogEntry) bool {
	// Define patterns for security-relevant messages
	securityKeywords := []string{
		"authentication", "authorization", "blocked", "failed", "denied",
		"unauthorized", "malicious", "attack", "intrusion", "breach",
		"validation", "signature", "key", "transaction", "nonce",
	}

	message := strings.ToLower(logEntry.Message)
	for _, keyword := range securityKeywords {
		if strings.Contains(message, keyword) {
			return true
		}
	}

	// Check for security-related fields
	securityFields := []string{"error", "status", "outcome", "result"}
	for field := range logEntry.Fields {
		for _, secField := range securityFields {
			if strings.Contains(strings.ToLower(field), secField) {
				return true
			}
		}
	}

	return false
}

func (sala *SecurityAuditLogAnalyzer) createSecurityEvent(logEntry *LogEntry) *SecurityEvent {
	// Categorize the event based on message content
	category := sala.categorizeSecurityEvent(logEntry)

	return &SecurityEvent{
		Timestamp: logEntry.Timestamp,
		Level:     logEntry.Level,
		Message:   logEntry.Message,
		Fields:    logEntry.Fields,
		Source:    logEntry.Source,
		Category:  category,
	}
}

func (sala *SecurityAuditLogAnalyzer) categorizeSecurityEvent(logEntry *LogEntry) string {
	message := strings.ToLower(logEntry.Message)

	// Define category patterns
	categories := map[string]*regexp.Regexp{
		"authentication": regexp.MustCompile(`(?i)(auth|login|logout|session|token|credential|password)`),
		"authorization":  regexp.MustCompile(`(?i)(permission|access|allow|deny|forbidden|unauthorized|privilege)`),
		"validation":     regexp.MustCompile(`(?i)(validate|validation|error|invalid|malformed|check|verify)`),
		"transaction":    regexp.MustCompile(`(?i)(transaction|block|revert|fail|error|nonce|gas|contract|call)`),
		"network":        regexp.MustCompile(`(?i)(connection|ip|port|network|request|response|timeout)`),
		"crypto":         regexp.MustCompile(`(?i)(signature|sign|verify|key|private|public|crypto|hash|encrypt|decrypt)`),
	}

	for category, pattern := range categories {
		if pattern.MatchString(message) {
			return category
		}
	}

	// If no specific category matches, classify as general security
	return "general"
}

func (sala *SecurityAuditLogAnalyzer) GenerateSecurityReport(startDate, endDate time.Time) (*SecurityReport, error) {
	events, err := sala.ParseSecurityEvents("security.log", startDate, endDate)
	if err != nil {
		return nil, fmt.Errorf("failed to parse events for report: %w", err)
	}

	report := &SecurityReport{
		PeriodStart: startDate,
		PeriodEnd:   endDate,
		ByCategory:  make(map[string]int),
		BySeverity:  make(map[string]int),
	}

	for _, event := range events {
		report.TotalEvents++

		// Count by category
		report.ByCategory[event.Category]++

		// Count by severity
		severity := sala.eventSeverity(event)
		report.BySeverity[severity]++

		// Count critical events specifically
		if severity == "critical" || severity == "high" {
			report.CriticalEvents++
		}
	}

	// Generate anomaly summaries
	report.Anomalies = sala.generateAnomalySummaries(events)

	return report, nil
}

func (sala *SecurityAuditLogAnalyzer) eventSeverity(event *SecurityEvent) string {
	// Determine severity based on log level and content
	level := strings.ToLower(event.Level)

	switch level {
	case "error", "critical", "fatal":
		return "high"
	case "warn", "warning":
		return "medium"
	case "info":
		// Check message content for severity indicators
		msg := strings.ToLower(event.Message)
		if strings.Contains(msg, "blocked") || strings.Contains(msg, "denied") {
			return "low"
		}
		return "info"
	default:
		return "info"
	}
}

func (sala *SecurityAuditLogAnalyzer) generateAnomalySummaries(events []*SecurityEvent) []AnomalySummary {
	// Group events by type and summarize
	eventCounts := make(map[string]*AnomalySummary)

	for _, event := range events {
		key := event.Category // Use category as the primary grouping type

		if summary, exists := eventCounts[key]; exists {
			summary.Count++
			if event.Timestamp.After(summary.LastSeen) {
				summary.LastSeen = event.Timestamp
			}
		} else {
			eventCounts[key] = &AnomalySummary{
				Type:        key,
				Count:       1,
				FirstSeen:   event.Timestamp,
				LastSeen:    event.Timestamp,
				Description: fmt.Sprintf("Security events in category: %s", key),
			}
		}
	}

	// Convert map to slice and sort by count
	var summaries []AnomalySummary
	for _, summary := range eventCounts {
		summaries = append(summaries, *summary)
	}

	// Sort by count descending
	sort.Slice(summaries, func(i, j int) bool {
		return summaries[i].Count > summaries[j].Count
	})

	return summaries
}
```
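The search and filtering capability listed above is not shown in the plan code. A minimal sketch of what it could look like follows; the `EventFilter` type and `FilterEvents` helper are assumptions for illustration, not existing APIs.

```go
// EventFilter describes optional search criteria.
// Zero-valued fields mean "match anything".
type EventFilter struct {
	Category    string // exact category, e.g. "authentication"
	MinSeverity string // "info", "low", "medium", or "high"
	Contains    string // case-insensitive substring match against the message
}

// FilterEvents returns the subset of parsed events matching the filter.
func (sala *SecurityAuditLogAnalyzer) FilterEvents(events []*SecurityEvent, f EventFilter) []*SecurityEvent {
	rank := map[string]int{"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

	var out []*SecurityEvent
	for _, e := range events {
		if f.Category != "" && e.Category != f.Category {
			continue
		}
		if f.MinSeverity != "" && rank[sala.eventSeverity(e)] < rank[f.MinSeverity] {
			continue
		}
		if f.Contains != "" && !strings.Contains(strings.ToLower(e.Message), strings.ToLower(f.Contains)) {
			continue
		}
		out = append(out, e)
	}
	return out
}
```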
### 4. Add Performance Monitoring for Security Operations

**Task ID:** LOW-003.4
**Time Estimate:** 1 hour
**Dependencies:** LOW-003.1, LOW-003.2, LOW-003.3

Implement monitoring for security operation performance:

- Track execution time of security-critical functions
- Monitor resource usage during security operations
- Alert on performance degradation of security features
- Create dashboards showing security operation performance

```go
import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"

	log "github.com/sirupsen/logrus" // assumed structured logger; swap for the project's logger
)

type SecurityPerformanceMonitor struct {
	metrics     *MetricsClient
	alertSystem *AlertSystem
	thresholds  PerformanceThresholds
}

type PerformanceThresholds struct {
	MaxValidationTime time.Duration // Max time for transaction validation
	MaxSignatureTime  time.Duration // Max time for signature verification
	MaxRateLimitTime  time.Duration // Max time for rate limiting check
	MaxEncryptionTime time.Duration // Max time for encryption operations
}

func NewSecurityPerformanceMonitor(metrics *MetricsClient, alertSystem *AlertSystem) *SecurityPerformanceMonitor {
	return &SecurityPerformanceMonitor{
		metrics:     metrics,
		alertSystem: alertSystem,
		thresholds: PerformanceThresholds{
			MaxValidationTime: 50 * time.Millisecond,
			MaxSignatureTime:  100 * time.Millisecond,
			MaxRateLimitTime:  10 * time.Millisecond,
			MaxEncryptionTime: 50 * time.Millisecond,
		},
	}
}

// Monitored transaction validation function
func (spm *SecurityPerformanceMonitor) ValidateTransactionWithMonitoring(ctx context.Context, tx *Transaction) error {
	start := time.Now()

	// Create a context with timeout for this operation
	ctx, cancel := context.WithTimeout(ctx, spm.thresholds.MaxValidationTime*2)
	defer cancel()

	err := spm.validateTransactionInternal(ctx, tx)
	duration := time.Since(start)

	// Record metric
	securityOperationDuration.WithLabelValues("transaction_validation", getStatusLabel(err)).Observe(duration.Seconds())

	// Check if operation took too long
	if duration > spm.thresholds.MaxValidationTime {
		spm.recordPerformanceViolation("transaction_validation", duration, spm.thresholds.MaxValidationTime)
	}

	return err
}

func (spm *SecurityPerformanceMonitor) validateTransactionInternal(ctx context.Context, tx *Transaction) error {
	// Run validation in a goroutine to allow timeout
	resultChan := make(chan error, 1)

	go func() {
		defer close(resultChan)

		// Actual validation logic here
		if err := validateNonce(tx); err != nil {
			resultChan <- err
			return
		}
		if err := validateGasLimit(tx); err != nil {
			resultChan <- err
			return
		}
		if err := validateSignature(tx); err != nil {
			resultChan <- err
			return
		}
		resultChan <- nil
	}()

	select {
	case result := <-resultChan:
		return result
	case <-ctx.Done():
		securityOperationDuration.WithLabelValues("transaction_validation", "timeout").Observe(
			spm.thresholds.MaxValidationTime.Seconds())
		return fmt.Errorf("transaction validation timed out: %w", ctx.Err())
	}
}

// Monitored signature verification
func (spm *SecurityPerformanceMonitor) VerifySignatureWithMonitoring(ctx context.Context, tx *Transaction) (bool, error) {
	start := time.Now()

	ctx, cancel := context.WithTimeout(ctx, spm.thresholds.MaxSignatureTime*2)
	defer cancel()

	valid, err := spm.verifySignatureInternal(ctx, tx)
	duration := time.Since(start)

	// Record metric
	status := "success"
	if err != nil {
		status = "error"
	} else if !valid {
		status = "invalid_signature"
	}
	securityOperationDuration.WithLabelValues("signature_verification", status).Observe(duration.Seconds())

	// Check if operation took too long
	if duration > spm.thresholds.MaxSignatureTime {
		spm.recordPerformanceViolation("signature_verification", duration, spm.thresholds.MaxSignatureTime)
	}

	return valid, err
}

func (spm *SecurityPerformanceMonitor) recordPerformanceViolation(operation string, actual, threshold time.Duration) {
	// Log performance violation
	log.WithFields(log.Fields{
		"operation": operation,
		"actual":    actual.Seconds(),
		"threshold": threshold.Seconds(),
	}).Warn("Security operation performance threshold exceeded")

	// Increment violation counter
	performanceViolations.WithLabelValues(operation).Inc()

	// Send alert if this is significantly above threshold
	if actual > threshold*2 {
		spm.alertSystem.SendAlert("Security Performance Degradation", map[string]interface{}{
			"operation":   operation,
			"actual":      actual.Seconds(),
			"threshold":   threshold.Seconds(),
			"exceeded_by": actual.Seconds() - threshold.Seconds(),
		})
	}
}

// Helper function to get status label for metrics
func getStatusLabel(err error) string {
	if err != nil {
		return "error"
	}
	return "success"
}

// Performance monitoring middleware for HTTP endpoints
func (spm *SecurityPerformanceMonitor) SecurityMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		// Capture response to get status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

		// Call the next handler
		next.ServeHTTP(wrapped, r)

		duration := time.Since(start)

		// Record metrics for security endpoints
		if isSecurityEndpoint(r.URL.Path) {
			securityEndpointDuration.WithLabelValues(
				r.URL.Path,
				fmt.Sprintf("%d", wrapped.statusCode),
				r.Method,
			).Observe(duration.Seconds())

			// Check threshold for security endpoints
			if duration > spm.thresholds.MaxRateLimitTime*10 { // 10x threshold for endpoints
				spm.recordPerformanceViolation(
					fmt.Sprintf("http_%s_%s", r.Method, r.URL.Path),
					duration,
					spm.thresholds.MaxRateLimitTime*10,
				)
			}
		}
	})
}

func isSecurityEndpoint(path string) bool {
	securityPaths := []string{
		"/auth", "/login", "/logout", "/transaction",
		"/sign", "/validate", "/security", "/admin",
	}

	for _, secPath := range securityPaths {
		if strings.HasPrefix(path, secPath) {
			return true
		}
	}
	return false
}

// Response writer wrapper to capture status code
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}
```
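The code above references metrics that are not declared anywhere in this plan: `performanceViolations`, `securityEndpointDuration`, and the `anomalyScore` gauge used in Section 2. A plausible set of definitions, mirroring the Section 1 declarations and reusing its `prometheus`/`promauto` imports, is sketched below; the metric names and label sets are assumptions and should be aligned with the existing naming scheme.

```go
var (
	// Counter of security operations that exceeded their latency threshold
	performanceViolations = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "security_performance_violations_total",
			Help: "Total number of security operations exceeding their latency threshold",
		},
		[]string{"operation"},
	)

	// Latency histogram for HTTP endpoints classified as security-relevant
	securityEndpointDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "security_endpoint_duration_seconds",
			Help:    "Duration of requests to security-relevant HTTP endpoints",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"path", "status_code", "method"},
	)

	// Most recent anomaly score per anomaly type and severity (referenced in Section 2)
	anomalyScore = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "security_anomaly_score",
			Help: "Most recent anomaly score by anomaly type and severity",
		},
		[]string{"type", "severity"},
	)
)
```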
## Implementation Integration

### Integration with Existing Components

```go
// Initialize monitoring in main application
func InitializeSecurityMonitoring() error {
	// Initialize metrics client
	metricsClient := initMetricsClient()

	// Initialize alert system
	alertSystem := initAlertSystem()

	// Create security performance monitor
	perfMonitor := NewSecurityPerformanceMonitor(metricsClient, alertSystem)

	// Create anomaly detector
	anomalyDetector := NewAnomalyDetector(metricsClient, alertSystem)

	// Create audit log analyzer
	auditAnalyzer := NewSecurityAuditLogAnalyzer(nil) // Use appropriate storage backend

	// Store in global context or pass to services that need monitoring
	globalSecurityMonitor = &SecurityMonitor{
		Performance: perfMonitor,
		Anomaly:     anomalyDetector,
		Audit:       auditAnalyzer,
	}

	return nil
}
```

## Testing Strategy

- Unit tests for each monitoring component (a minimal example is sketched at the end of this document)
- Integration tests for metrics collection
- Load testing to verify monitoring doesn't impact performance
- Test alerting functionality with mock systems

## Code Review Checklist

- [ ] Security event metrics properly implemented and labeled
- [ ] Anomaly detection algorithms are appropriate for the data
- [ ] Audit log analysis tools handle edge cases properly
- [ ] Performance monitoring doesn't impact system performance
- [ ] Alerting thresholds are reasonable
- [ ] Metrics are properly exported to monitoring system
- [ ] Tests cover monitoring functionality

## Rollback Strategy

If monitoring implementation causes issues:

1. Disable new monitoring components via configuration
2. Remove new metrics collection temporarily
3. Investigate and fix performance impacts

## Success Metrics

- Security event metrics available in monitoring dashboard
- Anomaly detection identifies actual unusual patterns
- Audit log analysis tools provide actionable insights
- Performance monitoring shows no degradation
- Alert system properly notifies of security events
- All new monitoring tests pass consistently
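As a starting point for the unit-test item in the testing strategy, here is a minimal sketch of a test for the baseline statistics used by the anomaly detector. The package name is an assumption; the test is meant to live alongside the detector code.

```go
package security

import (
	"math"
	"testing"
)

func TestBaselineStatistics(t *testing.T) {
	ad := &AnomalyDetector{baselines: make(map[string]*BaselineProfile)}

	values := []float64{10, 12, 8, 10}

	mean := ad.calculateMean(values)
	if mean != 10 {
		t.Fatalf("expected mean 10, got %v", mean)
	}

	// Sample standard deviation of {10, 12, 8, 10} is sqrt(8/3) ≈ 1.633
	stdDev := ad.calculateStdDev(values, mean)
	if math.Abs(stdDev-math.Sqrt(8.0/3.0)) > 1e-9 {
		t.Fatalf("unexpected std dev: %v", stdDev)
	}

	// Degenerate inputs should not panic or divide by zero
	if got := ad.calculateMean(nil); got != 0 {
		t.Fatalf("expected 0 mean for empty input, got %v", got)
	}
	if got := ad.calculateStdDev([]float64{5}, 5); got != 0 {
		t.Fatalf("expected 0 std dev for single value, got %v", got)
	}
}
```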