LOW-003: Monitoring & Observability - Detailed Fix Plan
Issue ID: LOW-003
Category: Observability
Priority: Low
Status: Not Started
Generated: October 9, 2025
Estimate: 6-8 hours
Overview
This plan adds comprehensive monitoring and observability features: security event metrics, anomaly detection for unusual transaction patterns, security audit log analysis tools, and performance monitoring for security operations. The goal is to improve visibility into the system's security posture and performance.
Current Implementation Issues
- Lack of security event metrics and dashboards
- No anomaly detection for unusual transaction patterns
- Missing security audit log analysis tools
- Absence of performance monitoring for security operations
Implementation Tasks
1. Add Security Event Metrics and Dashboards
Task ID: LOW-003.1
Time Estimate: 1.5 hours
Dependencies: None
Implement comprehensive security event metrics and visualization:
- Track security-relevant events (failed authentications, blocked transactions, etc.)
- Create Prometheus metrics for security events
- Design Grafana dashboards for security monitoring
- Implement alerting for security metric thresholds
import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Security-related metrics
securityEvents = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "security_events_total",
Help: "Total number of security events by type",
},
[]string{"event_type", "severity"},
)
rateLimitExceeded = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "rate_limit_exceeded_total",
Help: "Total number of rate limit exceeded events by endpoint",
},
[]string{"endpoint"},
)
failedAuthentications = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "failed_authentications_total",
Help: "Total number of failed authentication attempts by source",
},
[]string{"source", "reason"},
)
blockedTransactions = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "blocked_transactions_total",
Help: "Total number of blocked transactions by reason",
},
[]string{"reason", "chain_id"},
)
securityOperationDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "security_operation_duration_seconds",
Help: "Duration of security operations",
Buckets: prometheus.DefBuckets,
},
[]string{"operation", "status"},
)
)
// Example usage in security functions
func ValidateTransaction(tx *Transaction) error {
start := time.Now()
defer func() {
duration := time.Since(start)
securityOperationDuration.WithLabelValues("transaction_validation", "completed").Observe(duration.Seconds())
}()
// Validation logic here
if err := validateNonce(tx); err != nil {
blockedTransactions.WithLabelValues("invalid_nonce", tx.ChainId().String()).Inc()
return err
}
if err := validateGasLimit(tx); err != nil {
blockedTransactions.WithLabelValues("gas_limit_exceeded", tx.ChainId().String()).Inc()
return err
}
return nil
}
// Example for rate limiting
func (rl *RateLimiter) Allow(key string) bool {
start := time.Now()
defer func() {
duration := time.Since(start)
securityOperationDuration.WithLabelValues("rate_limit_check", "completed").Observe(duration.Seconds())
}()
allowed := rl.impl.Allow(key)
if !allowed {
        // extractEndpoint is assumed to derive the endpoint label from the rate-limit key
        rateLimitExceeded.WithLabelValues(extractEndpoint(key)).Inc()
}
return allowed
}
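These metrics are only useful once Prometheus can scrape them. A minimal sketch of exposing the default registry over HTTP with the stock promhttp handler (the listen address and port are assumptions):

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func StartMetricsServer() {
    // promauto registers the collectors above with the default registry,
    // so the stock handler serves them without additional wiring.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil)) // listen address is an assumption
}

Grafana dashboards and alerting rules would then be built on these series, for example alerting when rate(failed_authentications_total[5m]) exceeds an agreed threshold.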
2. Implement Anomaly Detection for Unusual Transaction Patterns
Task ID: LOW-003.2
Time Estimate: 2 hours
Dependencies: LOW-003.1
Create anomaly detection system for identifying unusual transaction patterns:
- Analyze transaction frequency, amounts, and patterns
- Implement statistical models for baseline behavior
- Detect potential MEV attacks or unusual activity
- Generate alerts for detected anomalies
import (
    "fmt"
    "math"
    "sync"
    "time"

    log "github.com/sirupsen/logrus" // structured logging; logrus assumed from the log.WithFields usage below
)
type AnomalyDetector struct {
metrics *MetricsClient
alertSystem *AlertSystem
baselines map[string]*BaselineProfile
mu sync.RWMutex
windowSize time.Duration // Time window for pattern analysis
}
type BaselineProfile struct {
avgTransactions float64 // Average transactions per window
stdDev float64 // Standard deviation
recentValues []float64 // Recent values for trend analysis
lastUpdated time.Time
}
type AnomalyEvent struct {
Timestamp time.Time `json:"timestamp"`
Type string `json:"type"`
Severity string `json:"severity"`
Description string `json:"description"`
Context interface{} `json:"context"`
Score float64 `json:"score"` // 0.0-1.0 anomaly score
}
func NewAnomalyDetector(metrics *MetricsClient, alertSystem *AlertSystem) *AnomalyDetector {
return &AnomalyDetector{
metrics: metrics,
alertSystem: alertSystem,
baselines: make(map[string]*BaselineProfile),
windowSize: 1 * time.Hour,
}
}
func (ad *AnomalyDetector) AnalyzeTransactionPattern(tx *Transaction, accountAddress string) {
    // Gather transaction statistics
    currentRate := ad.getCurrentTransactionRate(accountAddress)
    // Read the baseline under the read lock to avoid racing with updateBaseline
    ad.mu.RLock()
    baseline, exists := ad.baselines[accountAddress]
    ad.mu.RUnlock()
    if !exists || baseline.stdDev == 0 {
        // No baseline yet, or not enough variance recorded to compute a
        // meaningful z-score (stdDev == 0 would divide by zero below)
        ad.updateBaseline(accountAddress, currentRate)
        return
    }
    // Calculate z-score to determine anomaly level
    zScore := math.Abs(currentRate-baseline.avgTransactions) / baseline.stdDev
    // If z-score exceeds threshold, consider it an anomaly
    if zScore > 3.0 { // Using 3 standard deviations as threshold
ad.reportAnomaly(&AnomalyEvent{
Timestamp: time.Now(),
Type: "transaction_rate_anomaly",
Severity: ad.getSeverity(zScore),
Description: fmt.Sprintf("Unusual transaction rate detected: %.2fx baseline for account %s",
currentRate/baseline.avgTransactions, accountAddress),
Context: map[string]interface{}{
"account_address": accountAddress,
"current_rate": currentRate,
"baseline_rate": baseline.avgTransactions,
"z_score": zScore,
},
            Score: math.Min(zScore/10.0, 1.0), // Normalize to the 0-1 scale (clamped)
})
}
// Update baseline for next analysis
ad.updateBaseline(accountAddress, currentRate)
}
func (ad *AnomalyDetector) getCurrentTransactionRate(accountAddress string) float64 {
// Query metrics to get transaction count in recent window
// This would typically come from a metrics backend like Prometheus
query := fmt.Sprintf(
`increase(transaction_count_total{account="%s"}[1h])`,
accountAddress,
)
    result, err := ad.metrics.Query(query)
    if err != nil {
        // Log the error but don't fail the detection path
        log.WithError(err).Warn("Failed to query transaction rate for anomaly detection")
        return 0
    }
if len(result) > 0 {
return result[0].Value
}
return 0
}
func (ad *AnomalyDetector) updateBaseline(accountAddress string, newValue float64) {
ad.mu.Lock()
defer ad.mu.Unlock()
baseline, exists := ad.baselines[accountAddress]
if !exists {
// Initialize new baseline
ad.baselines[accountAddress] = &BaselineProfile{
avgTransactions: newValue,
stdDev: 0,
recentValues: []float64{newValue},
lastUpdated: time.Now(),
}
return
}
// Update rolling average and standard deviation
baseline.recentValues = append(baseline.recentValues, newValue)
// Keep only last N values for rolling calculation
    maxHistory := 24 // roughly the last 24 hours of data (one sample per 1h window)
if len(baseline.recentValues) > maxHistory {
baseline.recentValues = baseline.recentValues[len(baseline.recentValues)-maxHistory:]
}
// Recalculate baseline statistics
baseline.avgTransactions = ad.calculateMean(baseline.recentValues)
baseline.stdDev = ad.calculateStdDev(baseline.recentValues, baseline.avgTransactions)
baseline.lastUpdated = time.Now()
}
func (ad *AnomalyDetector) calculateMean(values []float64) float64 {
if len(values) == 0 {
return 0
}
sum := 0.0
for _, v := range values {
sum += v
}
return sum / float64(len(values))
}
func (ad *AnomalyDetector) calculateStdDev(values []float64, mean float64) float64 {
if len(values) <= 1 {
return 0
}
sum := 0.0
for _, v := range values {
diff := v - mean
sum += diff * diff
}
variance := sum / float64(len(values)-1)
return math.Sqrt(variance)
}
func (ad *AnomalyDetector) reportAnomaly(event *AnomalyEvent) {
// Log the anomaly
log.WithFields(log.Fields{
"timestamp": event.Timestamp,
"type": event.Type,
"severity": event.Severity,
"score": event.Score,
}).Warn("Anomaly detected: " + event.Description)
// Send to metrics system
anomalyScore.WithLabelValues(event.Type, event.Severity).Set(event.Score)
// Trigger alert if severity is high enough
if ad.shouldAlert(event.Severity) {
ad.alertSystem.SendAlert("Security Anomaly Detected", map[string]interface{}{
"event": event,
})
}
}
func (ad *AnomalyDetector) getSeverity(score float64) string {
switch {
case score > 5.0:
return "critical"
case score > 3.0:
return "high"
case score > 2.0:
return "medium"
default:
return "low"
}
}
func (ad *AnomalyDetector) shouldAlert(severity string) bool {
return severity == "critical" || severity == "high"
}
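reportAnomaly above sets an anomalyScore metric that is not declared in this snippet. A minimal definition in the style of the task 1 metrics, using the same prometheus/promauto imports (the metric name is an assumption):

var anomalyScore = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "security_anomaly_score",
        Help: "Most recent anomaly score (0.0-1.0) by type and severity",
    },
    []string{"type", "severity"},
)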
3. Create Security Audit Log Analysis Tools
Task ID: LOW-003.3
Time Estimate: 1.5 hours
Dependencies: LOW-003.1
Develop tools for analyzing security audit logs:
- Create parsers for security-relevant log entries
- Implement aggregation and analysis functions
- Build summary reports for security events
- Create search and filtering capabilities
import (
    "fmt"
    "regexp"
    "sort"
    "strings"
    "time"
)
type SecurityAuditLogAnalyzer struct {
logParser *LogParser
storage StorageBackend
}
type SecurityEvent struct {
Timestamp time.Time `json:"timestamp"`
Level string `json:"level"`
Message string `json:"message"`
Fields map[string]interface{} `json:"fields"`
Source string `json:"source"`
Category string `json:"category"`
}
type SecurityReport struct {
PeriodStart time.Time `json:"period_start"`
PeriodEnd time.Time `json:"period_end"`
TotalEvents int `json:"total_events"`
CriticalEvents int `json:"critical_events"`
ByCategory map[string]int `json:"by_category"`
BySeverity map[string]int `json:"by_severity"`
Anomalies []AnomalySummary `json:"anomalies"`
}
type AnomalySummary struct {
Type string `json:"type"`
Count int `json:"count"`
FirstSeen time.Time `json:"first_seen"`
LastSeen time.Time `json:"last_seen"`
Description string `json:"description"`
}
func NewSecurityAuditLogAnalyzer(storage StorageBackend) *SecurityAuditLogAnalyzer {
return &SecurityAuditLogAnalyzer{
logParser: NewLogParser(),
storage: storage,
}
}
func (sala *SecurityAuditLogAnalyzer) ParseSecurityEvents(logFile string, startDate, endDate time.Time) ([]*SecurityEvent, error) {
var events []*SecurityEvent
logEntries, err := sala.logParser.ParseLogFile(logFile)
if err != nil {
return nil, fmt.Errorf("failed to parse log file: %w", err)
}
for _, entry := range logEntries {
// Filter by date range
if entry.Timestamp.Before(startDate) || entry.Timestamp.After(endDate) {
continue
}
// Check if this is a security-relevant event
if sala.isSecurityEvent(entry) {
event := sala.createSecurityEvent(entry)
events = append(events, event)
}
}
// Sort by timestamp
sort.Slice(events, func(i, j int) bool {
return events[i].Timestamp.Before(events[j].Timestamp)
})
return events, nil
}
func (sala *SecurityAuditLogAnalyzer) isSecurityEvent(logEntry *LogEntry) bool {
    // Keyword heuristics for security-relevant messages; deliberately broad
    // (e.g. "transaction", "key"), so downstream categorization and severity
    // scoring are expected to filter out the false positives
    securityKeywords := []string{
        "authentication", "authorization", "blocked", "failed", "denied",
        "unauthorized", "malicious", "attack", "intrusion", "breach",
        "validation", "signature", "key", "transaction", "nonce",
    }
message := strings.ToLower(logEntry.Message)
for _, keyword := range securityKeywords {
if strings.Contains(message, keyword) {
return true
}
}
// Check for security-related fields
securityFields := []string{"error", "status", "outcome", "result"}
for field := range logEntry.Fields {
for _, secField := range securityFields {
if strings.Contains(strings.ToLower(field), secField) {
return true
}
}
}
return false
}
func (sala *SecurityAuditLogAnalyzer) createSecurityEvent(logEntry *LogEntry) *SecurityEvent {
// Categorize the event based on message content
category := sala.categorizeSecurityEvent(logEntry)
return &SecurityEvent{
Timestamp: logEntry.Timestamp,
Level: logEntry.Level,
Message: logEntry.Message,
Fields: logEntry.Fields,
Source: logEntry.Source,
Category: category,
}
}
func (sala *SecurityAuditLogAnalyzer) categorizeSecurityEvent(logEntry *LogEntry) string {
message := strings.ToLower(logEntry.Message)
    // Category patterns in a fixed order: map iteration order is random in Go,
    // so an ordered slice keeps categorization deterministic when several patterns match
    categories := []struct {
        name    string
        pattern *regexp.Regexp
    }{
        {"authentication", regexp.MustCompile(`(?i)(auth|login|logout|session|token|credential|password)`)},
        {"authorization", regexp.MustCompile(`(?i)(permission|access|allow|deny|forbidden|unauthorized|privilege)`)},
        {"validation", regexp.MustCompile(`(?i)(validate|validation|error|invalid|malformed|check|verify)`)},
        {"transaction", regexp.MustCompile(`(?i)(transaction|block|revert|fail|error|nonce|gas|contract|call)`)},
        {"network", regexp.MustCompile(`(?i)(connection|ip|port|network|request|response|timeout)`)},
        {"crypto", regexp.MustCompile(`(?i)(signature|sign|verify|key|private|public|crypto|hash|encrypt|decrypt)`)},
    }
    for _, c := range categories {
        if c.pattern.MatchString(message) {
            return c.name
        }
    }
// If no specific category matches, classify as general security
return "general"
}
func (sala *SecurityAuditLogAnalyzer) GenerateSecurityReport(startDate, endDate time.Time) (*SecurityReport, error) {
    // The log file path is illustrative; in practice it would come from configuration
    events, err := sala.ParseSecurityEvents("security.log", startDate, endDate)
if err != nil {
return nil, fmt.Errorf("failed to parse events for report: %w", err)
}
report := &SecurityReport{
PeriodStart: startDate,
PeriodEnd: endDate,
ByCategory: make(map[string]int),
BySeverity: make(map[string]int),
}
for _, event := range events {
report.TotalEvents++
// Count by category
report.ByCategory[event.Category]++
// Count by severity
severity := sala.eventSeverity(event)
report.BySeverity[severity]++
// Count critical events specifically
if severity == "critical" || severity == "high" {
report.CriticalEvents++
}
}
// Generate anomaly summaries
report.Anomalies = sala.generateAnomalySummaries(events)
return report, nil
}
func (sala *SecurityAuditLogAnalyzer) eventSeverity(event *SecurityEvent) string {
// Determine severity based on log level and content
level := strings.ToLower(event.Level)
    switch level {
    case "critical", "fatal":
        // Map the highest log levels to "critical" so the report's
        // CriticalEvents counter can actually be populated
        return "critical"
    case "error":
        return "high"
    case "warn", "warning":
        return "medium"
case "info":
// Check message content for severity indicators
msg := strings.ToLower(event.Message)
if strings.Contains(msg, "blocked") || strings.Contains(msg, "denied") {
return "low"
}
return "info"
default:
return "info"
}
}
func (sala *SecurityAuditLogAnalyzer) generateAnomalySummaries(events []*SecurityEvent) []AnomalySummary {
// Group events by type and summarize
eventCounts := make(map[string]*AnomalySummary)
for _, event := range events {
key := event.Category // Use category as the primary grouping type
if summary, exists := eventCounts[key]; exists {
summary.Count++
if event.Timestamp.After(summary.LastSeen) {
summary.LastSeen = event.Timestamp
}
} else {
eventCounts[key] = &AnomalySummary{
Type: key,
Count: 1,
FirstSeen: event.Timestamp,
LastSeen: event.Timestamp,
Description: fmt.Sprintf("Security events in category: %s", key),
}
}
}
// Convert map to slice and sort by count
var summaries []AnomalySummary
for _, summary := range eventCounts {
summaries = append(summaries, *summary)
}
// Sort by count descending
sort.Slice(summaries, func(i, j int) bool {
return summaries[i].Count > summaries[j].Count
})
return summaries
}
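A short usage sketch tying the analyzer together, assuming encoding/json, fmt, and time are imported; the nil storage backend and seven-day window are placeholders:

func PrintWeeklySecurityReport() error {
    analyzer := NewSecurityAuditLogAnalyzer(nil) // storage backend is a placeholder
    end := time.Now()
    start := end.Add(-7 * 24 * time.Hour)

    report, err := analyzer.GenerateSecurityReport(start, end)
    if err != nil {
        return err
    }
    // The json struct tags above make the report directly serializable
    out, err := json.MarshalIndent(report, "", "  ")
    if err != nil {
        return err
    }
    fmt.Println(string(out))
    return nil
}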
4. Add Performance Monitoring for Security Operations
Task ID: LOW-003.4
Time Estimate: 1 hour
Dependencies: LOW-003.1, LOW-003.2, LOW-003.3
Implement monitoring for security operation performance:
- Track execution time of security-critical functions
- Monitor resource usage during security operations
- Alert on performance degradation of security features
- Create dashboards showing security operation performance
import (
    "context"
    "fmt"
    "net/http"
    "strings"
    "time"

    log "github.com/sirupsen/logrus" // structured logging; logrus assumed from the log.WithFields usage below
)
type SecurityPerformanceMonitor struct {
metrics *MetricsClient
alertSystem *AlertSystem
thresholds PerformanceThresholds
}
type PerformanceThresholds struct {
MaxValidationTime time.Duration // Max time for transaction validation
MaxSignatureTime time.Duration // Max time for signature verification
MaxRateLimitTime time.Duration // Max time for rate limiting check
MaxEncryptionTime time.Duration // Max time for encryption operations
}
func NewSecurityPerformanceMonitor(metrics *MetricsClient, alertSystem *AlertSystem) *SecurityPerformanceMonitor {
return &SecurityPerformanceMonitor{
metrics: metrics,
alertSystem: alertSystem,
thresholds: PerformanceThresholds{
MaxValidationTime: 50 * time.Millisecond,
MaxSignatureTime: 100 * time.Millisecond,
MaxRateLimitTime: 10 * time.Millisecond,
MaxEncryptionTime: 50 * time.Millisecond,
},
}
}
// Monitored transaction validation function
func (spm *SecurityPerformanceMonitor) ValidateTransactionWithMonitoring(ctx context.Context, tx *Transaction) error {
start := time.Now()
// Create a context with timeout for this operation
ctx, cancel := context.WithTimeout(ctx, spm.thresholds.MaxValidationTime*2)
defer cancel()
err := spm.validateTransactionInternal(ctx, tx)
duration := time.Since(start)
// Record metric
securityOperationDuration.WithLabelValues("transaction_validation", getStatusLabel(err)).Observe(duration.Seconds())
// Check if operation took too long
if duration > spm.thresholds.MaxValidationTime {
spm.recordPerformanceViolation("transaction_validation", duration, spm.thresholds.MaxValidationTime)
}
return err
}
func (spm *SecurityPerformanceMonitor) validateTransactionInternal(ctx context.Context, tx *Transaction) error {
// Run validation in a goroutine to allow timeout
resultChan := make(chan error, 1)
go func() {
defer close(resultChan)
// Actual validation logic here
if err := validateNonce(tx); err != nil {
resultChan <- err
return
}
if err := validateGasLimit(tx); err != nil {
resultChan <- err
return
}
if err := validateSignature(tx); err != nil {
resultChan <- err
return
}
resultChan <- nil
}()
select {
case result := <-resultChan:
return result
case <-ctx.Done():
securityOperationDuration.WithLabelValues("transaction_validation", "timeout").Observe(
spm.thresholds.MaxValidationTime.Seconds())
return fmt.Errorf("transaction validation timed out: %w", ctx.Err())
}
}
// Monitored signature verification
func (spm *SecurityPerformanceMonitor) VerifySignatureWithMonitoring(ctx context.Context, tx *Transaction) (bool, error) {
start := time.Now()
ctx, cancel := context.WithTimeout(ctx, spm.thresholds.MaxSignatureTime*2)
defer cancel()
valid, err := spm.verifySignatureInternal(ctx, tx)
duration := time.Since(start)
// Record metric
status := "success"
if err != nil {
status = "error"
} else if !valid {
status = "invalid_signature"
}
securityOperationDuration.WithLabelValues("signature_verification", status).Observe(duration.Seconds())
// Check if operation took too long
if duration > spm.thresholds.MaxSignatureTime {
spm.recordPerformanceViolation("signature_verification", duration, spm.thresholds.MaxSignatureTime)
}
return valid, err
}
func (spm *SecurityPerformanceMonitor) recordPerformanceViolation(operation string, actual, threshold time.Duration) {
// Log performance violation
log.WithFields(log.Fields{
"operation": operation,
"actual": actual.Seconds(),
"threshold": threshold.Seconds(),
}).Warn("Security operation performance threshold exceeded")
// Increment violation counter
performanceViolations.WithLabelValues(operation).Inc()
// Send alert if this is significantly above threshold
if actual > threshold*2 {
spm.alertSystem.SendAlert("Security Performance Degradation", map[string]interface{}{
"operation": operation,
"actual": actual.Seconds(),
"threshold": threshold.Seconds(),
"exceeded_by": actual.Seconds() - threshold.Seconds(),
})
}
}
// Helper function to get status label for metrics
func getStatusLabel(err error) string {
if err != nil {
return "error"
}
return "success"
}
// Performance monitoring middleware for HTTP endpoints
func (spm *SecurityPerformanceMonitor) SecurityMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Capture response to get status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
// Call the next handler
next.ServeHTTP(wrapped, r)
duration := time.Since(start)
// Record metrics for security endpoints
if isSecurityEndpoint(r.URL.Path) {
securityEndpointDuration.WithLabelValues(
r.URL.Path,
fmt.Sprintf("%d", wrapped.statusCode),
r.Method,
).Observe(duration.Seconds())
// Check threshold for security endpoints
if duration > spm.thresholds.MaxRateLimitTime*10 { // 10x threshold for endpoints
spm.recordPerformanceViolation(
fmt.Sprintf("http_%s_%s", r.Method, r.URL.Path),
duration,
spm.thresholds.MaxRateLimitTime*10,
)
}
}
})
}
func isSecurityEndpoint(path string) bool {
securityPaths := []string{
"/auth", "/login", "/logout",
"/transaction", "/sign", "/validate",
"/security", "/admin",
}
for _, secPath := range securityPaths {
if strings.HasPrefix(path, secPath) {
return true
}
}
return false
}
// Response writer wrapper to capture status code
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
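recordPerformanceViolation and SecurityMiddleware above reference two metrics that are not declared in this snippet. Minimal definitions in the style of task 1 (metric names and label sets are assumptions):

var (
    performanceViolations = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "security_performance_violations_total",
            Help: "Security operations that exceeded their performance threshold",
        },
        []string{"operation"},
    )

    securityEndpointDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "security_endpoint_duration_seconds",
            Help:    "Duration of requests to security-sensitive HTTP endpoints",
            Buckets: prometheus.DefBuckets,
        },
        []string{"path", "status_code", "method"},
    )
)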
Implementation Integration
Integration with Existing Components
// Initialize monitoring in main application
func InitializeSecurityMonitoring() error {
// Initialize metrics client
metricsClient := initMetricsClient()
// Initialize alert system
alertSystem := initAlertSystem()
// Create security performance monitor
perfMonitor := NewSecurityPerformanceMonitor(metricsClient, alertSystem)
// Create anomaly detector
anomalyDetector := NewAnomalyDetector(metricsClient, alertSystem)
// Create audit log analyzer
auditAnalyzer := NewSecurityAuditLogAnalyzer(nil) // Use appropriate storage backend
// Store in global context or pass to services that need monitoring
globalSecurityMonitor = &SecurityMonitor{
Performance: perfMonitor,
Anomaly: anomalyDetector,
Audit: auditAnalyzer,
}
return nil
}
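The snippet stores the components in a globalSecurityMonitor that is not defined above. A minimal sketch of that container type (the field names mirror the snippet; everything else is an assumption):

// SecurityMonitor bundles the monitoring components so services can share them
type SecurityMonitor struct {
    Performance *SecurityPerformanceMonitor
    Anomaly     *AnomalyDetector
    Audit       *SecurityAuditLogAnalyzer
}

var globalSecurityMonitor *SecurityMonitor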
Testing Strategy
- Unit tests for each monitoring component (see the sketch after this list)
- Integration tests for metrics collection
- Load testing to verify monitoring doesn't impact performance
- Test alerting functionality with mock systems
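As referenced in the first bullet, a sketch of a unit test for the anomaly detector's statistics helpers, assuming the standard math and testing imports (the input series is chosen for illustration):

func TestCalculateMeanAndStdDev(t *testing.T) {
    ad := &AnomalyDetector{}
    values := []float64{2, 4, 4, 4, 5, 5, 7, 9}

    if mean := ad.calculateMean(values); mean != 5.0 {
        t.Fatalf("mean = %v, want 5.0", mean)
    }

    // Sample standard deviation of this series is sqrt(32/7), roughly 2.138
    stdDev := ad.calculateStdDev(values, 5.0)
    if math.Abs(stdDev-2.138) > 0.01 {
        t.Fatalf("stdDev = %v, want ~2.138", stdDev)
    }
}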
Code Review Checklist
- Security event metrics properly implemented and labeled
- Anomaly detection algorithms are appropriate for the data
- Audit log analysis tools handle edge cases properly
- Performance monitoring doesn't impact system performance
- Alerting thresholds are reasonable
- Metrics are properly exported to monitoring system
- Tests cover monitoring functionality
Rollback Strategy
If monitoring implementation causes issues:
- Disable new monitoring components via configuration (see the sketch after this list)
- Remove new metrics collection temporarily
- Investigate and fix performance impacts
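A sketch of the configuration kill switch implied by the first rollback step (the flag name and wrapper function are assumptions):

type MonitoringConfig struct {
    SecurityMonitoringEnabled bool `json:"security_monitoring_enabled"`
}

func MaybeInitializeSecurityMonitoring(cfg MonitoringConfig) error {
    if !cfg.SecurityMonitoringEnabled {
        // Kill switch: skip all new monitoring wiring when disabled
        return nil
    }
    return InitializeSecurityMonitoring()
}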
Success Metrics
- Security event metrics available in monitoring dashboard
- Anomaly detection identifies actual unusual patterns
- Audit log analysis tools provide actionable insights
- Performance monitoring shows no degradation
- Alert system properly notifies of security events
- All new monitoring tests pass consistently