
Observability & Monitoring User Guide

Observability Overview

The AI Security Gateway provides comprehensive observability and monitoring capabilities through enhanced metrics and distributed tracing. This guide explains how to configure, access, and leverage these features to monitor system health, performance, and troubleshoot issues.

Observability Table of Contents

  1. Enhanced Metrics
  2. Distributed Tracing
  3. Integration with Monitoring Systems
  4. Best Practices
  5. Troubleshooting

Enhanced Metrics

Metrics Overview

Enhanced metrics provide detailed insights into system performance, including request durations, error rates, database query performance, proxy metrics, and policy evaluation statistics. All metrics are available via REST API and Prometheus-compatible endpoints.

Metrics Configuration

Enhanced metrics are enabled by default and require no additional configuration. They automatically track:

  • Request Metrics: Total requests, errors, slow requests, duration percentiles (p50, p95, p99)
  • Database Metrics: Total queries, slow queries, duration percentiles
  • Error Metrics: Breakdown by endpoint and status code
  • Proxy Metrics: Per-proxy request counts, success/error rates, duration
  • Policy Metrics: Total evaluations, matches, misses, hit rate, per-policy details
  • System Metrics: CPU, memory, goroutines
  • Connection Pool Metrics: Utilization, wait counts, connection states
  • Cache Metrics: Hits, misses, hit rate

Accessing Metrics

1. REST API Endpoint

Endpoint: GET /api/v1/metrics

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/metrics \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response Structure:

json
{
  "success": true,
  "data": {
    "request_metrics": {
      "total_requests": 1250,
      "total_errors": 23,
      "slow_requests": 5,
      "duration_p50": "45ms",
      "duration_p95": "120ms",
      "duration_p99": "250ms"
    },
    "database_metrics": {
      "total_queries": 3450,
      "slow_queries": 12,
      "duration_p50": "2ms",
      "duration_p95": "8ms",
      "duration_p99": "15ms"
    },
    "error_metrics": {
      "/api/v1/proxies": {
        "400": 5,
        "404": 3,
        "500": 2
      }
    },
    "proxy_metrics": {
      "1": {
        "proxy_id": 1,
        "proxy_name": "MCP Server 1",
        "proxy_type": "mcp",
        "requests_total": 450,
        "requests_success": 435,
        "requests_error": 15,
        "request_duration": "125ms",
        "last_request_time": "2025-01-15T10:30:00Z"
      }
    },
    "policy_metrics": {
      "total_evaluations": 12500,
      "matches": 125,
      "misses": 12375,
      "average_duration": "2ms",
      "hit_rate_percent": 1.0,
      "policy_details": {
        "malicious-prompt-detection": {
          "policy_name": "malicious-prompt-detection",
          "evaluations": 5000,
          "matches": 50,
          "average_duration": "1.5ms"
        }
      }
    },
    "system_metrics": {
      "cpu_usage_percent": 25.5,
      "memory_usage_mb": 512,
      "goroutine_count": 45
    },
    "connection_pool_metrics": {
      "open_connections": 10,
      "in_use": 3,
      "idle": 7,
      "wait_count": 0,
      "utilization_percent": 30.0
    },
    "cache_metrics": {
      "hits": 1250,
      "misses": 250,
      "hit_rate": 83.3
    }
  }
}
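
The JSON response above can also be consumed programmatically, for example by a custom health checker. A minimal sketch, modeling only the `request_metrics` portion of the sample response (field names are taken from the example above, not from a published schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsResponse mirrors the sample /api/v1/metrics response; only
// the fields used here are declared, extra fields are ignored.
type metricsResponse struct {
	Success bool `json:"success"`
	Data    struct {
		RequestMetrics struct {
			TotalRequests int `json:"total_requests"`
			TotalErrors   int `json:"total_errors"`
		} `json:"request_metrics"`
	} `json:"data"`
}

// errorRatePercent computes total_errors / total_requests as a percentage.
func errorRatePercent(raw []byte) (float64, error) {
	var m metricsResponse
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, err
	}
	if m.Data.RequestMetrics.TotalRequests == 0 {
		return 0, nil
	}
	return 100 * float64(m.Data.RequestMetrics.TotalErrors) /
		float64(m.Data.RequestMetrics.TotalRequests), nil
}

func main() {
	// Values from the sample response: 23 errors over 1250 requests.
	sample := []byte(`{"success":true,"data":{"request_metrics":{"total_requests":1250,"total_errors":23}}}`)
	rate, err := errorRatePercent(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("error rate: %.2f%%\n", rate) // error rate: 1.84%
}
```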

2. Prometheus Endpoint

Endpoint: GET /api/v1/metrics/prometheus

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/metrics/prometheus

Response Format (Prometheus text format):

# HELP gateway_requests_total Total number of HTTP requests
# TYPE gateway_requests_total counter
gateway_requests_total 1250

# HELP gateway_request_duration_seconds Request duration in seconds
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{le="0.005"} 100
gateway_request_duration_seconds_bucket{le="0.01"} 500
gateway_request_duration_seconds_bucket{le="0.05"} 1000
gateway_request_duration_seconds_bucket{le="0.1"} 1200
gateway_request_duration_seconds_bucket{le="+Inf"} 1250
gateway_request_duration_seconds_sum 45.2
gateway_request_duration_seconds_count 1250

# HELP gateway_errors_total Total number of errors by endpoint and status
# TYPE gateway_errors_total counter
gateway_errors_total{endpoint="/api/v1/proxies",status="400"} 5
gateway_errors_total{endpoint="/api/v1/proxies",status="500"} 2

# HELP gateway_proxy_requests_total Total proxy requests by proxy
# TYPE gateway_proxy_requests_total counter
gateway_proxy_requests_total{proxy_id="1",proxy_name="MCP Server 1",proxy_type="mcp"} 450

# HELP gateway_policy_evaluations_total Total policy evaluations
# TYPE gateway_policy_evaluations_total counter
gateway_policy_evaluations_total 12500
gateway_policy_evaluations_matches_total 125
gateway_policy_evaluations_misses_total 12375

# HELP gateway_db_queries_total Total database queries
# TYPE gateway_db_queries_total counter
gateway_db_queries_total 3450
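
If a full Prometheus server isn't available, the text output can be parsed directly. A minimal sketch that extracts the value of an unlabeled counter from exposition-format text (it handles only the simple `name value` sample lines shown above, not labels, timestamps, or escaping):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counterValue scans Prometheus text-format output for an unlabeled
// sample line of the form "name value" and returns the parsed value.
func counterValue(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip HELP/TYPE comments and blank lines
		}
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == name {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

func main() {
	sample := `# HELP gateway_requests_total Total number of HTTP requests
# TYPE gateway_requests_total counter
gateway_requests_total 1250
gateway_db_queries_total 3450`
	if v, ok := counterValue(sample, "gateway_requests_total"); ok {
		fmt.Println(v) // 1250
	}
}
```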

3. Connection Pool Health Endpoint

Endpoint: GET /api/v1/system/db/pool/health

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/system/db/pool/health \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:

json
{
  "success": true,
  "data": {
    "status": "healthy",
    "utilization_percent": 30.0,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0
  }
}

Status Values:

  • healthy - Normal operation (< 80% utilization, no waits)
  • degraded - Elevated utilization (80-90%) or occasional waits
  • critical - High utilization (> 90%) or excessive waits
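
The status ranges above can be expressed as a small classifier. This sketch simply encodes the documented thresholds; the gateway's exact boundary handling may differ, and the "excessive waits" cutoff below is an assumed value, not taken from the source:

```go
package main

import "fmt"

// poolStatus maps utilization and wait counts onto the documented
// statuses: healthy (<80%, no waits), degraded (80-90% or occasional
// waits), critical (>90% or excessive waits).
func poolStatus(utilizationPercent float64, waitCount int64) string {
	switch {
	case utilizationPercent > 90 || waitCount > 100: // 100 = assumed "excessive" threshold
		return "critical"
	case utilizationPercent >= 80 || waitCount > 0:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(poolStatus(30, 0)) // healthy
	fmt.Println(poolStatus(85, 0)) // degraded
	fmt.Println(poolStatus(95, 0)) // critical
}
```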

4. Connection Pool Stats Endpoint

Endpoint: GET /api/v1/system/db/pool/stats

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/system/db/pool/stats \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:

json
{
  "success": true,
  "data": {
    "max_open_connections": 25,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0,
    "wait_duration_ns": 0,
    "max_idle_closed": 0,
    "max_lifetime_closed": 0,
    "utilization_percent": 30.0
  }
}

Metrics Dashboard Integration

Grafana Dashboard Setup

  1. Configure Prometheus Data Source:

    • Add Prometheus as a data source in Grafana
    • URL: http://your-gateway:8080/api/v1/metrics/prometheus
  2. Example Queries:

    Request Rate:

    promql
    rate(gateway_requests_total[5m])

    Error Rate:

    promql
    rate(gateway_errors_total[5m])

    Request Duration (p95):

    promql
    histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

    Proxy Request Rate:

    promql
    rate(gateway_proxy_requests_total[5m])

    Policy Hit Rate:

    promql
    rate(gateway_policy_evaluations_matches_total[5m]) / rate(gateway_policy_evaluations_total[5m]) * 100

    Database Query Duration (p99):

    promql
    histogram_quantile(0.99, rate(gateway_db_query_duration_seconds_bucket[5m]))

    Connection Pool Utilization:

    promql
    gateway_db_pool_utilization_percent
  3. Recommended Dashboard Panels:

    • Request rate (requests/second)
    • Error rate by endpoint
    • Request duration percentiles (p50, p95, p99)
    • Proxy request rates by proxy
    • Policy evaluation metrics
    • Database query performance
    • Connection pool utilization
    • System resource usage (CPU, memory, goroutines)

Using Metrics for Monitoring

Alerting Rules (Prometheus)

Example alerting rules you can configure in Prometheus:

yaml
groups:
  - name: gateway_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(gateway_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/second"

      # Slow requests
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests detected"
          description: "95th percentile request duration is {{ $value }}s"

      # Connection pool exhaustion
      - alert: ConnectionPoolExhaustion
        expr: gateway_db_pool_utilization_percent > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion"
          description: "Pool utilization is {{ $value }}%"

      # High policy evaluation time
      - alert: SlowPolicyEvaluation
        expr: gateway_policy_evaluation_duration_seconds > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow policy evaluation"
          description: "Policy evaluation taking {{ $value }}s"

Distributed Tracing

Tracing Overview

Distributed tracing provides end-to-end visibility into request flows across handlers, services, and repositories. This helps debug issues, understand performance bottlenecks, and trace request paths through the system.

Tracing Configuration

Tracing is disabled by default and must be explicitly enabled via environment variables.

Environment Variables

  • TRACING_ENABLED: Enable/disable tracing. Default: false. Example: true
  • TRACING_SERVICE_NAME: Service name for traces. Default: ai-security-gateway
  • TRACING_ENVIRONMENT: Environment name. Default: development. Example: production
  • TRACING_JAEGER_URL: Jaeger collector endpoint. No default. Example: http://localhost:14268/api/traces
  • TRACING_OTLP_ENDPOINT: OTLP HTTP endpoint. No default. Example: http://localhost:4318
  • TRACING_SAMPLE_RATE: Sampling rate (0.0-1.0). Default: 1.0. Example: 0.1 (10% sampling)

Setup Instructions

Option 1: Jaeger (Quick Start)

  1. Start Jaeger:

    bash
    docker run -d \
      --name jaeger \
      -p 16686:16686 \
      -p 14268:14268 \
      jaegertracing/all-in-one:latest
  2. Configure Gateway:

    bash
    export TRACING_ENABLED=true
    export TRACING_JAEGER_URL=http://localhost:14268/api/traces
    export TRACING_SERVICE_NAME=ai-security-gateway
    export TRACING_ENVIRONMENT=development
  3. View Traces: Open the Jaeger UI at http://localhost:16686 and select the ai-security-gateway service.

Option 2: OTLP (OpenTelemetry Protocol)

  1. Start OTLP Collector:

    bash
    docker run -d \
      --name otel-collector \
      -p 4318:4318 \
      -v /path/to/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
      otel/opentelemetry-collector:latest \
      --config=/etc/otel-collector-config.yaml
  2. Configure Gateway:

    bash
    export TRACING_ENABLED=true
    export TRACING_OTLP_ENDPOINT=http://localhost:4318
    export TRACING_SERVICE_NAME=ai-security-gateway
    export TRACING_ENVIRONMENT=production
  3. OTLP Collector Configuration (otel-collector-config.yaml):

    yaml
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true  # plain gRPC to the in-cluster Jaeger collector
      logging:
        loglevel: debug
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger, logging]

How Tracing Works

Automatic Tracing

All HTTP requests are automatically traced via middleware. No additional code is required for basic request tracing.

What's Captured:

  • Request method, URL, path
  • Request headers (user agent, remote address)
  • Response status code
  • Response size
  • Request duration
  • Errors (if any)

Trace Context Propagation

Trace context is automatically propagated through:

  • HTTP handlers → Services → Repositories
  • All function calls that accept context.Context
  • Database queries
  • Policy evaluations
  • Proxy requests

Trace Structure

A typical trace looks like this:

http.request (root span)
├── service.MultiProxyService.ListProxyConfigs
│   └── repo.ProxyConfigRepository.List
│       └── db.query (SELECT * FROM proxy_configs...)
├── service.DashboardService.GetPolicies
│   └── repo.PolicyRepository.GetActiveStatusBatch
│       └── db.query (SELECT name, active FROM policies...)
└── service.PolicyAssignmentService.GetEnabledAssignments
    └── repo.PolicyAssignmentRepository.GetByProxyID
        └── db.query (SELECT * FROM policy_assignments...)

Viewing Traces

Jaeger UI

  1. Access Jaeger UI: http://localhost:16686

  2. Search for Traces:

    • Select service: ai-security-gateway
    • Choose time range
    • Optionally filter by operation, tags, or duration
  3. View Trace Details:

    • Click on a trace to see the full span tree
    • View span attributes (request details, errors, etc.)
    • See timing breakdown for each operation
  4. Example Trace View:

    Trace: abc123def456
    Duration: 45ms
    
    [http.request] 45ms
      ├─ [service.MultiProxyService.ListProxyConfigs] 30ms
      │   └─ [repo.ProxyConfigRepository.List] 25ms
      │       └─ [db.query] 20ms
      └─ [service.DashboardService.GetPolicies] 10ms
          └─ [repo.PolicyRepository.GetActiveStatusBatch] 8ms

Trace Attributes

Each span includes relevant attributes:

HTTP Request Spans:

  • http.method: HTTP method (GET, POST, etc.)
  • http.url: Full request URL
  • http.route: Route path
  • http.status_code: Response status code
  • http.response.size: Response size in bytes
  • http.user_agent: Client user agent
  • http.remote_addr: Client IP address

Service Spans:

  • service.name: Service name
  • service.method: Method name

Repository Spans:

  • repository.name: Repository name
  • repository.method: Method name

Database Spans:

  • db.operation: Operation type (SELECT, INSERT, etc.)
  • db.statement: SQL query (sanitized)
  • db.sql.table: Table name

Policy Spans:

  • policy.name: Policy name
  • policy.matched: Whether policy matched

Proxy Spans:

  • proxy.id: Proxy ID
  • proxy.name: Proxy name
  • proxy.type: Proxy type (mcp, llm)

Sampling Configuration

For high-traffic scenarios, configure sampling to reduce trace volume:

bash
# Sample 10% of requests
export TRACING_SAMPLE_RATE=0.1

# Sample 50% of requests
export TRACING_SAMPLE_RATE=0.5

# Sample all requests (default)
export TRACING_SAMPLE_RATE=1.0

Sampling Strategy:

  • Use 1.0 (100%) for development and low-traffic production
  • Use 0.1 (10%) for high-traffic production
  • Use 0.01 (1%) for very high-traffic scenarios
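
Head-based sampling of this kind amounts to a per-trace keep/drop decision at the configured rate. The gateway delegates this to the OpenTelemetry sampler; as an illustration only, a deterministic variant hashes the trace ID so every participant in a distributed system makes the same decision for a given trace:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic sampling decision: the trace ID
// is hashed onto [0, 1) and compared against the rate, so the same
// trace ID always yields the same keep/drop answer.
func shouldSample(traceID string, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	h := fnv.New64a()
	h.Write([]byte(traceID))
	f := float64(h.Sum64()) / float64(^uint64(0)) // map 64-bit hash to [0, 1)
	return f < rate
}

func main() {
	fmt.Println(shouldSample("abc123def456", 1.0)) // true
	fmt.Println(shouldSample("abc123def456", 0.0)) // false
}
```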

Trace Context in Logs

Trace IDs and Span IDs are automatically included in structured logs when available. This allows correlating logs with traces.

Example Log Entry:

2025-01-15 10:30:00 ERROR [api-server] Failed to list proxy configurations: database connection timeout
  trace_id=abc123def456
  span_id=789xyz012

Extracting Trace Context (for custom logging):

go
import "github.com/syphon1c/ai-security-gateway/internal/tracing"

traceID := tracing.TraceIDFromContext(ctx)
spanID := tracing.SpanIDFromContext(ctx)
logger.Info("Operation completed", "trace_id", traceID, "span_id", spanID)

Integration with Monitoring Systems

Prometheus + Grafana

1. Configure Prometheus

prometheus.yml:

yaml
scrape_configs:
  - job_name: 'ai-security-gateway'
    scrape_interval: 15s
    metrics_path: '/api/v1/metrics/prometheus'
    static_configs:
      - targets: ['localhost:8080']

2. Start Prometheus

bash
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest

3. Create Grafana Dashboard

Create the following panels manually, or import them from a saved dashboard JSON:

Key Panels:

  • Request rate over time
  • Error rate by endpoint
  • Request duration percentiles
  • Proxy metrics by proxy
  • Policy evaluation metrics
  • Database query performance
  • Connection pool health
  • System resource usage

Datadog Integration

1. Configure Datadog OTLP Exporter

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Configure Datadog Agent

Point TRACING_OTLP_ENDPOINT at a local Datadog Agent with OTLP ingestion enabled (otlp_config in datadog.yaml). The agent receives traces on port 4318 and forwards them to Datadog.

New Relic Integration

1. Configure New Relic OTLP Exporter

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=https://otlp.nr-data.net:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Add API Key Header

Modify internal/tracing/tracer.go to include the New Relic license key (api-key header) in the OTLP exporter headers.

Custom Observability Backend

Any OTLP-compatible backend can be used:

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://your-otel-backend:4318

Best Practices

Metrics

  1. Monitor Key Metrics:

    • Request rate and error rate
    • Request duration percentiles (p95, p99)
    • Database query performance
    • Connection pool utilization
    • Policy evaluation performance
  2. Set Up Alerts:

    • High error rates (> 1% of requests)
    • Slow requests (p95 > 1s)
    • Connection pool exhaustion (> 90% utilization)
    • High database query times
  3. Regular Review:

    • Review metrics dashboard daily
    • Investigate error spikes
    • Monitor proxy performance trends
    • Track policy effectiveness

Tracing

  1. Sampling Strategy:

    • Use 100% sampling in development
    • Use 10-50% sampling in production
    • Adjust based on traffic volume
  2. Trace Analysis:

    • Focus on slow traces (> 1s)
    • Investigate traces with errors
    • Compare trace durations over time
    • Identify bottlenecks in span trees
  3. Performance Optimization:

    • Use trace data to identify slow operations
    • Optimize database queries that appear frequently
    • Review service call patterns
    • Identify N+1 query patterns

Security Considerations

  1. Sensitive Data:

    • Traces may contain request/response data
    • Ensure trace data is stored securely
    • Consider data retention policies
    • Sanitize sensitive information in spans
  2. Access Control:

    • Restrict access to metrics endpoints in production
    • Use authentication for Prometheus endpoint
    • Limit trace export to authorized backends

Troubleshooting

Metrics Not Appearing

Problem: Metrics endpoint returns empty or no data.

Solutions:

  1. Verify metrics middleware is registered (should be automatic)
  2. Check that requests are being made (metrics are request-driven)
  3. Ensure sufficient time has passed for metrics to accumulate
  4. Check logs for metrics-related errors

Prometheus Scraping Fails

Problem: Prometheus cannot scrape metrics endpoint.

Solutions:

  1. Verify endpoint is accessible: curl http://localhost:8080/api/v1/metrics/prometheus
  2. Check Prometheus configuration (correct URL and path)
  3. Verify network connectivity between Prometheus and gateway
  4. Check for authentication requirements

Traces Not Appearing in Jaeger

Problem: Traces are not showing up in Jaeger UI.

Solutions:

  1. Verify TRACING_ENABLED=true is set
  2. Check Jaeger URL is correct and accessible
  3. Verify Jaeger collector is running: docker ps | grep jaeger
  4. Check gateway logs for tracing initialization errors
  5. Verify sampling rate is not too low
  6. Wait a few seconds for traces to be exported (batched)

High Memory Usage from Tracing

Problem: Tracing causes high memory usage.

Solutions:

  1. Reduce sampling rate: export TRACING_SAMPLE_RATE=0.1
  2. Ensure exporter is running and consuming traces
  3. Check for span leaks (spans not being ended)
  4. Restart gateway if memory usage is excessive

Missing Trace Context

Problem: Trace context is not propagated through request chain.

Solutions:

  1. Ensure middleware is registered before other middleware
  2. Verify context.Context is passed through all function calls
  3. Check that child spans are created from parent context
  4. Review code to ensure context propagation is maintained

Connection Pool Alerts

Problem: Receiving connection pool exhaustion alerts.

Solutions:

  1. Check connection pool stats endpoint for details
  2. Review database query patterns for long-running queries
  3. Increase MaxOpenConns in database configuration if needed
  4. Optimize slow queries
  5. Check for connection leaks (connections not being closed)

Example Use Cases

Use Case 1: Debugging Slow API Endpoint

Scenario: /api/v1/proxies endpoint is slow.

Steps:

  1. Check metrics: GET /api/v1/metrics → Look at request_metrics.duration_p95 for /api/v1/proxies
  2. View trace in Jaeger: Search for traces with operation GET /api/v1/proxies
  3. Analyze span tree: Identify which service/repository call is slow
  4. Review database queries: Check db.query spans for slow queries
  5. Optimize: Fix slow queries or add caching

Use Case 2: Monitoring Proxy Performance

Scenario: Monitor performance of specific proxy instances.

Steps:

  1. Query proxy metrics: GET /api/v1/metrics → proxy_metrics[proxy_id]
  2. Set up Grafana dashboard: Create panel for gateway_proxy_requests_total{proxy_id="1"}
  3. Configure alerts: Alert when error rate > 5% or duration > 500ms
  4. Review traces: Filter traces by proxy.id attribute

Use Case 3: Policy Effectiveness Analysis

Scenario: Analyze which policies are most effective.

Steps:

  1. Query policy metrics: GET /api/v1/metrics → policy_metrics.policy_details
  2. Calculate hit rates: matches / evaluations * 100
  3. Identify top policies: Sort by match count or hit rate
  4. Review policy traces: Filter traces by policy.name attribute
  5. Optimize: Tune policies with low hit rates or high evaluation times
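
Step 2's formula is straightforward to apply to each policy_details entry. A small helper, using the values from the sample metrics response earlier in this guide:

```go
package main

import "fmt"

// hitRatePercent computes matches / evaluations * 100, guarding
// against division by zero for policies that have never run.
func hitRatePercent(matches, evaluations int) float64 {
	if evaluations == 0 {
		return 0
	}
	return 100 * float64(matches) / float64(evaluations)
}

func main() {
	// Sample policy_metrics: 125 matches over 12,500 evaluations.
	fmt.Printf("%.1f%%\n", hitRatePercent(125, 12500)) // 1.0%
}
```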

Use Case 4: Database Performance Tuning

Scenario: Optimize database query performance.

Steps:

  1. Check database metrics: GET /api/v1/metrics → database_metrics
  2. Identify slow queries: Review traces with db.query spans > 100ms
  3. Analyze query patterns: Look for N+1 query patterns in traces
  4. Optimize: Add indexes, batch queries, or add caching
  5. Monitor improvements: Track database_metrics.duration_p95 over time

Quick Reference

Environment Variables

bash
# Metrics (always enabled, no config needed)

# Tracing
export TRACING_ENABLED=true
export TRACING_SERVICE_NAME=ai-security-gateway
export TRACING_ENVIRONMENT=production
export TRACING_JAEGER_URL=http://localhost:14268/api/traces
# OR
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SAMPLE_RATE=0.1

API Endpoints

  • GET /api/v1/metrics: All metrics (JSON). Auth required: yes
  • GET /api/v1/metrics/prometheus: Prometheus metrics. Auth required: no
  • GET /api/v1/system/db/pool/health: Connection pool health. Auth required: yes
  • GET /api/v1/system/db/pool/stats: Connection pool stats. Auth required: yes

Docker Compose Example

yaml
version: '3.8'

services:
  gateway:
    image: ai-security-gateway:latest
    ports:
      - "8080:8080"
    environment:
      - TRACING_ENABLED=true
      - TRACING_JAEGER_URL=http://jaeger:14268/api/traces
      - TRACING_SERVICE_NAME=ai-security-gateway
      - TRACING_ENVIRONMENT=production
      - TRACING_SAMPLE_RATE=0.1

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

Support

For additional help:

  • Review docs/tracing-guide.md for detailed tracing documentation
  • Check gateway logs for errors
  • Review metrics endpoint responses for system status
  • Consult OpenTelemetry documentation for advanced tracing configuration