Observability & Monitoring User Guide
Observability Overview
The AI Security Gateway provides comprehensive observability and monitoring capabilities through enhanced metrics and distributed tracing. This guide explains how to configure, access, and leverage these features to monitor system health and performance, and to troubleshoot issues.
Enhanced Metrics
Metrics Overview
Enhanced metrics provide detailed insights into system performance, including request durations, error rates, database query performance, proxy metrics, and policy evaluation statistics. All metrics are available via REST API and Prometheus-compatible endpoints.
Metrics Configuration
Enhanced metrics are enabled by default and require no additional configuration. They automatically track:
- Request Metrics: Total requests, errors, slow requests, duration percentiles (p50, p95, p99)
- Database Metrics: Total queries, slow queries, duration percentiles
- Error Metrics: Breakdown by endpoint and status code
- Proxy Metrics: Per-proxy request counts, success/error rates, duration
- Policy Metrics: Total evaluations, matches, misses, hit rate, per-policy details
- System Metrics: CPU, memory, goroutines
- Connection Pool Metrics: Utilization, wait counts, connection states
- Cache Metrics: Hits, misses, hit rate
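Several of the reported fields are simple derivations of the raw counters. As an illustrative sketch (not the gateway's actual implementation), the cache hit rate and overall error rate could be computed like this:

```go
package main

import "fmt"

// hitRate returns the cache hit rate as a percentage of all lookups.
// Returns 0 when there have been no lookups, to avoid dividing by zero.
func hitRate(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total) * 100
}

// errorRate returns errors as a percentage of total requests.
func errorRate(errors, requests uint64) float64 {
	if requests == 0 {
		return 0
	}
	return float64(errors) / float64(requests) * 100
}

func main() {
	// Values taken from the example metrics response later in this guide.
	fmt.Printf("cache hit rate: %.1f%%\n", hitRate(1250, 250))
	fmt.Printf("error rate: %.2f%%\n", errorRate(23, 1250))
}
```

With the example values (1250 hits, 250 misses), the cache hit rate comes out at 83.3%, matching the `hit_rate` field in the sample response.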
Accessing Metrics
1. REST API Endpoint
Endpoint: GET /api/v1/metrics
Example Request:
curl -X GET http://localhost:8080/api/v1/metrics \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response Structure:
{
  "success": true,
  "data": {
    "request_metrics": {
      "total_requests": 1250,
      "total_errors": 23,
      "slow_requests": 5,
      "duration_p50": "45ms",
      "duration_p95": "120ms",
      "duration_p99": "250ms"
    },
    "database_metrics": {
      "total_queries": 3450,
      "slow_queries": 12,
      "duration_p50": "2ms",
      "duration_p95": "8ms",
      "duration_p99": "15ms"
    },
    "error_metrics": {
      "/api/v1/proxies": {
        "400": 5,
        "404": 3,
        "500": 2
      }
    },
    "proxy_metrics": {
      "1": {
        "proxy_id": 1,
        "proxy_name": "MCP Server 1",
        "proxy_type": "mcp",
        "requests_total": 450,
        "requests_success": 435,
        "requests_error": 15,
        "request_duration": "125ms",
        "last_request_time": "2025-01-15T10:30:00Z"
      }
    },
    "policy_metrics": {
      "total_evaluations": 12500,
      "matches": 125,
      "misses": 12375,
      "average_duration": "2ms",
      "hit_rate_percent": 1.0,
      "policy_details": {
        "malicious-prompt-detection": {
          "policy_name": "malicious-prompt-detection",
          "evaluations": 5000,
          "matches": 50,
          "average_duration": "1.5ms"
        }
      }
    },
    "system_metrics": {
      "cpu_usage_percent": 25.5,
      "memory_usage_mb": 512,
      "goroutine_count": 45
    },
    "connection_pool_metrics": {
      "open_connections": 10,
      "in_use": 3,
      "idle": 7,
      "wait_count": 0,
      "utilization_percent": 30.0
    },
    "cache_metrics": {
      "hits": 1250,
      "misses": 250,
      "hit_rate": 83.3
    }
  }
}

2. Prometheus Endpoint
Endpoint: GET /api/v1/metrics/prometheus
Example Request:
curl -X GET http://localhost:8080/api/v1/metrics/prometheus

Response Format (Prometheus text format):
# HELP gateway_requests_total Total number of HTTP requests
# TYPE gateway_requests_total counter
gateway_requests_total 1250
# HELP gateway_request_duration_seconds Request duration in seconds
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{le="0.005"} 100
gateway_request_duration_seconds_bucket{le="0.01"} 500
gateway_request_duration_seconds_bucket{le="0.05"} 1000
gateway_request_duration_seconds_bucket{le="0.1"} 1200
gateway_request_duration_seconds_bucket{le="+Inf"} 1250
gateway_request_duration_seconds_sum 45.2
gateway_request_duration_seconds_count 1250
# HELP gateway_errors_total Total number of errors by endpoint and status
# TYPE gateway_errors_total counter
gateway_errors_total{endpoint="/api/v1/proxies",status="400"} 5
gateway_errors_total{endpoint="/api/v1/proxies",status="500"} 2
# HELP gateway_proxy_requests_total Total proxy requests by proxy
# TYPE gateway_proxy_requests_total counter
gateway_proxy_requests_total{proxy_id="1",proxy_name="MCP Server 1",proxy_type="mcp"} 450
# HELP gateway_policy_evaluations_total Total policy evaluations
# TYPE gateway_policy_evaluations_total counter
gateway_policy_evaluations_total 12500
gateway_policy_evaluations_matches_total 125
gateway_policy_evaluations_misses_total 12375
# HELP gateway_db_queries_total Total database queries
# TYPE gateway_db_queries_total counter
gateway_db_queries_total 3450

3. Connection Pool Health Endpoint
Endpoint: GET /api/v1/system/db/pool/health
Example Request:
curl -X GET http://localhost:8080/api/v1/system/db/pool/health \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:
{
  "success": true,
  "data": {
    "status": "healthy",
    "utilization_percent": 30.0,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0
  }
}

Status Values:
- healthy - Normal operation (< 80% utilization, no waits)
- degraded - Elevated utilization (80-90%) or occasional waits
- critical - High utilization (> 90%) or excessive waits
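The thresholds above can be expressed as a small classifier. The sketch below illustrates the documented rules; it is not the gateway's actual code, and the wait-count cutoff for "excessive" is an assumed value:

```go
package main

import "fmt"

// poolStatus maps utilization and waits to the documented health states:
// healthy (< 80% utilization, no waits), degraded (80-90% or occasional
// waits), critical (> 90% or excessive waits).
func poolStatus(utilizationPercent float64, waitCount int64) string {
	const excessiveWaits = 100 // assumed cutoff; the real threshold may differ
	switch {
	case utilizationPercent > 90 || waitCount > excessiveWaits:
		return "critical"
	case utilizationPercent >= 80 || waitCount > 0:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(poolStatus(30.0, 0)) // matches the example response: healthy
	fmt.Println(poolStatus(85.0, 0))
	fmt.Println(poolStatus(95.0, 0))
}
```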
4. Connection Pool Stats Endpoint
Endpoint: GET /api/v1/system/db/pool/stats
Example Request:
curl -X GET http://localhost:8080/api/v1/system/db/pool/stats \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:
{
  "success": true,
  "data": {
    "max_open_connections": 25,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0,
    "wait_duration_ns": 0,
    "max_idle_closed": 0,
    "max_lifetime_closed": 0,
    "utilization_percent": 30.0
  }
}

Metrics Dashboard Integration
Grafana Dashboard Setup
Configure Prometheus Data Source:
- Add Prometheus as a data source in Grafana
- URL: your Prometheus server (for example http://prometheus:9090); Grafana queries Prometheus, which in turn scrapes the gateway's /api/v1/metrics/prometheus endpoint (see the scrape configuration later in this guide)
Example Queries:
Request Rate:
rate(gateway_requests_total[5m])

Error Rate:
rate(gateway_errors_total[5m])

Request Duration (p95):
histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

Proxy Request Rate:
rate(gateway_proxy_requests_total[5m])

Policy Hit Rate:
rate(gateway_policy_evaluations_matches_total[5m]) / rate(gateway_policy_evaluations_total[5m]) * 100

Database Query Duration (p99):
histogram_quantile(0.99, rate(gateway_db_query_duration_seconds_bucket[5m]))

Connection Pool Utilization:
gateway_db_pool_utilization_percent

Recommended Dashboard Panels:
- Request rate (requests/second)
- Error rate by endpoint
- Request duration percentiles (p50, p95, p99)
- Proxy request rates by proxy
- Policy evaluation metrics
- Database query performance
- Connection pool utilization
- System resource usage (CPU, memory, goroutines)
Using Metrics for Monitoring
Alerting Rules (Prometheus)
Example alerting rules you can configure in Prometheus:
groups:
  - name: gateway_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(gateway_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/second"
      # Slow requests
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests detected"
          description: "95th percentile request duration is {{ $value }}s"
      # Connection pool exhaustion
      - alert: ConnectionPoolExhaustion
        expr: gateway_db_pool_utilization_percent > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion"
          description: "Pool utilization is {{ $value }}%"
      # High policy evaluation time
      - alert: SlowPolicyEvaluation
        expr: gateway_policy_evaluation_duration_seconds > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow policy evaluation"
          description: "Policy evaluation taking {{ $value }}s"

Distributed Tracing
Tracing Overview
Distributed tracing provides end-to-end visibility into request flows across handlers, services, and repositories. This helps debug issues, understand performance bottlenecks, and trace request paths through the system.
Tracing Configuration
Tracing is disabled by default and must be explicitly enabled via environment variables.
Environment Variables
| Variable | Description | Default | Example |
|---|---|---|---|
| TRACING_ENABLED | Enable/disable tracing | false | true |
| TRACING_SERVICE_NAME | Service name for traces | ai-security-gateway | ai-security-gateway |
| TRACING_ENVIRONMENT | Environment name | development | production |
| TRACING_JAEGER_URL | Jaeger collector endpoint | - | http://localhost:14268/api/traces |
| TRACING_OTLP_ENDPOINT | OTLP HTTP endpoint | - | http://localhost:4318 |
| TRACING_SAMPLE_RATE | Sampling rate (0.0-1.0) | 1.0 | 0.1 (10% sampling) |
Setup Instructions
Option 1: Jaeger (Quick Start)
Start Jaeger:
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 14268:14268 \
  jaegertracing/all-in-one:latest

Configure Gateway:
export TRACING_ENABLED=true
export TRACING_JAEGER_URL=http://localhost:14268/api/traces
export TRACING_SERVICE_NAME=ai-security-gateway
export TRACING_ENVIRONMENT=development

View Traces:
- Open Jaeger UI: http://localhost:16686
- Select service: ai-security-gateway
- Search for traces
Option 2: OTLP (OpenTelemetry Protocol)
Start OTLP Collector:
docker run -d \
  --name otel-collector \
  -p 4318:4318 \
  -v /path/to/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
  otel/opentelemetry-collector:latest \
  --config=/etc/otel-collector-config.yaml

Configure Gateway:
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SERVICE_NAME=ai-security-gateway
export TRACING_ENVIRONMENT=production

OTLP Collector Configuration (otel-collector-config.yaml):

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  jaeger:
    endpoint: jaeger:14250
  logging:
    loglevel: debug
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger, logging]
How Tracing Works
Automatic Tracing
All HTTP requests are automatically traced via middleware. No additional code is required for basic request tracing.
What's Captured:
- Request method, URL, path
- Request headers (user agent, remote address)
- Response status code
- Response size
- Request duration
- Errors (if any)
Trace Context Propagation
Trace context is automatically propagated through:
- HTTP handlers → Services → Repositories
- All function calls that accept context.Context
- Database queries
- Policy evaluations
- Proxy requests
Trace Structure
A typical trace looks like this:
http.request (root span)
├── service.MultiProxyService.ListProxyConfigs
│   └── repo.ProxyConfigRepository.List
│       └── db.query (SELECT * FROM proxy_configs...)
├── service.DashboardService.GetPolicies
│   └── repo.PolicyRepository.GetActiveStatusBatch
│       └── db.query (SELECT name, active FROM policies...)
└── service.PolicyAssignmentService.GetEnabledAssignments
    └── repo.PolicyAssignmentRepository.GetByProxyID
        └── db.query (SELECT * FROM policy_assignments...)

Viewing Traces
Jaeger UI
Access Jaeger UI: http://localhost:16686
Search for Traces:
- Select service: ai-security-gateway
- Choose time range
- Optionally filter by operation, tags, or duration
View Trace Details:
- Click on a trace to see the full span tree
- View span attributes (request details, errors, etc.)
- See timing breakdown for each operation
Example Trace View:
Trace: abc123def456
Duration: 45ms

[http.request] 45ms
├─ [service.MultiProxyService.ListProxyConfigs] 30ms
│  └─ [repo.ProxyConfigRepository.List] 25ms
│     └─ [db.query] 20ms
└─ [service.DashboardService.GetPolicies] 10ms
   └─ [repo.PolicyRepository.GetActiveStatusBatch] 8ms
Trace Attributes
Each span includes relevant attributes:
HTTP Request Spans:
- http.method: HTTP method (GET, POST, etc.)
- http.url: Full request URL
- http.route: Route path
- http.status_code: Response status code
- http.response.size: Response size in bytes
- http.user_agent: Client user agent
- http.remote_addr: Client IP address
Service Spans:
- service.name: Service name
- service.method: Method name
Repository Spans:
- repository.name: Repository name
- repository.method: Method name
Database Spans:
- db.operation: Operation type (SELECT, INSERT, etc.)
- db.statement: SQL query (sanitized)
- db.sql.table: Table name
Policy Spans:
- policy.name: Policy name
- policy.matched: Whether policy matched
Proxy Spans:
- proxy.id: Proxy ID
- proxy.name: Proxy name
- proxy.type: Proxy type (mcp, llm)
Sampling Configuration
For high-traffic scenarios, configure sampling to reduce trace volume:
# Sample 10% of requests
export TRACING_SAMPLE_RATE=0.1
# Sample 50% of requests
export TRACING_SAMPLE_RATE=0.5
# Sample all requests (default)
export TRACING_SAMPLE_RATE=1.0

Sampling Strategy:
- Use 1.0 (100%) for development and low-traffic production
- Use 0.1 (10%) for high-traffic production
- Use 0.01 (1%) for very high-traffic scenarios
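Head-based sampling at these rates amounts to a deterministic per-trace decision, so all spans of one trace are kept or dropped together. The gateway delegates this to its tracing library; the sketch below illustrates the idea with a hash-based decision (function name and hashing scheme are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic head-based sampling decision:
// the same trace ID always yields the same answer, so every span in
// a given trace shares the same fate.
func shouldSample(traceID string, rate float64) bool {
	if rate >= 1.0 {
		return true
	}
	if rate <= 0 {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	// Keep the trace if its hash falls within the sampled fraction.
	return float64(h.Sum32())/float64(^uint32(0)) < rate
}

func main() {
	fmt.Println(shouldSample("abc123def456", 1.0)) // rate 1.0 keeps everything
	fmt.Println(shouldSample("abc123def456", 0.0)) // rate 0.0 keeps nothing
}
```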
Trace Context in Logs
Trace IDs and Span IDs are automatically included in structured logs when available. This allows correlating logs with traces.
Example Log Entry:
2025-01-15 10:30:00 INFO [api-server] Failed to list proxy configurations: database connection timeout
trace_id=abc123def456
span_id=789xyz012

Extracting Trace Context (for custom logging):
import "github.com/syphon1c/ai-security-gateway/internal/tracing"
traceID := tracing.TraceIDFromContext(ctx)
spanID := tracing.SpanIDFromContext(ctx)
logger.Info("Operation completed", "trace_id", traceID, "span_id", spanID)

Integration with Monitoring Systems
Prometheus + Grafana
1. Configure Prometheus
prometheus.yml:
scrape_configs:
  - job_name: 'ai-security-gateway'
    scrape_interval: 15s
    metrics_path: '/api/v1/metrics/prometheus'
    static_configs:
      - targets: ['localhost:8080']

2. Start Prometheus
docker run -d \
--name prometheus \
-p 9090:9090 \
-v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest

3. Create Grafana Dashboard
Import the following dashboard JSON or create panels manually:
Key Panels:
- Request rate over time
- Error rate by endpoint
- Request duration percentiles
- Proxy metrics by proxy
- Policy evaluation metrics
- Database query performance
- Connection pool health
- System resource usage
Datadog Integration
1. Configure Datadog OTLP Exporter
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Configure Datadog Agent
The Datadog Agent, with OTLP ingest enabled on port 4318, will receive the traces and forward them to Datadog.
New Relic Integration
1. Configure New Relic OTLP Exporter
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=https://otlp.nr-data.net:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Add API Key Header
Modify internal/tracing/tracer.go to send the New Relic license key as the api-key header on OTLP exports.
Custom Observability Backend
Any OTLP-compatible backend can be used:
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://your-otel-backend:4318

Best Practices
Metrics
Monitor Key Metrics:
- Request rate and error rate
- Request duration percentiles (p95, p99)
- Database query performance
- Connection pool utilization
- Policy evaluation performance
Set Up Alerts:
- High error rates (> 1% of requests)
- Slow requests (p95 > 1s)
- Connection pool exhaustion (> 90% utilization)
- High database query times
Regular Review:
- Review metrics dashboard daily
- Investigate error spikes
- Monitor proxy performance trends
- Track policy effectiveness
Tracing
Sampling Strategy:
- Use 100% sampling in development
- Use 10-50% sampling in production
- Adjust based on traffic volume
Trace Analysis:
- Focus on slow traces (> 1s)
- Investigate traces with errors
- Compare trace durations over time
- Identify bottlenecks in span trees
Performance Optimization:
- Use trace data to identify slow operations
- Optimize database queries that appear frequently
- Review service call patterns
- Identify N+1 query patterns
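An N+1 pattern shows up in a trace as many near-identical db.query spans under a single request. A sketch of how you might flag this when post-processing exported spans — the Span struct here is a minimal hypothetical stand-in, not the exporter's real data model:

```go
package main

import "fmt"

// Span is a minimal stand-in for an exported trace span.
type Span struct {
	Name      string
	Statement string // the sanitized db.statement attribute
}

// findNPlusOne returns statements that repeat more than threshold
// times within one trace — the signature of an N+1 query pattern.
func findNPlusOne(spans []Span, threshold int) []string {
	counts := map[string]int{}
	for _, s := range spans {
		if s.Name == "db.query" {
			counts[s.Statement]++
		}
	}
	var suspects []string
	for stmt, n := range counts {
		if n > threshold {
			suspects = append(suspects, stmt)
		}
	}
	return suspects
}

func main() {
	trace := []Span{
		{"db.query", "SELECT * FROM proxy_configs"},
		{"db.query", "SELECT * FROM policies WHERE proxy_id = ?"},
		{"db.query", "SELECT * FROM policies WHERE proxy_id = ?"},
		{"db.query", "SELECT * FROM policies WHERE proxy_id = ?"},
	}
	// Flags the repeated per-proxy policy lookup as a batching candidate.
	fmt.Println(findNPlusOne(trace, 2))
}
```

A flagged statement is usually a candidate for batching (one IN (...) query) or caching.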
Security Considerations
Sensitive Data:
- Traces may contain request/response data
- Ensure trace data is stored securely
- Consider data retention policies
- Sanitize sensitive information in spans
Access Control:
- Restrict access to metrics endpoints in production
- Use authentication for Prometheus endpoint
- Limit trace export to authorized backends
Troubleshooting
Metrics Not Appearing
Problem: Metrics endpoint returns empty or no data.
Solutions:
- Verify metrics middleware is registered (should be automatic)
- Check that requests are being made (metrics are request-driven)
- Ensure sufficient time has passed for metrics to accumulate
- Check logs for metrics-related errors
Prometheus Scraping Fails
Problem: Prometheus cannot scrape metrics endpoint.
Solutions:
- Verify endpoint is accessible: curl http://localhost:8080/api/v1/metrics/prometheus
- Check Prometheus configuration (correct URL and path)
- Verify network connectivity between Prometheus and gateway
- Check for authentication requirements
Traces Not Appearing in Jaeger
Problem: Traces are not showing up in Jaeger UI.
Solutions:
- Verify TRACING_ENABLED=true is set
- Check Jaeger URL is correct and accessible
- Verify Jaeger collector is running: docker ps | grep jaeger
- Check gateway logs for tracing initialization errors
- Verify sampling rate is not too low
- Wait a few seconds for traces to be exported (batched)
High Memory Usage from Tracing
Problem: Tracing causes high memory usage.
Solutions:
- Reduce sampling rate: export TRACING_SAMPLE_RATE=0.1
- Ensure exporter is running and consuming traces
- Check for span leaks (spans not being ended)
- Restart gateway if memory usage is excessive
Missing Trace Context
Problem: Trace context is not propagated through request chain.
Solutions:
- Ensure middleware is registered before other middleware
- Verify context.Context is passed through all function calls
- Check that child spans are created from parent context
- Review code to ensure context propagation is maintained
Connection Pool Alerts
Problem: Receiving connection pool exhaustion alerts.
Solutions:
- Check connection pool stats endpoint for details
- Review database query patterns for long-running queries
- Increase MaxOpenConns in database configuration if needed
- Optimize slow queries
- Check for connection leaks (connections not being closed)
Example Use Cases
Use Case 1: Debugging Slow API Endpoint
Scenario: /api/v1/proxies endpoint is slow.
Steps:
- Check metrics: GET /api/v1/metrics → look at request_metrics.duration_p95 for /api/v1/proxies
- View trace in Jaeger: search for traces with operation GET /api/v1/proxies
- Analyze span tree: identify which service/repository call is slow
- Review database queries: check db.query spans for slow queries
- Optimize: fix slow queries or add caching
Use Case 2: Monitoring Proxy Performance
Scenario: Monitor performance of specific proxy instances.
Steps:
- Query proxy metrics: GET /api/v1/metrics → proxy_metrics[proxy_id]
- Set up Grafana dashboard: create a panel for gateway_proxy_requests_total{proxy_id="1"}
- Configure alerts: alert when error rate > 5% or duration > 500ms
- Review traces: filter traces by the proxy.id attribute
Use Case 3: Policy Effectiveness Analysis
Scenario: Analyze which policies are most effective.
Steps:
- Query policy metrics: GET /api/v1/metrics → policy_metrics.policy_details
- Calculate hit rates: matches / evaluations * 100
- Identify top policies: sort by match count or hit rate
- Review policy traces: filter traces by the policy.name attribute
- Optimize: tune policies with low hit rates or high evaluation times
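The hit-rate calculation and ranking in the steps above can be sketched directly. This is an illustrative helper, not gateway code, and the second policy name is hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// PolicyStats mirrors one entry of policy_metrics.policy_details.
type PolicyStats struct {
	Name        string
	Evaluations uint64
	Matches     uint64
}

// HitRate applies the guide's formula: matches / evaluations * 100.
func (p PolicyStats) HitRate() float64 {
	if p.Evaluations == 0 {
		return 0
	}
	return float64(p.Matches) / float64(p.Evaluations) * 100
}

// rankByHitRate sorts policies from highest to lowest hit rate.
func rankByHitRate(policies []PolicyStats) []PolicyStats {
	sort.Slice(policies, func(i, j int) bool {
		return policies[i].HitRate() > policies[j].HitRate()
	})
	return policies
}

func main() {
	ranked := rankByHitRate([]PolicyStats{
		{"malicious-prompt-detection", 5000, 50}, // from the example response: 1.0%
		{"pii-redaction", 5000, 250},             // hypothetical policy: 5.0%
	})
	for _, p := range ranked {
		fmt.Printf("%s: %.1f%%\n", p.Name, p.HitRate())
	}
}
```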
Use Case 4: Database Performance Tuning
Scenario: Optimize database query performance.
Steps:
- Check database metrics: GET /api/v1/metrics → database_metrics
- Identify slow queries: review traces with db.query spans > 100ms
- Analyze query patterns: look for N+1 query patterns in traces
- Optimize: add indexes, batch queries, or add caching
- Monitor improvements: track database_metrics.duration_p95 over time
Quick Reference
Environment Variables
# Metrics (always enabled, no config needed)
# Tracing
export TRACING_ENABLED=true
export TRACING_SERVICE_NAME=ai-security-gateway
export TRACING_ENVIRONMENT=production
export TRACING_JAEGER_URL=http://localhost:14268/api/traces
# OR
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SAMPLE_RATE=0.1

API Endpoints
| Endpoint | Method | Description | Auth Required |
|---|---|---|---|
| /api/v1/metrics | GET | Get all metrics (JSON) | Yes |
| /api/v1/metrics/prometheus | GET | Prometheus metrics | No |
| /api/v1/system/db/pool/health | GET | Connection pool health | Yes |
| /api/v1/system/db/pool/stats | GET | Connection pool stats | Yes |
Docker Compose Example
version: '3.8'
services:
  gateway:
    image: ai-security-gateway:latest
    ports:
      - "8080:8080"
    environment:
      - TRACING_ENABLED=true
      - TRACING_JAEGER_URL=http://jaeger:14268/api/traces
      - TRACING_SERVICE_NAME=ai-security-gateway
      - TRACING_ENVIRONMENT=production
      - TRACING_SAMPLE_RATE=0.1
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # Collector
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

Support
For additional help:
- Review docs/tracing-guide.md for detailed tracing documentation
- Check gateway logs for errors
- Review metrics endpoint responses for system status
- Consult OpenTelemetry documentation for advanced tracing configuration