
Observability & Monitoring User Guide

Observability Overview

The AI Security Gateway provides comprehensive observability and monitoring capabilities through enhanced metrics and distributed tracing. This guide explains how to configure, access, and leverage these features to monitor system health, performance, and troubleshoot issues.

Observability Table of Contents

  1. Enhanced Metrics
  2. Distributed Tracing
  3. Integration with Monitoring Systems
  4. Best Practices
  5. Troubleshooting

Enhanced Metrics

Metrics Overview

Enhanced metrics provide detailed insights into system performance, including request durations, error rates, database query performance, proxy metrics, and policy evaluation statistics. All metrics are available via REST API and Prometheus-compatible endpoints.

Metrics Configuration

Enhanced metrics are enabled by default and require no additional configuration. They automatically track:

  • Request Metrics: Total requests, errors, slow requests, duration percentiles (p50, p95, p99)
  • Database Metrics: Total queries, slow queries, duration percentiles
  • Error Metrics: Breakdown by endpoint and status code
  • Proxy Metrics: Per-proxy request counts, success/error rates, duration
  • Policy Metrics: Total evaluations, matches, misses, hit rate, per-policy details
  • System Metrics: CPU, memory, goroutines
  • Connection Pool Metrics: Utilization, wait counts, connection states
  • Cache Metrics: Hits, misses, hit rate

Accessing Metrics

1. REST API Endpoint

Endpoint: GET /api/v1/metrics

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/metrics \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response Structure:

json
{
  "success": true,
  "data": {
    "request_metrics": {
      "total_requests": 1250,
      "total_errors": 23,
      "slow_requests": 5,
      "duration_p50": "45ms",
      "duration_p95": "120ms",
      "duration_p99": "250ms"
    },
    "database_metrics": {
      "total_queries": 3450,
      "slow_queries": 12,
      "duration_p50": "2ms",
      "duration_p95": "8ms",
      "duration_p99": "15ms"
    },
    "error_metrics": {
      "/api/v1/proxies": {
        "400": 5,
        "404": 3,
        "500": 2
      }
    },
    "proxy_metrics": {
      "1": {
        "proxy_id": 1,
        "proxy_name": "MCP Server 1",
        "proxy_type": "mcp",
        "requests_total": 450,
        "requests_success": 435,
        "requests_error": 15,
        "request_duration": "125ms",
        "last_request_time": "2025-01-15T10:30:00Z"
      }
    },
    "policy_metrics": {
      "total_evaluations": 12500,
      "matches": 125,
      "misses": 12375,
      "average_duration": "2ms",
      "hit_rate_percent": 1.0,
      "policy_details": {
        "malicious-prompt-detection": {
          "policy_name": "malicious-prompt-detection",
          "evaluations": 5000,
          "matches": 50,
          "average_duration": "1.5ms"
        }
      }
    },
    "system_metrics": {
      "cpu_usage_percent": 25.5,
      "memory_usage_mb": 512,
      "goroutine_count": 45
    },
    "connection_pool_metrics": {
      "open_connections": 10,
      "in_use": 3,
      "idle": 7,
      "wait_count": 0,
      "utilization_percent": 30.0
    },
    "cache_metrics": {
      "hits": 1250,
      "misses": 250,
      "hit_rate": 83.3
    }
  }
}
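
The JSON response above can also be consumed programmatically, for example by a custom health checker. A minimal sketch, modeling only the `request_metrics` portion of the sample response (field names are taken from the example above, not from a published schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsResponse mirrors the sample /api/v1/metrics response; only
// the fields used here are declared, extra fields are ignored.
type metricsResponse struct {
	Success bool `json:"success"`
	Data    struct {
		RequestMetrics struct {
			TotalRequests int `json:"total_requests"`
			TotalErrors   int `json:"total_errors"`
		} `json:"request_metrics"`
	} `json:"data"`
}

// errorRatePercent computes total_errors / total_requests as a percentage.
func errorRatePercent(raw []byte) (float64, error) {
	var m metricsResponse
	if err := json.Unmarshal(raw, &m); err != nil {
		return 0, err
	}
	if m.Data.RequestMetrics.TotalRequests == 0 {
		return 0, nil
	}
	return 100 * float64(m.Data.RequestMetrics.TotalErrors) /
		float64(m.Data.RequestMetrics.TotalRequests), nil
}

func main() {
	// Values from the sample response: 23 errors over 1250 requests.
	sample := []byte(`{"success":true,"data":{"request_metrics":{"total_requests":1250,"total_errors":23}}}`)
	rate, err := errorRatePercent(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("error rate: %.2f%%\n", rate) // error rate: 1.84%
}
```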

2. Prometheus Endpoint

Endpoint: GET /api/v1/metrics/prometheus

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/metrics/prometheus

Response Format (Prometheus text format):

# HELP gateway_requests_total Total number of HTTP requests
# TYPE gateway_requests_total counter
gateway_requests_total 1250

# HELP gateway_request_duration_seconds Request duration in seconds
# TYPE gateway_request_duration_seconds histogram
gateway_request_duration_seconds_bucket{le="0.005"} 100
gateway_request_duration_seconds_bucket{le="0.01"} 500
gateway_request_duration_seconds_bucket{le="0.05"} 1000
gateway_request_duration_seconds_bucket{le="0.1"} 1200
gateway_request_duration_seconds_bucket{le="+Inf"} 1250
gateway_request_duration_seconds_sum 45.2
gateway_request_duration_seconds_count 1250

# HELP gateway_errors_total Total number of errors by endpoint and status
# TYPE gateway_errors_total counter
gateway_errors_total{endpoint="/api/v1/proxies",status="400"} 5
gateway_errors_total{endpoint="/api/v1/proxies",status="500"} 2

# HELP gateway_proxy_requests_total Total proxy requests by proxy
# TYPE gateway_proxy_requests_total counter
gateway_proxy_requests_total{proxy_id="1",proxy_name="MCP Server 1",proxy_type="mcp"} 450

# HELP gateway_policy_evaluations_total Total policy evaluations
# TYPE gateway_policy_evaluations_total counter
gateway_policy_evaluations_total 12500
gateway_policy_evaluations_matches_total 125
gateway_policy_evaluations_misses_total 12375

# HELP gateway_db_queries_total Total database queries
# TYPE gateway_db_queries_total counter
gateway_db_queries_total 3450
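
If a full Prometheus server isn't available, the text output can be parsed directly. A minimal sketch that extracts the value of an unlabeled counter from exposition-format text (it handles only the simple `name value` sample lines shown above, not labels, timestamps, or escaping):

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counterValue scans Prometheus text-format output for an unlabeled
// sample line of the form "name value" and returns the parsed value.
func counterValue(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip HELP/TYPE comments and blank lines
		}
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == name {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

func main() {
	sample := `# HELP gateway_requests_total Total number of HTTP requests
# TYPE gateway_requests_total counter
gateway_requests_total 1250
gateway_db_queries_total 3450`
	if v, ok := counterValue(sample, "gateway_requests_total"); ok {
		fmt.Println(v) // 1250
	}
}
```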

3. Connection Pool Health Endpoint

Endpoint: GET /api/v1/system/db/pool/health

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/system/db/pool/health \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:

json
{
  "success": true,
  "data": {
    "status": "healthy",
    "utilization_percent": 30.0,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0
  }
}

Status Values:

  • healthy - Normal operation (< 80% utilization, no waits)
  • degraded - Elevated utilization (80-90%) or occasional waits
  • critical - High utilization (> 90%) or excessive waits
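
The status ranges above can be expressed as a small classifier. This sketch simply encodes the documented thresholds; the gateway's exact boundary handling may differ, and the "excessive waits" cutoff below is an assumed value, not taken from the source:

```go
package main

import "fmt"

// poolStatus maps utilization and wait counts onto the documented
// statuses: healthy (<80%, no waits), degraded (80-90% or occasional
// waits), critical (>90% or excessive waits).
func poolStatus(utilizationPercent float64, waitCount int64) string {
	switch {
	case utilizationPercent > 90 || waitCount > 100: // 100 = assumed "excessive" threshold
		return "critical"
	case utilizationPercent >= 80 || waitCount > 0:
		return "degraded"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(poolStatus(30, 0)) // healthy
	fmt.Println(poolStatus(85, 0)) // degraded
	fmt.Println(poolStatus(95, 0)) // critical
}
```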

4. Connection Pool Stats Endpoint

Endpoint: GET /api/v1/system/db/pool/stats

Example Request:

bash
curl -X GET http://localhost:8080/api/v1/system/db/pool/stats \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Response:

json
{
  "success": true,
  "data": {
    "max_open_connections": 25,
    "open_connections": 10,
    "in_use": 3,
    "idle": 7,
    "wait_count": 0,
    "wait_duration_ns": 0,
    "max_idle_closed": 0,
    "max_lifetime_closed": 0,
    "utilization_percent": 30.0
  }
}

Metrics Dashboard Integration

Grafana Dashboard Setup

  1. Configure Prometheus Data Source:

    • Add Prometheus as a data source in Grafana
    • URL: http://your-gateway:8080/api/v1/metrics/prometheus
  2. Example Queries:

    Request Rate:

    promql
    rate(gateway_requests_total[5m])

    Error Rate:

    promql
    rate(gateway_errors_total[5m])

    Request Duration (p95):

    promql
    histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m]))

    Proxy Request Rate:

    promql
    rate(gateway_proxy_requests_total[5m])

    Policy Hit Rate:

    promql
    rate(gateway_policy_evaluations_matches_total[5m]) / rate(gateway_policy_evaluations_total[5m]) * 100

    Database Query Duration (p99):

    promql
    histogram_quantile(0.99, rate(gateway_db_query_duration_seconds_bucket[5m]))

    Connection Pool Utilization:

    promql
    gateway_db_pool_utilization_percent
  3. Recommended Dashboard Panels:

    • Request rate (requests/second)
    • Error rate by endpoint
    • Request duration percentiles (p50, p95, p99)
    • Proxy request rates by proxy
    • Policy evaluation metrics
    • Database query performance
    • Connection pool utilization
    • System resource usage (CPU, memory, goroutines)

Using Metrics for Monitoring

Alerting Rules (Prometheus)

Example alerting rules you can configure in Prometheus:

yaml
groups:
  - name: gateway_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(gateway_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/second"

      # Slow requests
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(gateway_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow requests detected"
          description: "95th percentile request duration is {{ $value }}s"

      # Connection pool exhaustion
      - alert: ConnectionPoolExhaustion
        expr: gateway_db_pool_utilization_percent > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion"
          description: "Pool utilization is {{ $value }}%"

      # High policy evaluation time
      - alert: SlowPolicyEvaluation
        expr: gateway_policy_evaluation_duration_seconds > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow policy evaluation"
          description: "Policy evaluation taking {{ $value }}s"

Distributed Tracing

Tracing Overview

Distributed tracing provides end-to-end visibility into request flows across handlers, services, and repositories. This helps debug issues, understand performance bottlenecks, and trace request paths through the system.

Tracing Configuration

Tracing is disabled by default and must be explicitly enabled via environment variables.

Environment Variables

  • TRACING_ENABLED: Enable/disable tracing. Default: false. Example: true
  • TRACING_SERVICE_NAME: Service name for traces. Default: ai-security-gateway
  • TRACING_ENVIRONMENT: Environment name. Default: development. Example: production
  • TRACING_JAEGER_URL: Jaeger collector endpoint. No default. Example: http://localhost:14268/api/traces
  • TRACING_OTLP_ENDPOINT: OTLP HTTP endpoint. No default. Example: http://localhost:4318
  • TRACING_SAMPLE_RATE: Sampling rate (0.0-1.0). Default: 1.0. Example: 0.1 (10% sampling)

Setup Instructions

Option 1: Jaeger (Quick Start)

  1. Start Jaeger:

    bash
    docker run -d \
      --name jaeger \
      -p 16686:16686 \
      -p 14268:14268 \
      jaegertracing/all-in-one:latest
  2. Configure Gateway:

    bash
    export TRACING_ENABLED=true
    export TRACING_JAEGER_URL=http://localhost:14268/api/traces
    export TRACING_SERVICE_NAME=ai-security-gateway
    export TRACING_ENVIRONMENT=development
  3. View Traces: Open the Jaeger UI at http://localhost:16686 and select the ai-security-gateway service.

Option 2: OTLP (OpenTelemetry Protocol)

  1. Start OTLP Collector:

    bash
    docker run -d \
      --name otel-collector \
      -p 4318:4318 \
      -v /path/to/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
      otel/opentelemetry-collector:latest \
      --config=/etc/otel-collector-config.yaml
  2. Configure Gateway:

    bash
    export TRACING_ENABLED=true
    export TRACING_OTLP_ENDPOINT=http://localhost:4318
    export TRACING_SERVICE_NAME=ai-security-gateway
    export TRACING_ENVIRONMENT=production
  3. OTLP Collector Configuration (otel-collector-config.yaml):

    yaml
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true  # plain gRPC to the in-cluster Jaeger collector
      logging:
        loglevel: debug
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger, logging]

How Tracing Works

Automatic Tracing

All HTTP requests are automatically traced via middleware. No additional code is required for basic request tracing.

What's Captured:

  • Request method, URL, path
  • Request headers (user agent, remote address)
  • Response status code
  • Response size
  • Request duration
  • Errors (if any)

Trace Context Propagation

Trace context is automatically propagated through:

  • HTTP handlers → Services → Repositories
  • All function calls that accept context.Context
  • Database queries
  • Policy evaluations
  • Proxy requests

Trace Structure

A typical trace looks like this:

http.request (root span)
├── service.MultiProxyService.ListProxyConfigs
│   └── repo.ProxyConfigRepository.List
│       └── db.query (SELECT * FROM proxy_configs...)
├── service.DashboardService.GetPolicies
│   └── repo.PolicyRepository.GetActiveStatusBatch
│       └── db.query (SELECT name, active FROM policies...)
└── service.PolicyAssignmentService.GetEnabledAssignments
    └── repo.PolicyAssignmentRepository.GetByProxyID
        └── db.query (SELECT * FROM policy_assignments...)

Viewing Traces

Jaeger UI

  1. Access Jaeger UI: http://localhost:16686

  2. Search for Traces:

    • Select service: ai-security-gateway
    • Choose time range
    • Optionally filter by operation, tags, or duration
  3. View Trace Details:

    • Click on a trace to see the full span tree
    • View span attributes (request details, errors, etc.)
    • See timing breakdown for each operation
  4. Example Trace View:

    Trace: abc123def456
    Duration: 45ms
    
    [http.request] 45ms
      ├─ [service.MultiProxyService.ListProxyConfigs] 30ms
      │   └─ [repo.ProxyConfigRepository.List] 25ms
      │       └─ [db.query] 20ms
      └─ [service.DashboardService.GetPolicies] 10ms
          └─ [repo.PolicyRepository.GetActiveStatusBatch] 8ms

Trace Attributes

Each span includes relevant attributes:

HTTP Request Spans:

  • http.method: HTTP method (GET, POST, etc.)
  • http.url: Full request URL
  • http.route: Route path
  • http.status_code: Response status code
  • http.response.size: Response size in bytes
  • http.user_agent: Client user agent
  • http.remote_addr: Client IP address

Service Spans:

  • service.name: Service name
  • service.method: Method name

Repository Spans:

  • repository.name: Repository name
  • repository.method: Method name

Database Spans:

  • db.operation: Operation type (SELECT, INSERT, etc.)
  • db.statement: SQL query (sanitized)
  • db.sql.table: Table name

Policy Spans:

  • policy.name: Policy name
  • policy.matched: Whether policy matched

Proxy Spans:

  • proxy.id: Proxy ID
  • proxy.name: Proxy name
  • proxy.type: Proxy type (mcp, llm)

Sampling Configuration

For high-traffic scenarios, configure sampling to reduce trace volume:

bash
# Sample 10% of requests
export TRACING_SAMPLE_RATE=0.1

# Sample 50% of requests
export TRACING_SAMPLE_RATE=0.5

# Sample all requests (default)
export TRACING_SAMPLE_RATE=1.0

Sampling Strategy:

  • Use 1.0 (100%) for development and low-traffic production
  • Use 0.1 (10%) for high-traffic production
  • Use 0.01 (1%) for very high-traffic scenarios
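
Head-based sampling of this kind amounts to a per-trace keep/drop decision at the configured rate. The gateway delegates this to the OpenTelemetry sampler; as an illustration only, a deterministic variant hashes the trace ID so every participant in a distributed system makes the same decision for a given trace:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic sampling decision: the trace ID
// is hashed onto [0, 1) and compared against the rate, so the same
// trace ID always yields the same keep/drop answer.
func shouldSample(traceID string, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	h := fnv.New64a()
	h.Write([]byte(traceID))
	f := float64(h.Sum64()) / float64(^uint64(0)) // map 64-bit hash to [0, 1)
	return f < rate
}

func main() {
	fmt.Println(shouldSample("abc123def456", 1.0)) // true
	fmt.Println(shouldSample("abc123def456", 0.0)) // false
}
```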

Trace Context in Logs

Trace IDs and Span IDs are automatically included in structured logs when available. This allows correlating logs with traces.

Example Log Entry:

2025-01-15 10:30:00 ERROR [api-server] Failed to list proxy configurations: database connection timeout
  trace_id=abc123def456
  span_id=789xyz012

Extracting Trace Context (for custom logging):

go
import "github.com/syphon1c/ai-security-gateway/internal/tracing"

traceID := tracing.TraceIDFromContext(ctx)
spanID := tracing.SpanIDFromContext(ctx)
logger.Info("Operation completed", "trace_id", traceID, "span_id", spanID)

Integration with Monitoring Systems

Prometheus + Grafana

1. Configure Prometheus

prometheus.yml:

yaml
scrape_configs:
  - job_name: 'ai-security-gateway'
    scrape_interval: 15s
    metrics_path: '/api/v1/metrics/prometheus'
    static_configs:
      - targets: ['localhost:8080']

2. Start Prometheus

bash
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest

3. Create Grafana Dashboard

Create the following panels manually, or import them from a saved dashboard JSON:

Key Panels:

  • Request rate over time
  • Error rate by endpoint
  • Request duration percentiles
  • Proxy metrics by proxy
  • Policy evaluation metrics
  • Database query performance
  • Connection pool health
  • System resource usage

Datadog Integration

1. Configure Datadog OTLP Exporter

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Configure Datadog Agent

Point TRACING_OTLP_ENDPOINT at a local Datadog Agent with OTLP ingestion enabled (otlp_config in datadog.yaml). The agent receives traces on port 4318 and forwards them to Datadog.

New Relic Integration

1. Configure New Relic OTLP Exporter

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=https://otlp.nr-data.net:4318
export TRACING_SERVICE_NAME=ai-security-gateway

2. Add API Key Header

Modify internal/tracing/tracer.go to include the New Relic license key (api-key header) in the OTLP exporter headers.

Custom Observability Backend

Any OTLP-compatible backend can be used:

bash
export TRACING_ENABLED=true
export TRACING_OTLP_ENDPOINT=http://your-otel-backend:4318

Best Practices

Metrics

  1. Monitor Key Metrics:

    • Request rate and error rate
    • Request duration percentiles (p95, p99)
    • Database query performance
    • Connection pool utilization
    • Policy evaluation performance
  2. Set Up Alerts:

    • High error rates (> 1% of requests)
    • Slow requests (p95 > 1s)
    • Connection pool exhaustion (> 90% utilization)
    • High database query times
  3. Regular Review:

    • Review metrics dashboard daily
    • Investigate error spikes
    • Monitor proxy performance trends
    • Track policy effectiveness

Tracing

  1. Sampling Strategy:

    • Use 100% sampling in development
    • Use 10-50% sampling in production
    • Adjust based on traffic volume
  2. Trace Analysis:

    • Focus on slow traces (> 1s)
    • Investigate traces with errors
    • Compare trace durations over time
    • Identify bottlenecks in span trees
  3. Performance Optimization:

    • Use trace data to identify slow operations
    • Optimize database queries that appear frequently
    • Review service call patterns
    • Identify N+1 query patterns

Security Considerations

  1. Sensitive Data:

    • Traces may contain request/response data
    • Ensure trace data is stored securely
    • Consider data retention policies
    • Sanitize sensitive information in spans
  2. Access Control:

    • Restrict access to metrics endpoints in production
    • Use authentication for Prometheus endpoint
    • Limit trace export to authorized backends

Troubleshooting

Metrics Not Appearing

Problem: Metrics endpoint returns empty or no data.

Solutions:

  1. Verify metrics middleware is registered (should be automatic)
  2. Check that requests are being made (metrics are request-driven)
  3. Ensure sufficient time has passed for metrics to accumulate
  4. Check logs for metrics-related errors

Prometheus Scraping Fails

Problem: Prometheus cannot scrape metrics endpoint.

Solutions:

  1. Verify endpoint is accessible: curl http://localhost:8080/api/v1/metrics/prometheus
  2. Check Prometheus configuration (correct URL and path)
  3. Verify network connectivity between Prometheus and gateway
  4. Check for authentication requirements

Traces Not Appearing in Jaeger

Problem: Traces are not showing up in Jaeger UI.

Solutions:

  1. Verify TRACING_ENABLED=true is set
  2. Check Jaeger URL is correct and accessible
  3. Verify Jaeger collector is running: docker ps | grep jaeger
  4. Check gateway logs for tracing initialization errors
  5. Verify sampling rate is not too low
  6. Wait a few seconds for traces to be exported (batched)

High Memory Usage from Tracing

Problem: Tracing causes high memory usage.

Solutions:

  1. Reduce sampling rate: export TRACING_SAMPLE_RATE=0.1
  2. Ensure exporter is running and consuming traces
  3. Check for span leaks (spans not being ended)
  4. Restart gateway if memory usage is excessive

Missing Trace Context

Problem: Trace context is not propagated through request chain.

Solutions:

  1. Ensure middleware is registered before other middleware
  2. Verify context.Context is passed through all function calls
  3. Check that child spans are created from parent context
  4. Review code to ensure context propagation is maintained

Connection Pool Alerts

Problem: Receiving connection pool exhaustion alerts.

Solutions:

  1. Check connection pool stats endpoint for details
  2. Review database query patterns for long-running queries
  3. Increase MaxOpenConns in database configuration if needed
  4. Optimize slow queries
  5. Check for connection leaks (connections not being closed)

Example Use Cases

Use Case 1: Debugging Slow API Endpoint

Scenario: /api/v1/proxies endpoint is slow.

Steps:

  1. Check metrics: GET /api/v1/metrics → Look at request_metrics.duration_p95 for /api/v1/proxies
  2. View trace in Jaeger: Search for traces with operation GET /api/v1/proxies
  3. Analyze span tree: Identify which service/repository call is slow
  4. Review database queries: Check db.query spans for slow queries
  5. Optimize: Fix slow queries or add caching

Use Case 2: Monitoring Proxy Performance

Scenario: Monitor performance of specific proxy instances.

Steps:

  1. Query proxy metrics: GET /api/v1/metrics → proxy_metrics[proxy_id]
  2. Set up Grafana dashboard: Create panel for gateway_proxy_requests_total{proxy_id="1"}
  3. Configure alerts: Alert when error rate > 5% or duration > 500ms
  4. Review traces: Filter traces by proxy.id attribute

Use Case 3: Policy Effectiveness Analysis

Scenario: Analyze which policies are most effective.

Steps:

  1. Query policy metrics: GET /api/v1/metrics → policy_metrics.policy_details
  2. Calculate hit rates: matches / evaluations * 100
  3. Identify top policies: Sort by match count or hit rate
  4. Review policy traces: Filter traces by policy.name attribute
  5. Optimize: Tune policies with low hit rates or high evaluation times
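
Step 2's formula is straightforward to apply to each policy_details entry. A small helper, using the values from the sample metrics response earlier in this guide:

```go
package main

import "fmt"

// hitRatePercent computes matches / evaluations * 100, guarding
// against division by zero for policies that have never run.
func hitRatePercent(matches, evaluations int) float64 {
	if evaluations == 0 {
		return 0
	}
	return 100 * float64(matches) / float64(evaluations)
}

func main() {
	// Sample policy_metrics: 125 matches over 12,500 evaluations.
	fmt.Printf("%.1f%%\n", hitRatePercent(125, 12500)) // 1.0%
}
```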

Use Case 4: Database Performance Tuning

Scenario: Optimize database query performance.

Steps:

  1. Check database metrics: GET /api/v1/metrics → database_metrics
  2. Identify slow queries: Review traces with db.query spans > 100ms
  3. Analyze query patterns: Look for N+1 query patterns in traces
  4. Optimize: Add indexes, batch queries, or add caching
  5. Monitor improvements: Track database_metrics.duration_p95 over time

Quick Reference

Environment Variables

bash
# Metrics (always enabled, no config needed)

# Tracing
export TRACING_ENABLED=true
export TRACING_SERVICE_NAME=ai-security-gateway
export TRACING_ENVIRONMENT=production
export TRACING_JAEGER_URL=http://localhost:14268/api/traces
# OR
export TRACING_OTLP_ENDPOINT=http://localhost:4318
export TRACING_SAMPLE_RATE=0.1

API Endpoints

  • GET /api/v1/metrics: All metrics (JSON). Auth required: yes
  • GET /api/v1/metrics/prometheus: Prometheus metrics. Auth required: no
  • GET /api/v1/system/db/pool/health: Connection pool health. Auth required: yes
  • GET /api/v1/system/db/pool/stats: Connection pool stats. Auth required: yes

Docker Compose Example

yaml
version: '3.8'

services:
  gateway:
    image: ai-security-gateway:latest
    ports:
      - "8080:8080"
    environment:
      - TRACING_ENABLED=true
      - TRACING_JAEGER_URL=http://jaeger:14268/api/traces
      - TRACING_SERVICE_NAME=ai-security-gateway
      - TRACING_ENVIRONMENT=production
      - TRACING_SAMPLE_RATE=0.1

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # Collector

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

Support

For additional help:

  • Review docs/tracing-guide.md for detailed tracing documentation
  • Check gateway logs for errors
  • Review metrics endpoint responses for system status
  • Consult OpenTelemetry documentation for advanced tracing configuration