Guardrails Evaluation

What Are Guardrails Evaluations?

Guardrails Evaluation is a security testing framework built into the AI Security Gateway that validates whether your AI/LLM guardrails are actually working. Think of it as automated penetration testing for your AI safety controls.

Modern AI deployments use guardrails (content filters, prompt injection detectors, output validators) to prevent misuse. But how do you know they're effective? Guardrails Evaluation answers this by running 71 built-in security test cases against your endpoints and scoring the results against the OWASP LLM Top 10 and the NIST AI Risk Management Framework.

Why This Matters

Research shows 43% of community MCP servers have exploitable vulnerabilities. The s1ngularity NPM/NX supply chain attack of 2025 demonstrated how AI agents can be weaponized. Regular guardrails testing ensures your defenses keep pace with evolving threats.

The Guardrails Problem It Solves

  • No visibility into whether your guardrails actually block attacks
  • Compliance gaps — need to demonstrate OWASP/NIST coverage
  • False confidence — guardrails may silently fail on novel attack patterns
  • Manual testing doesn't scale across multiple endpoints and attack categories
  • No baseline to measure guardrail improvement over time

Guardrails Evaluation Key Concepts

Evaluation Targets

An evaluation runs against a target — either an HTTP endpoint or a configured guardrail provider:

  • Endpoints: A saved API configuration pointing to the service you want to test. This can be any web API leveraging an LLM or chatbot, not just standard providers like OpenAI or Anthropic. You configure the URL, authentication, request format, and response handling.
  • Guardrail Providers: A configured third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already set up in the Guardrail Providers section. No endpoint configuration needed — the evaluation calls the provider's screening API directly with each test prompt.

Test Cases

Individual security tests, each with a specific attack prompt, expected result (block/allow/redact/alert), and compliance mappings. The Gateway ships with 71 built-in tests across 12 attack categories. You can also create custom tests.

Evaluations

An evaluation (or "scan") is a single run of selected test cases against a target (endpoint or guardrail provider). Evaluations run asynchronously and report progress in real time. Each produces risk scores and per-test results.

Attack Categories

| Category | Tests | Description |
|---|---|---|
| AI-Amplified Attacks | 4 | AI-amplified supply chain attacks, credential exfiltration, and reconnaissance |
| MCP Security & Tool Poisoning | 8 | MCP tool poisoning, command injection, and protocol exploitation |
| Bypass Techniques | 22 | Security flag manipulation, obfuscation, and encoding-based bypass techniques |
| Prompt Injection | 7 | Direct prompt injection, goal hijacking, and system prompt extraction |
| Data Exfiltration | 4 | PII extraction, credential theft, and proprietary data exfiltration |
| Multi-Turn Escalation | 3 | Crescendo, echo chamber, and many-shot in-context override attacks |
| Semantic & Structural Evasion | 5 | Skeleton key, roleplay, payload splitting, and multilingual evasion |
| Harmful Content & Toxicity | 5 | Requests to generate violent, self-harm, weapons, or other harmful content |
| Misinformation & Disinformation | 4 | Requests to generate fake news, misleading posts, or disinformation |
| PII & Personal Data Extraction | 3 | Attempts to extract personally identifiable information about real individuals |
| Resource Exhaustion & DoS | 3 | Prompts designed to cause infinite loops, excessive output, or resource exhaustion |
| Benign Controls | 8 | Legitimate requests that should NOT be blocked (false positive testing) |

Compliance Scoring

OWASP LLM Top 10 (2025): Each test maps to one or more OWASP items (LLM01 through LLM10). Your score shows coverage and pass rate across the OWASP framework.

NIST AI Risk Management Framework: Tests map to NIST AI RMF functions (GOVERN, MAP, MEASURE, MANAGE). Your score shows alignment with federal AI risk management requirements.
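The exact scoring formula isn't specified here; as one plausible illustration (every name and the result shape below are assumptions for the sketch, not the Gateway's schema), a framework score can be built from per-item pass rates over the items each test maps to:

```python
from collections import defaultdict

def framework_pass_rates(results):
    """Per-item pass rate across a compliance framework, given results of
    the hypothetical form {"passed": bool, "items": ["LLM01", ...]}."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        for item in r["items"]:           # a test may map to several items
            totals[item] += 1
            passed[item] += r["passed"]
    return {item: 100.0 * passed[item] / totals[item] for item in totals}

results = [
    {"passed": True,  "items": ["LLM01"]},
    {"passed": False, "items": ["LLM01", "LLM02"]},
    {"passed": True,  "items": ["LLM02"]},
]
print(framework_pass_rates(results))  # {'LLM01': 50.0, 'LLM02': 50.0}
```

Coverage (which items have any tests at all) and pass rate (how those tests fared) are distinct signals; a real score may weight them differently.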

Getting Started with Guardrails Testing

There are two ways to set up an evaluation target: configure an HTTP endpoint, or use an already-configured guardrail provider.

Option A: Configure an HTTP Endpoint

Use this when you want to test an LLM API, a chatbot, or any web service directly over HTTP.

Navigate to Guardrails Eval in the Security Tools section of the sidebar, then click the Endpoints tab.

Click Add Endpoint and configure:

  1. Name: A descriptive label (e.g., "Production OpenAI with Guardrails")
  2. URL: The base URL of your API (e.g., https://api.openai.com)
  3. Endpoint Path: The API path (default: /v1/chat/completions)
  4. HTTP Method: Usually POST
  5. Authentication: Choose Bearer token, API Key, Custom headers, or None
  6. Request Format:
    • Chat Completion — Standard OpenAI-compatible format (recommended for most LLMs)
    • Custom — Provide a JSON template with {prompt} placeholder for non-standard APIs
    • Guardrails — Simple {"prompt": "..."} format for dedicated guardrail services

Quick Setup with Curl Import

You can paste a curl command into the Import from Curl field and click Parse. The Gateway will automatically extract the URL, headers, authentication, and payload template.

Add Endpoint

Custom Payload Template Example

For APIs that don't follow the OpenAI format, use a custom template:

```json
{"data": "{prompt}", "options": {"model": "gpt-4", "max_tokens": 100}}
```

The {prompt} placeholder is replaced with each test case's prompt. The Gateway handles JSON escaping automatically — prompts with newlines, quotes, and special characters work correctly.
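Substitution has to be JSON-aware rather than plain string replacement. A minimal sketch of the idea (a hypothetical helper, not the Gateway's actual code), assuming the placeholder is a standalone quoted string value as required above:

```python
import json

def render_payload(template: str, prompt: str) -> str:
    """Replace the quoted {prompt} placeholder with a JSON-escaped prompt.

    json.dumps() emits the surrounding quotes and escapes newlines,
    quotes, and other special characters, so the result stays valid JSON.
    """
    return template.replace('"{prompt}"', json.dumps(prompt))

template = '{"data": "{prompt}", "options": {"model": "gpt-4", "max_tokens": 100}}'
payload = render_payload(template, 'Line one\nSay "hi"')

parsed = json.loads(payload)  # still valid JSON
print(parsed["data"])         # prompt round-trips intact, newline and quotes included
```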

Test the Connection

Click the Test button on your endpoint row. This sends a benign "Hello" prompt and reports:

  • HTTP status code
  • Response latency in milliseconds
  • Whether the endpoint is reachable

Option B: Use a Configured Guardrail Provider

Use this when you want to evaluate a third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already configured in the Guardrail Providers section.

No endpoint setup is required — the evaluation calls the provider's Check() API directly with each test prompt. This is the fastest way to measure a guardrail provider's detection effectiveness.

When to Use Provider-Based Evaluation

  • You want to benchmark a guardrail provider's detection rates before deploying it to production
  • You're comparing multiple providers against the same test suite
  • You want to test the provider in isolation without routing through a full LLM endpoint

Running an Evaluation

Click the Evaluations tab, then New Evaluation.

Run Evaluation

The wizard has three steps:

  1. Name & Target: Give your evaluation a name, then choose the target type:
    • Toggle Endpoint to select a configured HTTP endpoint
    • Toggle Guardrail Provider to select a configured provider (Groq, EnkryptAI, DynamoAI)
  2. Test Selection: Choose categories to include, or leave empty to run all enabled tests
  3. Configuration: Set concurrency (1-10 parallel tests) and delay between tests (100-2000ms)

Click Start Evaluation. Progress updates in real time via the progress bar.

Rate Limiting

If your target API has rate limits, set concurrency to 1-2 and increase the delay. Running too many tests in parallel may trigger rate limiting and cause test failures.
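As a rough sizing guide (an estimate, not a documented formula), steady-state request rate is bounded by concurrency divided by the per-test cycle time, i.e. response latency plus the configured delay:

```python
def approx_requests_per_second(concurrency: int, latency_ms: float, delay_ms: float) -> float:
    """Rough upper bound on test throughput: each worker completes one
    test per (latency + delay) milliseconds."""
    return concurrency * 1000.0 / (latency_ms + delay_ms)

# e.g. 2 workers, ~500 ms responses, 500 ms delay -> about 2 requests/sec
print(round(approx_requests_per_second(2, 500, 500), 2))
```

Compare the result against your target's rate limit before starting a run; if it exceeds the limit, lower concurrency or raise the delay.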

Endpoint vs Provider Evaluations

| Aspect | Endpoint Evaluation | Provider Evaluation |
|---|---|---|
| Target | Any HTTP API (LLM, chatbot, custom) | Configured guardrail provider |
| Setup | URL, auth, request format, response format | Just select the provider |
| What's tested | Full stack: network, auth, LLM, guardrails | Provider's screening API in isolation |
| Detection | Analyzes HTTP response for refusals, blocks, filters | Maps provider verdict directly (safe/violation) |
| Pass/fail logic | Matches expected result + expected detection type | Matches expected action only (block satisfies block/alert/redact) |
| Best for | End-to-end testing of production deployments | Benchmarking provider detection rates |
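The provider pass/fail rule above can be sketched as a small predicate (an illustration of the described behavior, not the Gateway's actual code): a provider only returns a block or allow verdict, so any blocking expectation (block, alert, redact) is satisfied by a block:

```python
# Expected results that mean "the guardrail should intervene".
BLOCKING_EXPECTATIONS = {"block", "alert", "redact"}

def provider_test_passes(expected: str, verdict: str) -> bool:
    """Provider evaluations compare actions only: a 'block' verdict
    satisfies block/alert/redact; benign controls expect 'allow'."""
    if expected in BLOCKING_EXPECTATIONS:
        return verdict == "block"
    return verdict == "allow"

print(provider_test_passes("redact", "block"))  # True: block satisfies redact
print(provider_test_passes("allow", "block"))   # False: benign control was blocked
```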

Review Results

Click View Results on a completed evaluation to see:

  • Risk Overview: Average risk score (0-100, lower is better), OWASP score, NIST score
  • Category Breakdown: Per-category pass rates and risk scores
  • Results Table: Individual test results with status, response time, detection type

Report

Understanding Guardrails Evaluation Results

Risk Score Interpretation

| Score Range | Rating | Meaning |
|---|---|---|
| 0-20 | Excellent | Guardrails are highly effective |
| 21-40 | Good | Most attacks blocked, some gaps |
| 41-60 | Moderate | Significant gaps in coverage |
| 61-80 | Poor | Many attacks pass through |
| 81-100 | Critical | Guardrails are largely ineffective |
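The bands above translate directly into a lookup (a sketch mirroring the published ranges, assuming inclusive upper bounds):

```python
def risk_rating(score: float) -> str:
    """Map a 0-100 risk score (lower is better) to its rating band."""
    if score <= 20:
        return "Excellent"
    if score <= 40:
        return "Good"
    if score <= 60:
        return "Moderate"
    if score <= 80:
        return "Poor"
    return "Critical"

print(risk_rating(23.5))  # "Good" -- e.g. the average_risk_score in the API example below
```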

Test Result Statuses

| Status | Meaning |
|---|---|
| Passed | Test result matched expected outcome (guardrail worked correctly) |
| Failed | Test result didn't match — guardrail missed or false positive |
| Error | Test couldn't execute (connection issue, timeout, payload error) |

Detection Types

When a guardrail blocks a request, the Gateway identifies how it was blocked:

Endpoint evaluations:

| Detection Type | Description |
|---|---|
| refusal_detected | LLM responded with a polite refusal (HTTP 200) |
| content_filter | Explicit content filter triggered (finish_reason) |
| guardrail_blocked | Guardrail API returned explicit block |
| content_blocked | Response contained block indicators |
| prompt_injection | Specific prompt injection detection |

Provider evaluations report the provider's own detection categories:

| Detection Type | Description |
|---|---|
| prompt_injection | Prompt injection or jailbreak attempt |
| toxicity | Toxic or harmful content |
| hate_speech | Hate speech or discrimination |
| harassment | Harassment or abusive language |
| violence | Violent content or threats |
| nsfw | Sexual or adult content |
| pii_privacy | PII or privacy violation |
| policy_violation | General policy violation |

Managing False Positives

Sometimes a guardrail correctly blocks a request, but the test was designed to expect "allow" (benign control tests), or vice versa. You can mark results as false positives to exclude them from scoring.

Marking a False Positive

  1. Click on a result row to expand it
  2. Click the False Positive toggle
  3. Enter notes explaining why (required)
  4. Click Save

The evaluation's risk scores recalculate immediately. False positives are treated as "passed" in all score calculations.
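The recalculation amounts to counting false-positive-marked results as passes. A minimal sketch (the result shape below is a hypothetical, not the Gateway's schema):

```python
def pass_rate(results: list[dict]) -> float:
    """Percentage of tests counted as passed, where a result marked
    false_positive is treated as passed regardless of its status."""
    if not results:
        return 0.0
    passed = sum(
        1 for r in results
        if r["status"] == "passed" or r.get("false_positive")
    )
    return 100.0 * passed / len(results)

results = [
    {"status": "passed"},
    {"status": "failed", "false_positive": True},  # benign control, FP-marked
    {"status": "failed"},
    {"status": "error"},
]
print(pass_rate(results))  # 50.0
```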

When to Use False Positives

  • Benign tests blocked by overly aggressive guardrails
  • Tests that fail due to endpoint-specific behavior (not a real vulnerability)
  • Results affected by temporary issues (rate limiting, timeouts)

Creating Custom Guardrails Test Cases

Click Test Cases tab, then Add Custom Test. Fill in:

  • Test ID: Unique identifier (e.g., "CUSTOM-001")
  • Name: Descriptive name
  • Category: Select from existing categories
  • Severity: Critical, High, Medium, or Low
  • Expected Result: Block, Allow, Redact, or Alert
  • Test Type: Single prompt or multi-turn conversation
  • Prompt: The attack or test prompt

Custom tests appear alongside built-in tests and are included in evaluations when their category is selected.

Guardrails Evaluation API Reference

All endpoints are under /api/v1/security/guardrails/ and require JWT authentication.

Start an Evaluation via API

Against an endpoint:

```bash
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly Scan",
    "endpoint_id": 1,
    "categories": ["prompt_injection", "data_exfiltration"],
    "concurrency": 3,
    "delay_ms": 200
  }' | jq
```

Against a guardrail provider:

```bash
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Provider Benchmark",
    "guardrail_provider_id": 2,
    "categories": ["prompt_injection", "harmful_content"],
    "concurrency": 5,
    "delay_ms": 200
  }' | jq
```

Provide either endpoint_id or guardrail_provider_id, not both.

Get Evaluation Dashboard

```bash
curl -s http://localhost:8080/api/v1/security/guardrails/evaluations/1/dashboard \
  -H "Authorization: Bearer $JWT_TOKEN" | jq
```

Response:

```json
{
  "success": true,
  "data": {
    "evaluation": { "id": 1, "name": "Nightly Scan", "status": "completed" },
    "risk_overview": {
      "average_risk_score": 23.5,
      "owasp_score": 78.2,
      "nist_score": 81.0
    },
    "category_breakdown": [
      { "category": "prompt_injection", "total": 5, "passed": 4, "failed": 1, "risk_score": 20.0 }
    ],
    "pass_rate": 86.0,
    "avg_response_ms": 245.3,
    "false_positive_count": 2
  }
}
```
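Since evaluations run asynchronously, a client typically polls the dashboard until the run completes. A sketch with the HTTP call injected as a callable so the polling logic is self-contained (the response shape follows the example above; fetch_dashboard is an assumption standing in for a real HTTP client):

```python
import time

def wait_for_completion(fetch_dashboard, poll_seconds=5.0, max_polls=120):
    """Poll until the evaluation status is 'completed' (or 'failed'),
    then return the dashboard payload. fetch_dashboard() does the HTTP GET."""
    for _ in range(max_polls):
        data = fetch_dashboard()["data"]
        if data["evaluation"]["status"] in ("completed", "failed"):
            return data
        time.sleep(poll_seconds)
    raise TimeoutError("evaluation did not finish in time")

# Stubbed fetch for illustration: "running" once, then "completed".
responses = iter([
    {"data": {"evaluation": {"status": "running"}}},
    {"data": {"evaluation": {"status": "completed"},
              "risk_overview": {"average_risk_score": 23.5}}},
])
result = wait_for_completion(lambda: next(responses), poll_seconds=0)
print(result["evaluation"]["status"])  # completed
```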

Get Evaluation Results

```bash
curl -s "http://localhost:8080/api/v1/security/guardrails/evaluations/1/results?limit=10&offset=0" \
  -H "Authorization: Bearer $JWT_TOKEN" | jq
```

Guardrails Evaluation Best Practices

1. Start with a Baseline

Run a full evaluation with all categories enabled before making guardrail changes. This gives you a baseline risk score to measure improvements against.

2. Test Regularly

Schedule evaluations after any guardrail configuration change, model update, or security policy modification.

3. Use Appropriate Concurrency

  • Development/testing: Concurrency 3-5, delay 200ms
  • Production endpoints: Concurrency 1-2, delay 500-1000ms
  • Rate-limited APIs: Concurrency 1, delay 1000-2000ms

4. Don't Ignore Benign Controls

The "Benign Controls" category tests legitimate requests that should NOT be blocked. A high block rate on benign tests means your guardrails are too aggressive.

5. Review False Positives Carefully

Before marking a result as a false positive, verify it's truly a false positive and not a real gap. Document your reasoning in the notes field.

Guardrails Evaluation Troubleshooting

"payload template produced invalid JSON"

Cause: The custom payload template is malformed or doesn't contain {prompt} inside a JSON string value.

Fix: Ensure your template is valid JSON with {prompt} as a string value:

```json
{"messages": [{"role": "user", "content": "{prompt}"}]}
```

All Tests Show "error" Status

Check:

  1. Endpoint URL is correct and reachable
  2. Authentication credentials are valid
  3. Test the connection using the Test button
  4. Check the Gateway server logs for detailed error messages

High False Positive Rate

Cause: The endpoint returns generic responses (refusals, error messages) that trigger refusal detection even for benign tests.

Solutions:

  1. Use the Chat Completion request format if your API supports it (more reliable detection)
  2. Configure a custom JQ expression to extract the relevant response field
  3. Mark genuine false positives and document the reasons

Scores Don't Update After False Positive Toggle

Check: Ensure the evaluation status is "completed". Scores only recalculate for completed evaluations. Check the server logs for database errors.

Multi-Turn Tests Failing on Early Turns

Cause: Multi-turn tests expect early turns to receive "allow" responses. If the endpoint blocks all requests (including benign ones), early turns fail.

Solution: This is expected behaviour if the endpoint has very aggressive guardrails. Consider adjusting guardrail sensitivity for conversational contexts, or mark as false positives if appropriate.

Provider Evaluation: All Tests Show "block"

Cause: The provider's safety policy is very aggressive and flags all test prompts (including benign controls) as violations.

Solutions:

  1. Check the provider's configuration — the safety policy or threshold may be too strict
  2. Test benign content in the Guardrail Providers Playground to verify
  3. Mark benign control false positives and note the reason
  4. Consider adjusting the provider's sensitivity settings

Provider Evaluation: Detection Type Mismatch

Note: Provider evaluations compare actions only (block vs allow), not detection types. A provider returning "block" for toxicity will pass a test that expects "block" for prompt_injection — the action matched even though the detection category differs. This is by design since providers use their own detection taxonomies.