Guardrails Evaluation

What Are Guardrails Evaluations?

Guardrails Evaluation is a security testing framework built into the AI Security Gateway that validates whether your AI/LLM guardrails are actually working. Think of it as automated penetration testing for your AI safety controls.

Modern AI deployments use guardrails (content filters, prompt injection detectors, output validators) to prevent misuse. But how do you know they're effective? Guardrails Evaluation answers this by running 71 built-in security test cases against your endpoints and scoring the results against the OWASP LLM Top 10 and the NIST AI Risk Management Framework.

Why This Matters

Research shows 43% of community MCP servers have exploitable vulnerabilities. The s1ngularity NPM/NX supply chain attack of 2025 demonstrated how AI agents can be weaponized. Regular guardrails testing ensures your defenses keep pace with evolving threats.

The Guardrails Problem It Solves

  • No visibility into whether your guardrails actually block attacks
  • Compliance gaps — need to demonstrate OWASP/NIST coverage
  • False confidence — guardrails may silently fail on novel attack patterns
  • Manual testing doesn't scale across multiple endpoints and attack categories
  • No baseline to measure guardrail improvement over time

Guardrails Evaluation Key Concepts

Evaluation Targets

An evaluation runs against a target — either an HTTP endpoint or a configured guardrail provider:

  • Endpoints: A saved API configuration pointing to the service you want to test. This can be any web API leveraging an LLM or chatbot, not just standard providers like OpenAI or Anthropic. You configure the URL, authentication, request format, and response handling.
  • Guardrail Providers: A configured third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already set up in the Guardrail Providers section. No endpoint configuration needed — the evaluation calls the provider's screening API directly with each test prompt.

Test Cases

Individual security tests, each with a specific attack prompt, expected result (block/allow/redact/alert), and compliance mappings. The Gateway ships with 71 built-in tests across 12 attack categories. You can also create custom tests.

Evaluations

An evaluation (or "scan") is a single run of selected test cases against a target (endpoint or guardrail provider). Evaluations run asynchronously and report progress in real time. Each produces risk scores and per-test results.

Attack Categories

| Category | Tests | Description |
|---|---|---|
| AI-Amplified Attacks | 4 | AI-amplified supply chain attacks, credential exfiltration, and reconnaissance |
| MCP Security & Tool Poisoning | 8 | MCP tool poisoning, command injection, and protocol exploitation |
| Bypass Techniques | 22 | Security flag manipulation, obfuscation, and encoding-based bypass techniques |
| Prompt Injection | 7 | Direct prompt injection, goal hijacking, and system prompt extraction |
| Data Exfiltration | 4 | PII extraction, credential theft, and proprietary data exfiltration |
| Multi-Turn Escalation | 3 | Crescendo, echo chamber, and many-shot in-context override attacks |
| Semantic & Structural Evasion | 5 | Skeleton key, roleplay, payload splitting, and multilingual evasion |
| Harmful Content & Toxicity | 5 | Requests to generate violent, self-harm, weapons, or other harmful content |
| Misinformation & Disinformation | 4 | Requests to generate fake news, misleading posts, or disinformation |
| PII & Personal Data Extraction | 3 | Attempts to extract personally identifiable information about real individuals |
| Resource Exhaustion & DoS | 3 | Prompts designed to cause infinite loops, excessive output, or resource exhaustion |
| Benign Controls | 8 | Legitimate requests that should NOT be blocked (false positive testing) |

Compliance Scoring

OWASP LLM Top 10 (2025): Each test maps to one or more OWASP items (LLM01 through LLM10). Your score shows coverage and pass rate across the OWASP framework.

NIST AI Risk Management Framework: Tests map to NIST AI RMF functions (GOVERN, MAP, MEASURE, MANAGE). Your score shows alignment with federal AI risk management requirements.
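The exact scoring formula isn't specified here; as one plausible illustration (every name and the result shape below are assumptions for the sketch, not the Gateway's schema), a framework score can be built from per-item pass rates over the items each test maps to:

```python
from collections import defaultdict

def framework_pass_rates(results):
    """Per-item pass rate across a compliance framework, given results of
    the hypothetical form {"passed": bool, "items": ["LLM01", ...]}."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        for item in r["items"]:           # a test may map to several items
            totals[item] += 1
            passed[item] += r["passed"]
    return {item: 100.0 * passed[item] / totals[item] for item in totals}

results = [
    {"passed": True,  "items": ["LLM01"]},
    {"passed": False, "items": ["LLM01", "LLM02"]},
    {"passed": True,  "items": ["LLM02"]},
]
print(framework_pass_rates(results))  # {'LLM01': 50.0, 'LLM02': 50.0}
```

Coverage (which items have any tests at all) and pass rate (how those tests fared) are distinct signals; a real score may weight them differently.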

Getting Started with Guardrails Testing

There are two ways to set up an evaluation target: configure an HTTP endpoint, or use an already-configured guardrail provider.

Option A: Configure an HTTP Endpoint

Use this when you want to test an LLM API, a chatbot, or any web service directly over HTTP.

Navigate to Guardrails Eval in the Security Tools section of the sidebar, then click the Endpoints tab.

Click Add Endpoint and configure:

  1. Name: A descriptive label (e.g., "Production OpenAI with Guardrails")
  2. URL: The base URL of your API (e.g., https://api.openai.com)
  3. Endpoint Path: The API path (default: /v1/chat/completions)
  4. HTTP Method: Usually POST
  5. Authentication: Choose Bearer token, API Key, Custom headers, or None
  6. Request Format:
    • Chat Completion — Standard OpenAI-compatible format (recommended for most LLMs)
    • Custom — Provide a JSON template with {prompt} placeholder for non-standard APIs
    • Guardrails — Simple {"prompt": "..."} format for dedicated guardrail services

Quick Setup with Curl Import

You can paste a curl command into the Import from Curl field and click Parse. The Gateway will automatically extract the URL, headers, authentication, and payload template.

Add Endpoint

Custom Payload Template Example

For APIs that don't follow the OpenAI format, use a custom template:

```json
{"data": "{prompt}", "options": {"model": "gpt-4", "max_tokens": 100}}
```

The {prompt} placeholder is replaced with each test case's prompt. The Gateway handles JSON escaping automatically — prompts with newlines, quotes, and special characters work correctly.
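Substitution has to be JSON-aware rather than plain string replacement. A minimal sketch of the idea (a hypothetical helper, not the Gateway's actual code), assuming the placeholder is a standalone quoted string value as required above:

```python
import json

def render_payload(template: str, prompt: str) -> str:
    """Replace the quoted {prompt} placeholder with a JSON-escaped prompt.

    json.dumps() emits the surrounding quotes and escapes newlines,
    quotes, and other special characters, so the result stays valid JSON.
    """
    return template.replace('"{prompt}"', json.dumps(prompt))

template = '{"data": "{prompt}", "options": {"model": "gpt-4", "max_tokens": 100}}'
payload = render_payload(template, 'Line one\nSay "hi"')

parsed = json.loads(payload)  # still valid JSON
print(parsed["data"])         # prompt round-trips intact, newline and quotes included
```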

Test the Connection

Click the Test button on your endpoint row. This sends a benign "Hello" prompt and reports:

  • HTTP status code
  • Response latency in milliseconds
  • Whether the endpoint is reachable

Option B: Use a Configured Guardrail Provider

Use this when you want to evaluate a third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already configured in the Guardrail Providers section.

No endpoint setup is required — the evaluation calls the provider's Check() API directly with each test prompt. This is the fastest way to measure a guardrail provider's detection effectiveness.

When to Use Provider-Based Evaluation

  • You want to benchmark a guardrail provider's detection rates before deploying it to production
  • You're comparing multiple providers against the same test suite
  • You want to test the provider in isolation without routing through a full LLM endpoint

Running an Evaluation

Click the Evaluations tab, then New Evaluation.

Run Evaluation

The wizard has three steps:

  1. Name & Target: Give your evaluation a name, then choose the target type:
    • Toggle Endpoint to select a configured HTTP endpoint
    • Toggle Guardrail Provider to select a configured provider (Groq, EnkryptAI, DynamoAI)
  2. Test Selection: Choose categories to include, or leave empty to run all enabled tests
  3. Configuration: Set concurrency (1-10 parallel tests) and delay between tests (100-2000ms)

Click Start Evaluation. Progress updates in real time via the progress bar.

Rate Limiting

If your target API has rate limits, set concurrency to 1-2 and increase the delay. Running too many tests in parallel may trigger rate limiting and cause test failures.
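As a rough sizing guide (an estimate, not a documented formula), steady-state request rate is bounded by concurrency divided by the per-test cycle time, i.e. response latency plus the configured delay:

```python
def approx_requests_per_second(concurrency: int, latency_ms: float, delay_ms: float) -> float:
    """Rough upper bound on test throughput: each worker completes one
    test per (latency + delay) milliseconds."""
    return concurrency * 1000.0 / (latency_ms + delay_ms)

# e.g. 2 workers, ~500 ms responses, 500 ms delay -> about 2 requests/sec
print(round(approx_requests_per_second(2, 500, 500), 2))
```

Compare the result against your target's rate limit before starting a run; if it exceeds the limit, lower concurrency or raise the delay.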

Endpoint vs Provider Evaluations

| Aspect | Endpoint Evaluation | Provider Evaluation |
|---|---|---|
| Target | Any HTTP API (LLM, chatbot, custom) | Configured guardrail provider |
| Setup | URL, auth, request format, response format | Just select the provider |
| What's tested | Full stack: network, auth, LLM, guardrails | Provider's screening API in isolation |
| Detection | Analyzes HTTP response for refusals, blocks, filters | Maps provider verdict directly (safe/violation) |
| Pass/fail logic | Matches expected result + expected detection type | Matches expected action only (block satisfies block/alert/redact) |
| Best for | End-to-end testing of production deployments | Benchmarking provider detection rates |
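The provider pass/fail rule above can be sketched as a small predicate (an illustration of the described behavior, not the Gateway's actual code): a provider only returns a block or allow verdict, so any blocking expectation (block, alert, redact) is satisfied by a block:

```python
# Expected results that mean "the guardrail should intervene".
BLOCKING_EXPECTATIONS = {"block", "alert", "redact"}

def provider_test_passes(expected: str, verdict: str) -> bool:
    """Provider evaluations compare actions only: a 'block' verdict
    satisfies block/alert/redact; benign controls expect 'allow'."""
    if expected in BLOCKING_EXPECTATIONS:
        return verdict == "block"
    return verdict == "allow"

print(provider_test_passes("redact", "block"))  # True: block satisfies redact
print(provider_test_passes("allow", "block"))   # False: benign control was blocked
```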

Review Results

Click View Results on a completed evaluation to see:

  • Risk Overview: Average risk score (0-100, lower is better), OWASP score, NIST score
  • Category Breakdown: Per-category pass rates and risk scores
  • Results Table: Individual test results with status, response time, detection type

Report

Understanding Guardrails Evaluation Results

Risk Score Interpretation

| Score Range | Rating | Meaning |
|---|---|---|
| 0-20 | Excellent | Guardrails are highly effective |
| 21-40 | Good | Most attacks blocked, some gaps |
| 41-60 | Moderate | Significant gaps in coverage |
| 61-80 | Poor | Many attacks pass through |
| 81-100 | Critical | Guardrails are largely ineffective |
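The bands above translate directly into a lookup (a sketch mirroring the published ranges, assuming inclusive upper bounds):

```python
def risk_rating(score: float) -> str:
    """Map a 0-100 risk score (lower is better) to its rating band."""
    if score <= 20:
        return "Excellent"
    if score <= 40:
        return "Good"
    if score <= 60:
        return "Moderate"
    if score <= 80:
        return "Poor"
    return "Critical"

print(risk_rating(23.5))  # "Good" -- e.g. the average_risk_score in the API example below
```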

Test Result Statuses

| Status | Meaning |
|---|---|
| Passed | Test result matched expected outcome (guardrail worked correctly) |
| Failed | Test result didn't match — guardrail missed or false positive |
| Error | Test couldn't execute (connection issue, timeout, payload error) |

Detection Types

When a guardrail blocks a request, the Gateway identifies how it was blocked:

Endpoint evaluations:

| Detection Type | Description |
|---|---|
| refusal_detected | LLM responded with a polite refusal (HTTP 200) |
| content_filter | Explicit content filter triggered (finish_reason) |
| guardrail_blocked | Guardrail API returned explicit block |
| content_blocked | Response contained block indicators |
| prompt_injection | Specific prompt injection detection |

Provider evaluations report the provider's own detection categories:

| Detection Type | Description |
|---|---|
| prompt_injection | Prompt injection or jailbreak attempt |
| toxicity | Toxic or harmful content |
| hate_speech | Hate speech or discrimination |
| harassment | Harassment or abusive language |
| violence | Violent content or threats |
| nsfw | Sexual or adult content |
| pii_privacy | PII or privacy violation |
| policy_violation | General policy violation |

Managing False Positives

Sometimes a guardrail correctly blocks a request, but the test was designed to expect "allow" (benign control tests), or vice versa. You can mark results as false positives to exclude them from scoring.

Marking a False Positive

  1. Click on a result row to expand it
  2. Click the False Positive toggle
  3. Enter notes explaining why (required)
  4. Click Save

The evaluation's risk scores recalculate immediately. False positives are treated as "passed" in all score calculations.
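The recalculation amounts to counting false-positive-marked results as passes. A minimal sketch (the result shape below is a hypothetical, not the Gateway's schema):

```python
def pass_rate(results: list[dict]) -> float:
    """Percentage of tests counted as passed, where a result marked
    false_positive is treated as passed regardless of its status."""
    if not results:
        return 0.0
    passed = sum(
        1 for r in results
        if r["status"] == "passed" or r.get("false_positive")
    )
    return 100.0 * passed / len(results)

results = [
    {"status": "passed"},
    {"status": "failed", "false_positive": True},  # benign control, FP-marked
    {"status": "failed"},
    {"status": "error"},
]
print(pass_rate(results))  # 50.0
```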

When to Use False Positives

  • Benign tests blocked by overly aggressive guardrails
  • Tests that fail due to endpoint-specific behavior (not a real vulnerability)
  • Results affected by temporary issues (rate limiting, timeouts)

Creating Custom Guardrails Test Cases

Click Test Cases tab, then Add Custom Test. Fill in:

  • Test ID: Unique identifier (e.g., "CUSTOM-001")
  • Name: Descriptive name
  • Category: Select from existing categories
  • Severity: Critical, High, Medium, or Low
  • Expected Result: Block, Allow, Redact, or Alert
  • Test Type: Single prompt or multi-turn conversation
  • Prompt: The attack or test prompt

Custom tests appear alongside built-in tests and are included in evaluations when their category is selected.

Guardrails Evaluation API Reference

All endpoints are under /api/v1/security/guardrails/ and require JWT authentication.

Start an Evaluation via API

Against an endpoint:

```bash
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly Scan",
    "endpoint_id": 1,
    "categories": ["prompt_injection", "data_exfiltration"],
    "concurrency": 3,
    "delay_ms": 200
  }' | jq
```

Against a guardrail provider:

```bash
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Provider Benchmark",
    "guardrail_provider_id": 2,
    "categories": ["prompt_injection", "harmful_content"],
    "concurrency": 5,
    "delay_ms": 200
  }' | jq
```

Provide either endpoint_id or guardrail_provider_id, not both.

Get Evaluation Dashboard

```bash
curl -s http://localhost:8080/api/v1/security/guardrails/evaluations/1/dashboard \
  -H "Authorization: Bearer $JWT_TOKEN" | jq
```

Response:

```json
{
  "success": true,
  "data": {
    "evaluation": { "id": 1, "name": "Nightly Scan", "status": "completed" },
    "risk_overview": {
      "average_risk_score": 23.5,
      "owasp_score": 78.2,
      "nist_score": 81.0
    },
    "category_breakdown": [
      { "category": "prompt_injection", "total": 5, "passed": 4, "failed": 1, "risk_score": 20.0 }
    ],
    "pass_rate": 86.0,
    "avg_response_ms": 245.3,
    "false_positive_count": 2
  }
}
```
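Since evaluations run asynchronously, a client typically polls the dashboard until the run completes. A sketch with the HTTP call injected as a callable so the polling logic is self-contained (the response shape follows the example above; fetch_dashboard is an assumption standing in for a real HTTP client):

```python
import time

def wait_for_completion(fetch_dashboard, poll_seconds=5.0, max_polls=120):
    """Poll until the evaluation status is 'completed' (or 'failed'),
    then return the dashboard payload. fetch_dashboard() does the HTTP GET."""
    for _ in range(max_polls):
        data = fetch_dashboard()["data"]
        if data["evaluation"]["status"] in ("completed", "failed"):
            return data
        time.sleep(poll_seconds)
    raise TimeoutError("evaluation did not finish in time")

# Stubbed fetch for illustration: "running" once, then "completed".
responses = iter([
    {"data": {"evaluation": {"status": "running"}}},
    {"data": {"evaluation": {"status": "completed"},
              "risk_overview": {"average_risk_score": 23.5}}},
])
result = wait_for_completion(lambda: next(responses), poll_seconds=0)
print(result["evaluation"]["status"])  # completed
```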

Get Evaluation Results

```bash
curl -s "http://localhost:8080/api/v1/security/guardrails/evaluations/1/results?limit=10&offset=0" \
  -H "Authorization: Bearer $JWT_TOKEN" | jq
```

Guardrails Evaluation Best Practices

1. Start with a Baseline

Run a full evaluation with all categories enabled before making guardrail changes. This gives you a baseline risk score to measure improvements against.

2. Test Regularly

Schedule evaluations after any guardrail configuration change, model update, or security policy modification.

3. Use Appropriate Concurrency

  • Development/testing: Concurrency 3-5, delay 200ms
  • Production endpoints: Concurrency 1-2, delay 500-1000ms
  • Rate-limited APIs: Concurrency 1, delay 1000-2000ms

4. Don't Ignore Benign Controls

The "Benign Controls" category tests legitimate requests that should NOT be blocked. A high block rate on benign tests means your guardrails are too aggressive.

5. Review False Positives Carefully

Before marking a result as a false positive, verify it's truly a false positive and not a real gap. Document your reasoning in the notes field.

Guardrails Evaluation Troubleshooting

"payload template produced invalid JSON"

Cause: The custom payload template is malformed or doesn't contain {prompt} inside a JSON string value.

Fix: Ensure your template is valid JSON with {prompt} as a string value:

```json
{"messages": [{"role": "user", "content": "{prompt}"}]}
```

All Tests Show "error" Status

Check:

  1. Endpoint URL is correct and reachable
  2. Authentication credentials are valid
  3. Test the connection using the Test button
  4. Check the Gateway server logs for detailed error messages

High False Positive Rate

Cause: The endpoint returns generic responses (refusals, error messages) that trigger refusal detection even for benign tests.

Solutions:

  1. Use the Chat Completion request format if your API supports it (more reliable detection)
  2. Configure a custom JQ expression to extract the relevant response field
  3. Mark genuine false positives and document the reasons

Scores Don't Update After False Positive Toggle

Check: Ensure the evaluation status is "completed". Scores only recalculate for completed evaluations. Check the server logs for database errors.

Multi-Turn Tests Failing on Early Turns

Cause: Multi-turn tests expect early turns to receive "allow" responses. If the endpoint blocks all requests (including benign ones), early turns fail.

Solution: This is expected behaviour if the endpoint has very aggressive guardrails. Consider adjusting guardrail sensitivity for conversational contexts, or mark as false positives if appropriate.

Provider Evaluation: All Tests Show "block"

Cause: The provider's safety policy is very aggressive and flags all test prompts (including benign controls) as violations.

Solutions:

  1. Check the provider's configuration — the safety policy or threshold may be too strict
  2. Test benign content in the Guardrail Providers Playground to verify
  3. Mark benign control false positives and note the reason
  4. Consider adjusting the provider's sensitivity settings

Provider Evaluation: Detection Type Mismatch

Note: Provider evaluations compare actions only (block vs allow), not detection types. A provider returning "block" for toxicity will pass a test that expects "block" for prompt_injection — the action matched even though the detection category differs. This is by design since providers use their own detection taxonomies.