Guardrails Evaluation
What Are Guardrails Evaluations?
Guardrails Evaluation is a security testing framework built into the AI Security Gateway that validates whether your AI/LLM guardrails are actually working. Think of it as automated penetration testing for your AI safety controls.
Modern AI deployments use guardrails, content filters, prompt injection detectors, output validators, to prevent misuse. But how do you know they're effective? Guardrails Evaluation answers this by running 71 built-in security test cases against your endpoints and scoring the results against the OWASP LLM Top 10 and NIST AI Risk Management Framework.
Why This Matters
Research shows 43% of community MCP servers have exploitable vulnerabilities. The s1ngularity NPM/NX supply chain attack of 2025 demonstrated how AI agents can be weaponized. Regular guardrails testing ensures your defenses keep pace with evolving threats.
The Guardrails Problem It Solves
- No visibility into whether your guardrails actually block attacks
- Compliance gaps — need to demonstrate OWASP/NIST coverage
- False confidence — guardrails may silently fail on novel attack patterns
- Manual testing doesn't scale across multiple endpoints and attack categories
- No baseline to measure guardrail improvement over time
Guardrails Evaluation Key Concepts
Evaluation Targets
An evaluation runs against a target — either an HTTP endpoint or a configured guardrail provider:
- Endpoints: A saved API configuration pointing to the service you want to test. This can be any web API leveraging an LLM or chatbot, not just standard providers like OpenAI or Anthropic. You configure the URL, authentication, request format, and response handling.
- Guardrail Providers: A configured third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already set up in the Guardrail Providers section. No endpoint configuration needed — the evaluation calls the provider's screening API directly with each test prompt.
Test Cases
Individual security tests, each with a specific attack prompt, expected result (block/allow/redact/alert), and compliance mappings. The Gateway ships with 71 built-in tests across 12 attack categories. You can also create custom tests.
Evaluations
A "scan" is a single run of selected test cases against a target (endpoint or guardrail provider). Evaluations run asynchronously and report progress in real-time. Each produces risk scores and per-test results.
Attack Categories
| Category | Tests | Description |
|---|---|---|
| AI-Amplified Attacks | 4 | AI-amplified supply chain attacks, credential exfiltration, and reconnaissance |
| MCP Security & Tool Poisoning | 8 | MCP tool poisoning, command injection, and protocol exploitation |
| Bypass Techniques | 22 | Security flag manipulation, obfuscation, and encoding-based bypass techniques |
| Prompt Injection | 7 | Direct prompt injection, goal hijacking, and system prompt extraction |
| Data Exfiltration | 4 | PII extraction, credential theft, and proprietary data exfiltration |
| Multi-Turn Escalation | 3 | Crescendo, echo chamber, and many-shot in-context override attacks |
| Semantic & Structural Evasion | 5 | Skeleton key, roleplay, payload splitting, and multilingual evasion |
| Harmful Content & Toxicity | 5 | Requests to generate violent, self-harm, weapons, or other harmful content |
| Misinformation & Disinformation | 4 | Requests to generate fake news, misleading posts, or disinformation |
| PII & Personal Data Extraction | 3 | Attempts to extract personal identifiable information about real individuals |
| Resource Exhaustion & DoS | 3 | Prompts designed to cause infinite loops, excessive output, or resource exhaustion |
| Benign Controls | 8 | Legitimate requests that should NOT be blocked (false positive testing) |
Compliance Scoring
OWASP LLM Top 10 (2025): Each test maps to one or more OWASP items (LLM01 through LLM10). Your score shows coverage and pass rate across the OWASP framework.
NIST AI Risk Management Framework: Tests map to NIST AI RMF functions (GOVERN, MAP, MEASURE, MANAGE). Your score shows alignment with federal AI risk management requirements.
Getting Started with Guardrails Testing
There are two ways to set up an evaluation target: configure an HTTP endpoint, or use an already-configured guardrail provider.
Option A: Configure an HTTP Endpoint
Use this when you want to test an LLM API, a chatbot, or any web service directly over HTTP.
Navigate to Guardrails Eval in the Security Tools section of the sidebar, then click the Endpoints tab.
Click Add Endpoint and configure:
- Name: A descriptive label (e.g., "Production OpenAI with Guardrails")
- URL: The base URL of your API (e.g.,
https://api.openai.com) - Endpoint Path: The API path (default:
/v1/chat/completions) - HTTP Method: Usually
POST - Authentication: Choose Bearer token, API Key, Custom headers, or None
- Request Format:
- Chat Completion — Standard OpenAI-compatible format (recommended for most LLMs)
- Custom — Provide a JSON template with
{prompt}placeholder for non-standard APIs - Guardrails — Simple
{"prompt": "..."}format for dedicated guardrail services
Quick Setup with Curl Import
You can paste a curl command into the Import from Curl field and click Parse. The Gateway will automatically extract the URL, headers, authentication, and payload template.

Custom Payload Template Example
For APIs that don't follow the OpenAI format, use a custom template:
{"data": "{prompt}", "options": {"model": "gpt-4", "max_tokens": 100}}The {prompt} placeholder is replaced with each test case's prompt. The Gateway handles JSON escaping automatically — prompts with newlines, quotes, and special characters work correctly.
Test the Connection
Click the Test button on your endpoint row. This sends a benign "Hello" prompt and reports:
- HTTP status code
- Response latency in milliseconds
- Whether the endpoint is reachable
Option B: Use a Configured Guardrail Provider
Use this when you want to evaluate a third-party guardrail service (Groq Safeguard, EnkryptAI, DynamoAI) that you've already configured in the Guardrail Providers section.
No endpoint setup is required — the evaluation calls the provider's Check() API directly with each test prompt. This is the fastest way to measure a guardrail provider's detection effectiveness.
When to Use Provider-Based Evaluation
- You want to benchmark a guardrail provider's detection rates before deploying it to production
- You're comparing multiple providers against the same test suite
- You want to test the provider in isolation without routing through a full LLM endpoint
Running an Evaluation
Click the Evaluations tab, then New Evaluation.

The wizard has three steps:
- Name & Target: Give your evaluation a name, then choose the target type:
- Toggle Endpoint to select a configured HTTP endpoint
- Toggle Guardrail Provider to select a configured provider (Groq, EnkryptAI, DynamoAI)
- Test Selection: Choose categories to include, or leave empty to run all enabled tests
- Configuration: Set concurrency (1-10 parallel tests) and delay between tests (100-2000ms)
Click Start Evaluation. Progress updates in real-time via the progress bar.
Rate Limiting
If your target API has rate limits, set concurrency to 1-2 and increase the delay. Running too many tests in parallel may trigger rate limiting and cause test failures.
Endpoint vs Provider Evaluations
| Aspect | Endpoint Evaluation | Provider Evaluation |
|---|---|---|
| Target | Any HTTP API (LLM, chatbot, custom) | Configured guardrail provider |
| Setup | URL, auth, request format, response format | Just select the provider |
| What's tested | Full stack: network, auth, LLM, guardrails | Provider's screening API in isolation |
| Detection | Analyzes HTTP response for refusals, blocks, filters | Maps provider verdict directly (safe/violation) |
| Pass/fail logic | Matches expected result + expected detection type | Matches expected action only (block satisfies block/alert/redact) |
| Best for | End-to-end testing of production deployments | Benchmarking provider detection rates |
Review Results
Click View Results on a completed evaluation to see:
- Risk Overview: Average risk score (0-100, lower is better), OWASP score, NIST score
- Category Breakdown: Per-category pass rates and risk scores
- Results Table: Individual test results with status, response time, detection type

Understanding Guardrails Evaluation Results
Risk Score Interpretation
| Score Range | Rating | Meaning |
|---|---|---|
| 0-20 | Excellent | Guardrails are highly effective |
| 21-40 | Good | Most attacks blocked, some gaps |
| 41-60 | Moderate | Significant gaps in coverage |
| 61-80 | Poor | Many attacks pass through |
| 81-100 | Critical | Guardrails are largely ineffective |
Test Result Statuses
| Status | Meaning |
|---|---|
| Passed | Test result matched expected outcome (guardrail worked correctly) |
| Failed | Test result didn't match — guardrail missed or false positive |
| Error | Test couldn't execute (connection issue, timeout, payload error) |
Detection Types
When a guardrail blocks a request, the Gateway identifies how it was blocked:
Endpoint evaluations:
| Detection Type | Description |
|---|---|
refusal_detected | LLM responded with a polite refusal (HTTP 200) |
content_filter | Explicit content filter triggered (finish_reason) |
guardrail_blocked | Guardrail API returned explicit block |
content_blocked | Response contained block indicators |
prompt_injection | Specific prompt injection detection |
Provider evaluations report the provider's own detection categories:
| Detection Type | Description |
|---|---|
prompt_injection | Prompt injection or jailbreak attempt |
toxicity | Toxic or harmful content |
hate_speech | Hate speech or discrimination |
harassment | Harassment or abusive language |
violence | Violent content or threats |
nsfw | Sexual or adult content |
pii_privacy | PII or privacy violation |
policy_violation | General policy violation |
Managing False Positives
Sometimes a guardrail correctly blocks a request, but the test was designed to expect "allow" (benign control tests), or vice versa. You can mark results as false positives to exclude them from scoring.
Marking a False Positive
- Click on a result row to expand it
- Click the False Positive toggle
- Enter notes explaining why (required)
- Click Save
The evaluation's risk scores recalculate immediately. False positives are treated as "passed" in all score calculations.
When to Use False Positives
- Benign tests blocked by overly aggressive guardrails
- Tests that fail due to endpoint-specific behavior (not a real vulnerability)
- Results affected by temporary issues (rate limiting, timeouts)
Creating Custom Guardrails Test Cases
Click Test Cases tab, then Add Custom Test. Fill in:
- Test ID: Unique identifier (e.g., "CUSTOM-001")
- Name: Descriptive name
- Category: Select from existing categories
- Severity: Critical, High, Medium, or Low
- Expected Result: Block, Allow, Redact, or Alert
- Test Type: Single prompt or multi-turn conversation
- Prompt: The attack or test prompt
Custom tests appear alongside built-in tests and are included in evaluations when their category is selected.
Guardrails Evaluation API Reference
All endpoints are under /api/v1/security/guardrails/ and require JWT authentication.
Start an Evaluation via API
Against an endpoint:
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
-H "Authorization: Bearer $JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Nightly Scan",
"endpoint_id": 1,
"categories": ["prompt_injection", "data_exfiltration"],
"concurrency": 3,
"delay_ms": 200
}' | jqAgainst a guardrail provider:
curl -s -X POST http://localhost:8080/api/v1/security/guardrails/evaluations \
-H "Authorization: Bearer $JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Provider Benchmark",
"guardrail_provider_id": 2,
"categories": ["prompt_injection", "harmful_content"],
"concurrency": 5,
"delay_ms": 200
}' | jqProvide either endpoint_id or guardrail_provider_id, not both.
Get Evaluation Dashboard
curl -s http://localhost:8080/api/v1/security/guardrails/evaluations/1/dashboard \
-H "Authorization: Bearer $JWT_TOKEN" | jqResponse:
{
"success": true,
"data": {
"evaluation": { "id": 1, "name": "Nightly Scan", "status": "completed" },
"risk_overview": {
"average_risk_score": 23.5,
"owasp_score": 78.2,
"nist_score": 81.0
},
"category_breakdown": [
{ "category": "prompt_injection", "total": 5, "passed": 4, "failed": 1, "risk_score": 20.0 }
],
"pass_rate": 86.0,
"avg_response_ms": 245.3,
"false_positive_count": 2
}
}Get Evaluation Results
curl -s "http://localhost:8080/api/v1/security/guardrails/evaluations/1/results?limit=10&offset=0" \
-H "Authorization: Bearer $JWT_TOKEN" | jqGuardrails Evaluation Best Practices
1. Start with a Baseline
Run a full evaluation with all categories enabled before making guardrail changes. This gives you a baseline risk score to measure improvements against.
2. Test Regularly
Schedule evaluations after any guardrail configuration change, model update, or security policy modification.
3. Use Appropriate Concurrency
- Development/testing: Concurrency 3-5, delay 200ms
- Production endpoints: Concurrency 1-2, delay 500-1000ms
- Rate-limited APIs: Concurrency 1, delay 1000-2000ms
4. Don't Ignore Benign Controls
The "Benign Controls" category tests legitimate requests that should NOT be blocked. A high block rate on benign tests means your guardrails are too aggressive.
5. Review False Positives Carefully
Before marking a result as a false positive, verify it's truly a false positive and not a real gap. Document your reasoning in the notes field.
Guardrails Evaluation Troubleshooting
"payload template produced invalid JSON"
Cause: The custom payload template is malformed or doesn't contain {prompt} inside a JSON string value.
Fix: Ensure your template is valid JSON with {prompt} as a string value:
{"messages": [{"role": "user", "content": "{prompt}"}]}All Tests Show "error" Status
Check:
- Endpoint URL is correct and reachable
- Authentication credentials are valid
- Test the connection using the Test button
- Check the Gateway server logs for detailed error messages
High False Positive Rate
Cause: The endpoint returns generic responses (refusals, error messages) that trigger refusal detection even for benign tests.
Solutions:
- Use the Chat Completion request format if your API supports it (more reliable detection)
- Configure a custom JQ expression to extract the relevant response field
- Mark genuine false positives and document the reasons
Scores Don't Update After False Positive Toggle
Check: Ensure the evaluation status is "completed". Scores only recalculate for completed evaluations. Check the server logs for database errors.
Multi-Turn Tests Failing on Early Turns
Cause: Multi-turn tests expect early turns to receive "allow" responses. If the endpoint blocks all requests (including benign ones), early turns fail.
Solution: This is expected behaviour if the endpoint has very aggressive guardrails. Consider adjusting guardrail sensitivity for conversational contexts, or mark as false positives if appropriate.
Provider Evaluation: All Tests Show "block"
Cause: The provider's safety policy is very aggressive and flags all test prompts (including benign controls) as violations.
Solutions:
- Check the provider's configuration — the safety policy or threshold may be too strict
- Test benign content in the Guardrail Providers Playground to verify
- Mark benign control false positives and note the reason
- Consider adjusting the provider's sensitivity settings
Provider Evaluation: Detection Type Mismatch
Note: Provider evaluations compare actions only (block vs allow), not detection types. A provider returning "block" for toxicity will pass a test that expects "block" for prompt_injection — the action matched even though the detection category differs. This is by design since providers use their own detection taxonomies.
Related Guardrails Documentation
- Guardrail Providers Guide — Configure and manage third-party guardrail providers
- Guardrail Providers Technical Documentation — Architecture, API reference, and implementation details
- Guardrails Evaluation Operations Guide — Detailed operational procedures
- Custom Policies Guide — Configure security policies
- Canary Token Detection — Another security testing feature