How to Evaluate LLM Providers
Quorvex AI provides a full LLM evaluation platform for testing AI providers against structured test suites, comparing models, iterating on prompts, and tracking performance over time.
Overview
The LLM testing framework provides:
- Provider management -- register any OpenAI-compatible API with encrypted credentials and custom pricing
- Test specs -- markdown-based test suites with system prompts, test cases, and assertions
- Datasets -- reusable test case collections with versioning, CSV import/export, and golden marking
- Run & Compare -- execute suites against providers, compare models side-by-side with scoring
- Prompt iterations -- A/B test system prompt changes with automated scoring against baselines
- AI suite generation -- generate test suites from system prompt + app description
- Dataset augmentation -- AI-powered generation of edge cases, adversarial inputs, and boundary tests
- Scheduling -- cron-based automated runs with execution history
- Analytics -- trends, latency, cost tracking, regression detection, golden dashboard
Prerequisites
- Quorvex AI installed and running (make dev or make prod-dev)
- At least one LLM provider API key (OpenAI, Anthropic, or any OpenAI-compatible API)
Step-by-Step Usage
1. Register a Provider
Navigate to LLM Testing > Providers in the dashboard, or use the API:
curl -X POST http://localhost:8001/llm-testing/providers \
-H "Content-Type: application/json" \
-d '{
"name": "OpenAI GPT-4o",
"provider_type": "openai",
"api_key": "sk-...",
"base_url": "https://api.openai.com/v1",
"model_id": "gpt-4o",
"input_cost_per_1k": 0.005,
"output_cost_per_1k": 0.015,
"project_id": "your-project-id"
}'
The API key is encrypted at rest. You can register multiple providers to compare.
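The same registration can be scripted. The following is a minimal Python sketch using only the standard library. It assumes the endpoint and field names from the curl example above; `build_request` is a hypothetical helper name, not part of Quorvex AI. The request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # same host and port as the curl example


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the LLM testing API."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


provider = {
    "name": "OpenAI GPT-4o",
    "provider_type": "openai",
    "api_key": "sk-...",  # placeholder; load the real key from a secret store
    "base_url": "https://api.openai.com/v1",
    "model_id": "gpt-4o",
    "input_cost_per_1k": 0.005,
    "output_cost_per_1k": 0.015,
    "project_id": "your-project-id",
}

request = build_request("/llm-testing/providers", provider)
# To actually register the provider, send the request:
#   with urllib.request.urlopen(request) as response:
#       print(response.status, response.read().decode())
```

Because the helper takes the path and payload as arguments, it can be reused for the other POST endpoints in this guide.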
2. Create a Test Spec
Write a markdown test suite:
# LLM Test: Customer Support Bot
## System Prompt
You are a helpful customer support assistant for an e-commerce platform.
Always be polite, concise, and suggest relevant products when appropriate.
## Test Cases
### Greeting
- Input: "Hello!"
- Expected: Response should be a friendly greeting
- Assertions:
  - Contains a greeting word (hello, hi, hey, welcome)
  - Length < 200 characters
### Order Status
- Input: "Where is my order #12345?"
- Expected: Should acknowledge the order number and offer to help
- Assertions:
  - Mentions order number "12345"
  - Offers to look up the status
### Refund Request
- Input: "I want a refund for my broken laptop"
- Expected: Should empathize and explain the refund process
- Assertions:
  - Shows empathy
  - Mentions refund policy or process
### Off-Topic
- Input: "What's the meaning of life?"
- Expected: Should politely redirect to support topics
- Assertions:
  - Does not provide a philosophical answer
  - Redirects to e-commerce support
Save via the dashboard or API:
curl -X POST http://localhost:8001/llm-testing/specs \
-H "Content-Type: application/json" \
-d '{
"name": "customer-support-bot",
"content": "# LLM Test: Customer Support Bot\n...",
"project_id": "your-project-id"
}'
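The content field carries the whole markdown spec as a single JSON string, so newlines and double quotes must be escaped. Hand-escaping is error-prone; the sketch below lets `json.dumps` do it. The spec text is abbreviated, and `build_spec_payload` is a hypothetical helper name used only for illustration.

```python
import json

# Abbreviated copy of the markdown spec shown earlier in this step
SPEC_MARKDOWN = """# LLM Test: Customer Support Bot
## System Prompt
You are a helpful customer support assistant for an e-commerce platform.
## Test Cases
### Greeting
- Input: "Hello!"
- Expected: Response should be a friendly greeting
"""


def build_spec_payload(name: str, markdown: str, project_id: str) -> str:
    """Return the JSON body for POST /llm-testing/specs."""
    return json.dumps(
        {"name": name, "content": markdown, "project_id": project_id}
    )


body = build_spec_payload("customer-support-bot", SPEC_MARKDOWN, "your-project-id")
print(body)
```

The resulting string can be written to a file and passed to curl with -d @file.json instead of being inlined in the shell command.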
3. Run a Test Suite
Execute the suite against a provider:
curl -X POST http://localhost:8001/llm-testing/run \
-H "Content-Type: application/json" \
-d '{
"spec_name": "customer-support-bot",
"provider_id": "PROVIDER_ID",
"project_id": "your-project-id"
}'
The system sends each test case to the provider, evaluates assertions, and stores results.
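This guide does not describe how assertions are evaluated internally. Purely as an illustration, the two mechanical assertions from the Greeting test case could be checked as in the sketch below. All function names here are hypothetical, and assertions such as "Shows empathy" would need a different approach than string matching.

```python
import re

GREETING_WORDS = ("hello", "hi", "hey", "welcome")


def contains_greeting(response: str) -> bool:
    """True if the response contains a greeting word as a whole word."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return any(word in words for word in GREETING_WORDS)


def shorter_than(response: str, limit: int) -> bool:
    """True if the response has fewer than `limit` characters."""
    return len(response) < limit


def evaluate_greeting_case(response: str) -> dict:
    """Evaluate both Greeting assertions and an overall pass flag."""
    results = {
        "contains_greeting": contains_greeting(response),
        "length_under_200": shorter_than(response, 200),
    }
    results["passed"] = all(results.values())
    return results


print(evaluate_greeting_case("Hi there! How can I help you today?"))
```

Matching on whole words avoids counting "this" as a greeting just because it contains "hi".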
4. Compare Providers
Run the same suite against multiple providers and compare:
curl -X POST http://localhost:8001/llm-testing/compare \
-H "Content-Type: application/json" \
-d '{
"spec_name": "customer-support-bot",
"provider_ids": ["PROVIDER_1", "PROVIDER_2", "PROVIDER_3"],
"project_id": "your-project-id"
}'
The comparison shows a scoring matrix with pass rates, latency, and cost per provider.
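To make the three columns of the matrix concrete, the sketch below aggregates per-test-case results into pass rate, average latency, and total cost per provider. The result records and their field names (`passed`, `latency_ms`, `cost_usd`) are assumptions made for illustration; the actual response shape of the compare endpoint is not documented here.

```python
from statistics import mean

# Hypothetical per-test-case results; field names are assumptions
results = {
    "PROVIDER_1": [
        {"passed": True, "latency_ms": 400, "cost_usd": 0.002},
        {"passed": True, "latency_ms": 500, "cost_usd": 0.002},
        {"passed": True, "latency_ms": 600, "cost_usd": 0.002},
        {"passed": False, "latency_ms": 500, "cost_usd": 0.002},
    ],
    "PROVIDER_2": [
        {"passed": True, "latency_ms": 200, "cost_usd": 0.001},
        {"passed": True, "latency_ms": 300, "cost_usd": 0.001},
        {"passed": False, "latency_ms": 200, "cost_usd": 0.001},
        {"passed": False, "latency_ms": 300, "cost_usd": 0.001},
    ],
}


def summarize(cases: list) -> dict:
    """Collapse one provider's test-case results into a matrix row."""
    return {
        "pass_rate": sum(c["passed"] for c in cases) / len(cases),
        "avg_latency_ms": mean(c["latency_ms"] for c in cases),
        "total_cost_usd": sum(c["cost_usd"] for c in cases),
    }


matrix = {provider_id: summarize(cases) for provider_id, cases in results.items()}
for provider_id, row in matrix.items():
    print(provider_id, row)
```

In this made-up data, PROVIDER_1 passes more cases while PROVIDER_2 is faster and cheaper, which is the kind of trade-off the matrix is meant to surface.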
5. Use Datasets
Create reusable test case collections:
curl -X POST http://localhost:8001/llm-testing/datasets \
-H "Content-Type: application/json" \
-d '{
"name": "edge-cases",
"description": "Edge case inputs for support bot",
"cases": [
{"input": "🔥🔥🔥", "expected": "Should handle emoji-only input"},
{"input": "", "expected": "Should handle empty input gracefully"},
{"input": "a]{{very long input repeated 1000 times}}", "expected": "Should handle long input"}
],
"project_id": "your-project-id"
}'
Datasets support:
- Versioning -- track changes over time
- CSV import/export -- bulk management
- Golden marking -- mark a dataset as the baseline for regression detection
- AI augmentation -- generate additional test cases automatically
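When test cases already live in a spreadsheet, they can also be converted on the client side into the cases array used by the dataset request above. The sketch below assumes a CSV with `input` and `expected` columns, mirroring the JSON keys; the column layout expected by the platform's own CSV import is not specified in this guide, and the dataset name and description are made up.

```python
import csv
import io

# Example CSV text; in practice this would be read from a file
CSV_TEXT = """input,expected
Hello!,Response should be a friendly greeting
"Where is my order #12345?",Should acknowledge the order number and offer to help
"""


def csv_to_cases(csv_text: str) -> list:
    """Convert CSV rows into the list of case objects used by the datasets API."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"input": row["input"], "expected": row["expected"]} for row in reader]


payload = {
    "name": "support-basics",  # hypothetical dataset name
    "description": "Basic support bot inputs imported from CSV",
    "cases": csv_to_cases(CSV_TEXT),
    "project_id": "your-project-id",
}
print(payload["cases"])
```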
6. AI Dataset Augmentation
Generate additional test cases using AI:
curl -X POST http://localhost:8001/llm-testing/datasets/DATASET_ID/augment \
-H "Content-Type: application/json" \
-d '{
"augmentation_type": "edge_cases",
"count": 10,
"project_id": "your-project-id"
}'
Augmentation types: edge_cases, adversarial, boundary, rephrase.
Review and accept/reject generated cases before they are added to the dataset.
7. Prompt Iterations
A/B test system prompt changes:
curl -X POST http://localhost:8001/llm-testing/prompt-iterations \
-H "Content-Type: application/json" \
-d '{
"spec_name": "customer-support-bot",
"provider_id": "PROVIDER_ID",
"original_prompt": "You are a helpful customer support assistant...",
"modified_prompt": "You are an expert customer support agent for a premium e-commerce brand...",
"project_id": "your-project-id"
}'
Results show scoring comparison between the original and modified prompts.
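An aggregate score can hide individual test cases that got worse. As a sketch of how two runs might be compared case by case, the code below takes pass/fail results keyed by test case name and reports which cases improved or regressed. The input shape and the `diff_runs` helper are assumptions for illustration, not the platform's actual output format.

```python
def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Compare pass/fail per test case between two prompt versions."""
    shared = sorted(baseline.keys() & candidate.keys())
    baseline_passes = sum(baseline[name] for name in shared)
    candidate_passes = sum(candidate[name] for name in shared)
    return {
        "improved": [n for n in shared if not baseline[n] and candidate[n]],
        "regressed": [n for n in shared if baseline[n] and not candidate[n]],
        "pass_rate_delta": (candidate_passes - baseline_passes) / len(shared),
    }


# Hypothetical results for the four test cases in the example spec
original = {"Greeting": True, "Order Status": True, "Refund Request": False, "Off-Topic": False}
modified = {"Greeting": True, "Order Status": False, "Refund Request": True, "Off-Topic": True}

report = diff_runs(original, modified)
print(report)
```

In this made-up example the modified prompt raises the pass rate overall but breaks the Order Status case, which is worth reviewing before adopting the change.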
8. AI Suite Generation
Generate a complete test suite from a system prompt and app description:
curl -X POST http://localhost:8001/llm-testing/generate-suite \
-H "Content-Type: application/json" \
-d '{
"system_prompt": "You are a helpful customer support assistant...",
"app_description": "E-commerce platform selling electronics",
"project_id": "your-project-id"
}'
9. Schedule Automated Runs
Create cron-based schedules for recurring test execution:
curl -X POST http://localhost:8001/llm-testing/schedules \
-H "Content-Type: application/json" \
-d '{
"name": "Daily Support Bot Check",
"spec_name": "customer-support-bot",
"provider_id": "PROVIDER_ID",
"cron_expression": "0 9 * * *",
"project_id": "your-project-id"
}'
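The cron_expression in the example has the five fields of a standard cron expression. The sketch below labels each field so a schedule can be sanity-checked before it is submitted; it assumes the scheduler accepts the standard five-field format and does not validate the field values themselves. `describe_cron` is a hypothetical helper name.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 9 * * *")
print(fields)
```

Here 0 9 * * * means minute 0 of hour 9 on every day. Which time zone applies depends on the scheduler and is not stated in this guide.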
10. Monitor Analytics
Access analytics dashboards:
# Overview stats (URLs are quoted so the shell does not treat "?" as a glob character)
curl "http://localhost:8001/llm-testing/analytics/overview?project_id=your-project-id"
# Performance trends
curl "http://localhost:8001/llm-testing/analytics/trends?project_id=your-project-id"
# Cost tracking
curl "http://localhost:8001/llm-testing/analytics/cost?project_id=your-project-id"
# Regression detection
curl "http://localhost:8001/llm-testing/analytics/regressions?project_id=your-project-id"
# Golden dashboard
curl "http://localhost:8001/llm-testing/analytics/golden?project_id=your-project-id"
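When these URLs are built in a script, the project ID should be URL-encoded rather than concatenated directly. A minimal sketch with the standard library follows; the five view names come from the commands above, and `analytics_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001/llm-testing/analytics"
VIEWS = ("overview", "trends", "cost", "regressions", "golden")


def analytics_url(view: str, project_id: str) -> str:
    """Build an analytics URL with a safely encoded project_id parameter."""
    if view not in VIEWS:
        raise ValueError(f"unknown analytics view: {view!r}")
    return f"{BASE_URL}/{view}?{urlencode({'project_id': project_id})}"


urls = [analytics_url(view, "your-project-id") for view in VIEWS]
for url in urls:
    print(url)
```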
Configuration
Provider pricing is configured per-provider (input/output cost per 1K tokens). Analytics use these values for cost tracking.
No special environment variables are needed beyond the standard AI credentials in .env.
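As a worked example of per-1K pricing, the sketch below estimates the cost of a single call from its token counts, using the prices from the provider registration example (0.005 input, 0.015 output). It assumes cost scales linearly with tokens; the exact formula used by the analytics is not stated in this guide, and `estimate_cost` is a hypothetical helper name.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_1k: float,
    output_cost_per_1k: float,
) -> float:
    """Estimate the cost of one call from token counts and per-1K prices."""
    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (output_tokens / 1000) * output_cost_per_1k
    return input_cost + output_cost


# 1200 input tokens and 300 output tokens at the example prices
cost = estimate_cost(1200, 300, 0.005, 0.015)
print(f"{cost:.4f}")
```

With these numbers the input side contributes 1.2 x 0.005 = 0.006 and the output side 0.3 x 0.015 = 0.0045, for a total of 0.0105. If both prices are left at zero, the estimate is zero regardless of usage.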
API Endpoints Reference
| Method | Path | Description |
|---|---|---|
| POST/GET/PUT/DELETE | /llm-testing/providers | Provider CRUD |
| POST/GET/PUT/DELETE | /llm-testing/specs | Test spec CRUD |
| GET | /llm-testing/specs/{name}/versions | Spec versions |
| POST | /llm-testing/run | Run suite against provider |
| POST | /llm-testing/compare | Compare providers |
| POST | /llm-testing/bulk-run | Batch dataset operations |
| POST | /llm-testing/bulk-compare | Batch comparison |
| POST | /llm-testing/generate-suite | AI suite generation |
| POST/GET/PUT/DELETE | /llm-testing/datasets | Dataset CRUD |
| POST | /llm-testing/datasets/{id}/augment | AI augmentation |
| POST/GET/PUT/DELETE | /llm-testing/schedules | Schedule CRUD |
| GET | /llm-testing/analytics/* | Analytics endpoints |
| POST | /llm-testing/prompt-iterations | A/B prompt testing |
| POST | /llm-testing/specs/{name}/suggest-improvements | AI spec improvements |
Key Files
| Path | Purpose |
|---|---|
| orchestrator/api/llm_testing.py | All endpoints (~3400 lines) |
| orchestrator/workflows/dataset_augmentor.py | AI dataset augmentation |
| orchestrator/api/models_db.py | Database models |
| web/src/app/(dashboard)/llm-testing/ | Frontend (Providers, Specs, Run, Compare, History, Datasets, Analytics, Prompts, Schedules) |
Troubleshooting
| Problem | Solution |
|---|---|
| Provider health check fails | Verify API key and base URL. Check network connectivity. |
| Test cases time out | The provider may be rate-limiting requests. Check the provider's dashboard. |
| Cost tracking shows $0 | Configure input_cost_per_1k and output_cost_per_1k on the provider. |
| Augmentation returns empty results | Check the AI credentials in .env. |
| Schedule not executing | Verify the schedule is enabled and check the output of make prod-logs for scheduler errors. |
| Golden dashboard empty | Mark a dataset as "golden" first, then run tests against it. |
| Regression detection false positives | Adjust the baseline by re-running with the golden dataset. |
Verification
Confirm LLM testing works:
- Provider health check passes after registration
- Running a suite returns scored results for each test case
- Comparison shows a scoring matrix across multiple providers
- Analytics dashboards display trends, costs, and latency data
- Scheduled runs execute and appear in execution history
Related Guides
- Scheduling -- automate LLM test runs
- API Testing -- test the LLM provider's HTTP API directly
- Credential Management -- manage provider API keys
- Extending -- add custom assertion types