Why Built-In Evaluation?
Traditional Testing:
- Manual testing with ad-hoc prompts
- Subjective quality checks
- No baseline for comparisons

Station's Built-In Evaluation:
- Automated scenario generation
- LLM-as-judge scoring
- Business-focused metrics
- Full execution traces
Via MCP Tools
The easiest way to run evaluations is through your AI assistant:

| Tool | Purpose |
|---|---|
| `generate_and_test_agent` | Generate scenarios + execute tests in one step |
| `batch_execute_agents` | Run multiple agents in parallel |
| `evaluate_benchmark` | LLM-as-judge scoring on a run |
| `evaluate_dataset` | Score an entire dataset |
| `list_benchmark_results` | View evaluation results |
| `get_benchmark_status` | Check async evaluation progress |
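
If you want to drive the same tools programmatically rather than through an assistant, a minimal sketch with the official MCP Python client looks like this. The server launch command, agent name, and argument keys below are assumptions, not documented values; check the tool schemas your Station server actually advertises.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Assumption: the Station MCP server is started with `stn mcp`;
    # replace with however you actually launch it.
    server = StdioServerParameters(command="stn", args=["mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Tool names come from the table above; argument keys are assumptions.
            run = await session.call_tool(
                "generate_and_test_agent",
                {"agent_name": "my-agent", "scenario_type": "comprehensive"},
            )

            evaluation = await session.call_tool(
                "evaluate_benchmark",
                {"run_id": "<run-id-from-previous-result>"},
            )

            status = await session.call_tool(
                "get_benchmark_status",
                {"benchmark_id": "<benchmark-id-from-previous-result>"},
            )
            print(run, evaluation, status, sep="\n")

asyncio.run(main())
```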

For team reports:

| Tool | Purpose |
|---|---|
| `create_report` | Define team performance criteria |
| `generate_report` | Run benchmarks and generate report |
| `list_reports` | List all reports |
| `get_report` | Get report details |
Via CLI
1. Generate Test Scenarios
Station AI generates diverse test scenarios based on your agent's purpose:

- `comprehensive` - Wide range of scenarios (default)
- `edge_cases` - Unusual boundary conditions
- `common` - Typical real-world cases
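
For illustration only, the scenario type is the knob you pass alongside the agent under test when generating scenarios; the keys and agent name below are assumptions, not a documented schema.

```python
# Hypothetical generate_and_test_agent payloads; keys and agent name are illustrative.
scenario_requests = [
    {"agent_name": "cost-analyzer", "scenario_type": "comprehensive"},  # wide range (default)
    {"agent_name": "cost-analyzer", "scenario_type": "edge_cases"},     # boundary conditions
    {"agent_name": "cost-analyzer", "scenario_type": "common"},         # typical real-world cases
]
```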
2. Execute Tests
Run all scenarios with trace capture.

3. Evaluate Quality
LLM-as-judge analyzes each run.

Quality Metrics
| Metric | Description | Weight |
|---|---|---|
| Accuracy | Correctness of results | 30% |
| Completeness | All aspects addressed | 25% |
| Tool Usage | Appropriate tool selection | 20% |
| Efficiency | Minimal unnecessary steps | 15% |
| Safety | No harmful actions | 10% |
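
To make the weighting concrete, a run's per-metric scores (assuming a 0-10 scale) roll up into a single quality score as a weighted average, sketched below; the exact scale and aggregation Station uses may differ.

```python
# Weights taken from the quality-metrics table above.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "tool_usage": 0.20,
    "efficiency": 0.15,
    "safety": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores (0-10 scale assumed)."""
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)

# Example run: strong accuracy and safety, weaker efficiency.
print(overall_score({
    "accuracy": 9.0,
    "completeness": 8.0,
    "tool_usage": 7.5,
    "efficiency": 6.0,
    "safety": 10.0,
}))  # ≈ 8.1
```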
Evaluation Report
Team Reports
Evaluate multi-agent teams against business goals.

Via MCP
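
A minimal Python sketch of the same flow through the report tools, using the MCP client SDK; as with the earlier sketch, the launch command and argument keys are assumptions.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Assumption: same Station MCP server launch command as above.
    server = StdioServerParameters(command="stn", args=["mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Tool names from the report-tools table; argument keys are assumptions.
            report = await session.call_tool(
                "create_report",
                {
                    "name": "incident-response-team",
                    "criteria": ["MTTR under 30 minutes", "accuracy above 90%"],
                },
            )
            await session.call_tool(
                "generate_report",
                {"report_id": "<id-from-create_report-result>"},
            )
            details = await session.call_tool(
                "get_report",
                {"report_id": "<id-from-create_report-result>"},
            )
            print(report, details, sep="\n")

asyncio.run(main())
```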
Via CLI
Report Metrics
| Metric | Description |
|---|---|
| Team Score | Overall 1-10 rating |
| MTTR | Mean Time to Resolution |
| Cost Savings | Dollar savings identified by agents |
| Accuracy | Measured against ground truth |
| Tool Cost | Execution cost per agent |
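
For clarity on the time metric: MTTR is the average elapsed time from an incident being opened to the agent team resolving it. A minimal illustration (not Station's internal calculation):

```python
from datetime import datetime, timedelta

# Hypothetical resolved incidents: (opened_at, resolved_at).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 24)),
    (datetime(2024, 1, 2, 14, 5), datetime(2024, 1, 2, 14, 41)),
    (datetime(2024, 1, 3, 8, 30), datetime(2024, 1, 3, 8, 48)),
]

def mttr(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time to Resolution: average of (resolved - opened)."""
    total = sum(((end - start) for start, end in pairs), timedelta())
    return total / len(pairs)

print(mttr(incidents))  # 0:26:00
```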
Example Report Output
Viewing Traces
Every evaluation run captures full execution traces.

Best Practices
- Test Before Deploy: Run evaluation before promoting agents to production.
- Use Edge Cases: Include boundary conditions to find failure modes.
- Track Over Time: Compare scores across agent versions.
- Review Failures: Manually inspect low-scoring runs to improve prompts.

