
Why Built-In Evaluation?

Traditional Testing:
  • Manual testing with ad-hoc prompts
  • Subjective quality checks
  • No baseline for comparisons

Station’s Evaluation:
  • Automated scenario generation
  • LLM-as-judge scoring
  • Business-focused metrics
  • Full execution traces

Via MCP Tools

The easiest way to run evaluations is through your AI assistant:
"Generate 50 test scenarios for the cost-analyzer agent and run them"
Station uses these MCP tools:
Tool                        Purpose
generate_and_test_agent     Generate scenarios + execute tests in one step
batch_execute_agents        Run multiple agents in parallel
evaluate_benchmark          LLM-as-judge scoring on a run
evaluate_dataset            Score entire dataset
list_benchmark_results      View evaluation results
get_benchmark_status        Check async evaluation progress
Reports:
Tool              Purpose
create_report     Define team performance criteria
generate_report   Run benchmarks and generate report
list_reports      List all reports
get_report        Get report details

Via CLI

1. Generate Test Scenarios

Station AI generates diverse test scenarios based on your agent’s purpose:
stn benchmark generate --agent "Cost Analyzer" --count 50
Variation Strategies (passed with --strategy; see the example after this list):
  • comprehensive - Wide range of scenarios (default)
  • edge_cases - Unusual boundary conditions
  • common - Typical real-world cases
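
For example, to probe failure modes before a release, you can generate a smaller, edge-case-focused set (all flags shown here appear in the CLI Reference below):
stn benchmark generate --agent "Cost Analyzer" --count 20 --strategy edge_cases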

2. Execute Tests

Run all scenarios with trace capture:
stn benchmark run --agent "Cost Analyzer"
Results are saved to a timestamped dataset:
~/.config/station/environments/default/datasets/
└── agent-42-20251215-103045/
    ├── dataset.json
    └── runs/

3. Evaluate Quality

LLM-as-judge analyzes each run:
stn benchmark evaluate --dataset agent-42-20251215-103045
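
If you run benchmarks repeatedly, a small shell helper can pick the newest dataset directory and evaluate it. This is a sketch that assumes --dataset also accepts a full directory path (the CLI Reference lists it as --dataset <path>):
# Evaluate the most recently created dataset
LATEST=$(ls -td ~/.config/station/environments/default/datasets/*/ | head -1)
stn benchmark evaluate --dataset "$LATEST"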

Quality Metrics

Metric         Description                  Weight
Accuracy       Correctness of results       30%
Completeness   All aspects addressed        25%
Tool Usage     Appropriate tool selection   20%
Efficiency     Minimal unnecessary steps    15%
Safety         No harmful actions           10%
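
As a sketch of how these weights might combine, the jq expression below computes the weighted sum of per-metric scores from a saved evaluation report. The report.json filename, and whether average_score is derived exactly this way, are assumptions:
# Weighted sum of per-metric scores using the weights above
jq '.metrics.accuracy * 0.30 + .metrics.completeness * 0.25
  + .metrics.tool_usage * 0.20 + .metrics.efficiency * 0.15
  + .metrics.safety * 0.10' report.json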

Evaluation Report

{
  "agent": "Cost Analyzer",
  "scenarios_tested": 50,
  "pass_rate": 0.92,
  "average_score": 8.4,
  "metrics": {
    "accuracy": 8.7,
    "completeness": 8.2,
    "tool_usage": 8.5,
    "efficiency": 8.1,
    "safety": 9.0
  },
  "production_ready": true
}
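
Because the report is plain JSON, it can gate a deployment. A minimal sketch, assuming the report has been saved as report.json (the filename is an assumption, not documented here):
# Fail a CI step when the agent is not marked production ready
if [ "$(jq -r '.production_ready' report.json)" != "true" ]; then
  echo "Evaluation gate failed: agent is not production ready" >&2
  exit 1
fi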

Team Reports

Evaluate multi-agent teams against business goals.

Via MCP

"Create a performance report for my SRE team measuring incident response time and accuracy"
"Generate the SRE team report"

Via CLI

stn report generate --environment default

Report Metrics

Metric         Description
Team Score     Overall 1-10 rating
MTTR           Mean Time to Resolution
Cost Savings   $ identified by agents
Accuracy       vs ground truth
Tool Cost      Execution cost per agent

Example Report Output

Team Performance: 7.5/10

✅ Multi-agent coordination: 8.5/10 - Excellent delegation
✅ Tool utilization: 8.0/10 - Effective use of all tools
✅ Root cause analysis: 7.5/10 - Identifies issues accurately
⚠️ Resolution speed: 7.0/10 - Room for improvement
⚠️ Communication clarity: 6.5/10 - Could be more concise

Viewing Traces

Every evaluation run captures full execution traces:
# Start Jaeger
docker run -d --name jaeger \
  -p 16686:16686 -p 4317:4317 -p 4318:4318 \
  jaegertracing/all-in-one:latest

# View traces at http://localhost:16686
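
If Station exports traces over standard OTLP (an assumption; check your Station telemetry configuration), pointing it at the local collector typically uses the standard OpenTelemetry environment variable:
# Standard OpenTelemetry variable; whether Station reads it is an assumption
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318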

Best Practices

  • Run evaluation before promoting agents to production.
  • Include boundary conditions to find failure modes.
  • Compare scores across agent versions (see the sketch after this list).
  • Manually inspect low-scoring runs to improve prompts.
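
To compare scores across agent versions, you can diff two saved evaluation reports. A sketch, assuming each version's report is kept as JSON (the filenames are hypothetical):
# Compare average scores between two evaluation reports
jq -s '{previous: .[0].average_score, current: .[1].average_score,
        delta: (.[1].average_score - .[0].average_score)}' report-v1.json report-v2.json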

CLI Reference

# Generate scenarios
stn benchmark generate --agent <name> --count <n> --strategy <type>

# Run benchmarks
stn benchmark run --agent <name> [--concurrent <n>]

# Evaluate dataset
stn benchmark evaluate --dataset <path>

# List datasets
stn benchmark list

# Generate team report
stn report generate --environment <name>
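
Putting it together, a typical evaluation pass uses only the commands above (agent name, count, and concurrency values are illustrative):
# Example end-to-end pass
stn benchmark generate --agent "Cost Analyzer" --count 50 --strategy comprehensive
stn benchmark run --agent "Cost Analyzer" --concurrent 5
stn benchmark list
stn benchmark evaluate --dataset <dataset-id-from-list>
stn report generate --environment default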

Next Steps