Quality Checks
Exit criteria, calibration, generation quality, and A/B experiments.
AutoFlow includes automated quality checkers that verify the system meets its design targets.
Exit Criteria
The ExitCriteriaChecker evaluates long-term improvement by comparing the first half of run history (baseline) against the second half (current):

```shell
python -m autoflow --exit-criteria
```

Targets:
| Criterion | Target | How It's Measured |
|---|---|---|
| PASS rate improvement | >= 10% | Compare baseline vs current PASS rates |
| Edit time decrease | >= 20% | Compare baseline vs current avg estimated_human_edit_minutes |
| Pattern extraction | >= 5 new patterns | Count distinct patterns used in current half |
All three criteria must be met for the check to pass.
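The half-vs-half comparison above can be sketched as follows. This is an illustrative reconstruction, not the actual ExitCriteriaChecker implementation: the Run dataclass and field names are hypothetical, and it assumes the PASS rate target is measured in absolute percentage points.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Run:
    """Hypothetical stand-in for a RunRecord."""
    passed: bool
    edit_minutes: float  # estimated_human_edit_minutes
    pattern: str

def meets_exit_criteria(runs: List[Run]) -> bool:
    # Split history into baseline (first half) and current (second half).
    half = len(runs) // 2
    baseline, current = runs[:half], runs[half:]

    pass_rate = lambda rs: sum(r.passed for r in rs) / len(rs)
    avg_edit = lambda rs: sum(r.edit_minutes for r in rs) / len(rs)

    pass_gain = pass_rate(current) - pass_rate(baseline)  # assumed: absolute points
    edit_drop = (avg_edit(baseline) - avg_edit(current)) / avg_edit(baseline)
    distinct_patterns = len({r.pattern for r in current})

    # All three targets must hold for the check to pass.
    return pass_gain >= 0.10 and edit_drop >= 0.20 and distinct_patterns >= 5
```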
Calibration
The CalibrationChecker measures whether the evaluator agent's correctness scores agree with human rating feedback:
```shell
python -m autoflow --calibration
```

How it works:
- Pairs each RunRecord's evaluator correctness score with the corresponding FeedbackRecord's human rating
- Scores within a configurable tolerance (default: 1 point) count as agreement
- Reports the agreement rate; the target is >= 85%
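The agreement computation amounts to a tolerance check over score pairs. A minimal sketch, assuming scores are simple numbers (the function name and pair representation are illustrative, not the CalibrationChecker's API):

```python
def agreement_rate(pairs, tolerance=1.0):
    """pairs: (evaluator_score, human_rating) tuples from paired records.

    A pair agrees when the two scores are within `tolerance` of each other.
    Returns the fraction of agreeing pairs.
    """
    agree = sum(1 for evaluator, human in pairs if abs(evaluator - human) <= tolerance)
    return agree / len(pairs)
```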
Example output:

```
Calibration Check
Paired records: 47
Agreement rate: 89.4% (target: 85.0%)
Status: PASS
```

Generation Quality
The GenerationQualityChecker validates that workflow generation meets core quality targets:
```shell
python -m autoflow --generation-quality
```

Targets:
| Criterion | Target | What It Tests |
|---|---|---|
| Schema validation | 100% pass | All catalog templates pass JSON Schema + custom rules |
| Import roundtrip | >= 80% success | Export → import cycle preserves workflow integrity |
| Quality score | >= 4.0 average | Average composite quality score across all templates |
Uses the CatalogManager to discover all templates and test all three dimensions.
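Combining the three thresholds is straightforward once per-template results are collected. A hedged sketch (the function and its inputs are hypothetical; the real checker gathers these results via the CatalogManager):

```python
def generation_quality_ok(schema_results, roundtrip_results, quality_scores):
    """schema_results / roundtrip_results: one bool per catalog template;
    quality_scores: one composite score per template."""
    schema_pass = all(schema_results)                                 # target: 100% pass
    roundtrip_rate = sum(roundtrip_results) / len(roundtrip_results)  # target: >= 80%
    avg_quality = sum(quality_scores) / len(quality_scores)           # target: >= 4.0
    return schema_pass and roundtrip_rate >= 0.80 and avg_quality >= 4.0
```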
A/B Experiments
The ExperimentTracker supports A/B testing of prompt and pattern variations:
```python
from autoflow.analytics.experiments import ExperimentTracker

tracker = ExperimentTracker()

# Define an experiment
tracker.create_experiment(
    name="new_classification_prompt",
    description="Testing updated classification prompt template",
    variants=["control", "variant_a"],
)

# Tag runs during execution (via feedback notes).
# Runs with "experiment:new_classification_prompt:variant_a" in notes
# are automatically grouped.

# Analyze results
results = tracker.analyze("new_classification_prompt")
for variant, stats in results.items():
    print(f"{variant}: {stats.pass_rate:.0%} pass, "
          f"{stats.avg_quality:.2f} quality")
```

Experiments use feedback note tagging, so no code changes are required to run variants. Tag feedback notes with the format experiment:{name}:{variant} to assign runs to experiment groups.
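The grouping step implied by the tag format can be sketched with a small parser. This is an illustration of the tagging convention, not the ExperimentTracker's internals; the function name and notes-per-run mapping are assumptions:

```python
import re

# Matches tags of the form experiment:{name}:{variant}.
TAG_RE = re.compile(r"experiment:([\w-]+):([\w-]+)")

def group_runs(notes_by_run, experiment_name):
    """Map variant -> list of run ids, based on tags found in feedback notes."""
    groups = {}
    for run_id, notes in notes_by_run.items():
        for name, variant in TAG_RE.findall(notes):
            if name == experiment_name:
                groups.setdefault(variant, []).append(run_id)
    return groups
```

Untagged runs are simply left out of the analysis, which is what makes the scheme zero-cost for runs outside the experiment.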