Quality Checks
Exit criteria, calibration, generation quality, and A/B experiments.
AutoFlow includes automated quality checkers that verify the system meets its design targets.
Exit Criteria
The ExitCriteriaChecker evaluates long-term improvement by comparing the first half of run history (baseline) against the second half (current):

```shell
python -m autoflow --exit-criteria
```

Targets:
| Criterion | Target | How It's Measured |
|---|---|---|
| PASS rate improvement | >= 10% | Compare baseline vs current PASS rates |
| Edit time decrease | >= 20% | Compare baseline vs current avg estimated_human_edit_minutes |
| Pattern extraction | >= 5 new patterns | Count distinct patterns used in current half |
All three criteria must be met for the check to pass.
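The half-vs-half comparison above can be sketched as follows. This is an illustrative reconstruction, not the actual ExitCriteriaChecker implementation: the Run dataclass and field names are hypothetical, and it assumes the PASS rate target is measured in absolute percentage points.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Run:
    """Hypothetical stand-in for a RunRecord."""
    passed: bool
    edit_minutes: float  # estimated_human_edit_minutes
    pattern: str

def meets_exit_criteria(runs: List[Run]) -> bool:
    # Split history into baseline (first half) and current (second half).
    half = len(runs) // 2
    baseline, current = runs[:half], runs[half:]

    pass_rate = lambda rs: sum(r.passed for r in rs) / len(rs)
    avg_edit = lambda rs: sum(r.edit_minutes for r in rs) / len(rs)

    pass_gain = pass_rate(current) - pass_rate(baseline)  # assumed: absolute points
    edit_drop = (avg_edit(baseline) - avg_edit(current)) / avg_edit(baseline)
    distinct_patterns = len({r.pattern for r in current})

    # All three targets must hold for the check to pass.
    return pass_gain >= 0.10 and edit_drop >= 0.20 and distinct_patterns >= 5
```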
Calibration
The CalibrationChecker measures whether the evaluator agent's correctness scores agree with human rating feedback:
```shell
python -m autoflow --calibration
```

How it works:
- Pairs each RunRecord's evaluator correctness score with the corresponding FeedbackRecord's human rating
- Scores within a configurable tolerance (default: 1 point) count as agreement
- Reports the agreement rate; the target is >= 85%
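The agreement computation amounts to a tolerance check over score pairs. A minimal sketch, assuming scores are simple numbers (the function name and pair representation are illustrative, not the CalibrationChecker's API):

```python
def agreement_rate(pairs, tolerance=1.0):
    """pairs: (evaluator_score, human_rating) tuples from paired records.

    A pair agrees when the two scores are within `tolerance` of each other.
    Returns the fraction of agreeing pairs.
    """
    agree = sum(1 for evaluator, human in pairs if abs(evaluator - human) <= tolerance)
    return agree / len(pairs)
```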
Example output:

```
Calibration Check
Paired records: 47
Agreement rate: 89.4% (target: 85.0%)
Status: PASS
```

Generation Quality
The GenerationQualityChecker validates that workflow generation meets core quality targets:
```shell
python -m autoflow --generation-quality
```

Targets:
| Criterion | Target | What It Tests |
|---|---|---|
| Schema validation | 100% pass | All catalog templates pass JSON Schema + custom rules |
| Import roundtrip | >= 80% success | Export → import cycle preserves workflow integrity |
| Quality score | >= 4.0 average | Average composite quality score across all templates |
Uses the CatalogManager to discover all templates and test all three dimensions.
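Combining the three thresholds is straightforward once per-template results are collected. A hedged sketch (the function and its inputs are hypothetical; the real checker gathers these results via the CatalogManager):

```python
def generation_quality_ok(schema_results, roundtrip_results, quality_scores):
    """schema_results / roundtrip_results: one bool per catalog template;
    quality_scores: one composite score per template."""
    schema_pass = all(schema_results)                                 # target: 100% pass
    roundtrip_rate = sum(roundtrip_results) / len(roundtrip_results)  # target: >= 80%
    avg_quality = sum(quality_scores) / len(quality_scores)           # target: >= 4.0
    return schema_pass and roundtrip_rate >= 0.80 and avg_quality >= 4.0
```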
A/B Experiments
The ExperimentTracker supports A/B testing of prompt and pattern variations:
```python
from autoflow.analytics.experiments import ExperimentTracker

tracker = ExperimentTracker()

# Define an experiment
tracker.create_experiment(
    name="new_classification_prompt",
    description="Testing updated classification prompt template",
    variants=["control", "variant_a"],
)

# Tag runs during execution (via feedback notes).
# Runs with "experiment:new_classification_prompt:variant_a" in notes
# are automatically grouped.

# Analyze results
results = tracker.analyze("new_classification_prompt")
for variant, stats in results.items():
    print(f"{variant}: {stats.pass_rate:.0%} pass, "
          f"{stats.avg_quality:.2f} quality")
```

Experiments use feedback note tagging, so no code changes are required to run variants. Tag feedback notes with the format experiment:{name}:{variant} to assign runs to experiment groups.
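The grouping step implied by the tag format can be sketched with a small parser. This is an illustration of the tagging convention, not the ExperimentTracker's internals; the function name and notes-per-run mapping are assumptions:

```python
import re

# Matches tags of the form experiment:{name}:{variant}.
TAG_RE = re.compile(r"experiment:([\w-]+):([\w-]+)")

def group_runs(notes_by_run, experiment_name):
    """Map variant -> list of run ids, based on tags found in feedback notes."""
    groups = {}
    for run_id, notes in notes_by_run.items():
        for name, variant in TAG_RE.findall(notes):
            if name == experiment_name:
                groups.setdefault(variant, []).append(run_id)
    return groups
```

Untagged runs are simply left out of the analysis, which is what makes the scheme zero-cost for runs outside the experiment.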