
Quality Checks

Exit criteria, calibration, generation quality, and A/B experiments.

AutoFlow includes automated quality checkers that verify the system meets its design targets.

Exit Criteria

The ExitCriteriaChecker evaluates long-term improvement by comparing the first half of run history (baseline) against the second half (current):

python -m autoflow --exit-criteria

Targets:

| Criterion | Target | How It's Measured |
| --- | --- | --- |
| PASS rate improvement | >= 10% | Compare baseline vs current PASS rates |
| Edit time decrease | >= 20% | Compare baseline vs current average estimated_human_edit_minutes |
| Pattern extraction | >= 5 new patterns | Count distinct patterns used in the current half |

All three criteria must be met for the check to pass.
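The halves comparison can be sketched as follows. This is a minimal illustration, not the actual ExitCriteriaChecker implementation: the RunRecord fields shown (passed, patterns) and the interpretation of ">= 10%" as an absolute difference in PASS rates are assumptions.

```python
# Sketch of the exit-criteria check: split run history into halves and
# compare. Field names besides estimated_human_edit_minutes are assumed.
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    passed: bool
    estimated_human_edit_minutes: float
    patterns: list = field(default_factory=list)

def check_exit_criteria(history):
    mid = len(history) // 2
    baseline, current = history[:mid], history[mid:]

    def pass_rate(runs):
        return sum(r.passed for r in runs) / len(runs)

    def avg_minutes(runs):
        return sum(r.estimated_human_edit_minutes for r in runs) / len(runs)

    # Assumption: "improvement >= 10%" is an absolute gain in PASS rate.
    pass_gain = pass_rate(current) - pass_rate(baseline)
    edit_drop = 1 - avg_minutes(current) / avg_minutes(baseline)
    current_patterns = {p for r in current for p in r.patterns}

    return (pass_gain >= 0.10            # PASS rate improvement
            and edit_drop >= 0.20        # edit time decrease
            and len(current_patterns) >= 5)  # distinct patterns in current half
```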

Calibration

The CalibrationChecker measures whether the evaluator agent's correctness scores agree with human rating feedback:

python -m autoflow --calibration

How it works:

  1. Pairs each RunRecord's evaluator correctness score with the corresponding FeedbackRecord's human rating
  2. Scores within a configurable tolerance (default: 1 point) count as agreement
  3. Reports agreement rate — target is >= 85%

Example output:

Calibration Check
  Paired records: 47
  Agreement rate: 89.4% (target: 85.0%)
  Status: PASS
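The pairing-and-tolerance logic amounts to a simple comparison per record. A minimal sketch, assuming pairs have already been extracted from RunRecord and FeedbackRecord; the real CalibrationChecker API may differ:

```python
# Sketch: count score pairs that agree within a tolerance.
def calibration_agreement(pairs, tolerance=1.0):
    """pairs: list of (evaluator_score, human_rating) tuples."""
    if not pairs:
        return 0.0
    agreed = sum(abs(evaluator - human) <= tolerance
                 for evaluator, human in pairs)
    return agreed / len(pairs)

rate = calibration_agreement([(4, 4), (3, 5), (5, 4)])
# (4,4) and (5,4) agree within 1 point; (3,5) does not -> 2/3
```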

Generation Quality

The GenerationQualityChecker validates that workflow generation meets core quality targets:

python -m autoflow --generation-quality

Targets:

| Criterion | Target | What It Tests |
| --- | --- | --- |
| Schema validation | 100% pass | All catalog templates pass JSON Schema + custom rules |
| Import roundtrip | >= 80% success | Export → import cycle preserves workflow integrity |
| Quality score | >= 4.0 average | Average composite quality score across all templates |

The checker uses the CatalogManager to discover all templates and tests each one against all three dimensions.
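Aggregating the three targets can be sketched as below. The validate_schema, roundtrip_ok, and quality_score helpers are hypothetical stand-ins for the checker's real internals:

```python
# Sketch: combine the three generation-quality targets over all templates.
# The three callables are hypothetical, passed in for illustration.
def generation_quality(templates, validate_schema, roundtrip_ok, quality_score):
    schema_pass = all(validate_schema(t) for t in templates)
    roundtrip_rate = sum(roundtrip_ok(t) for t in templates) / len(templates)
    avg_quality = sum(quality_score(t) for t in templates) / len(templates)
    return (schema_pass                 # 100% schema validation
            and roundtrip_rate >= 0.80  # >= 80% import roundtrip success
            and avg_quality >= 4.0)     # >= 4.0 average quality score
```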

A/B Experiments

The ExperimentTracker supports A/B testing of prompt and pattern variations:

from autoflow.analytics.experiments import ExperimentTracker

tracker = ExperimentTracker()

# Define an experiment
tracker.create_experiment(
    name="new_classification_prompt",
    description="Testing updated classification prompt template",
    variants=["control", "variant_a"]
)

# Tag runs during execution (via feedback notes)
# Runs with "experiment:new_classification_prompt:variant_a" in notes
# are automatically grouped

# Analyze results
results = tracker.analyze("new_classification_prompt")
for variant, stats in results.items():
    print(f"{variant}: {stats.pass_rate:.0%} pass, "
          f"{stats.avg_quality:.2f} quality")

Experiments use feedback note tagging, so no code changes are required to run variants. Tag feedback notes with the format experiment:{name}:{variant} to assign runs to experiment groups.
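Grouping runs by tag is a matter of parsing that format out of each note. A minimal sketch under the assumption that names and variants use word characters and hyphens; this is not the actual ExperimentTracker grouping code:

```python
# Sketch: group run IDs into variants by parsing experiment:{name}:{variant}
# tags from feedback notes. group_by_variant is a hypothetical helper.
import re
from collections import defaultdict

TAG_RE = re.compile(r"experiment:([\w-]+):([\w-]+)")

def group_by_variant(feedback_notes, experiment_name):
    """feedback_notes: mapping of run_id -> free-text feedback note."""
    groups = defaultdict(list)
    for run_id, note in feedback_notes.items():
        match = TAG_RE.search(note)
        if match and match.group(1) == experiment_name:
            groups[match.group(2)].append(run_id)
    return dict(groups)
```

Tags for other experiments and untagged notes are simply skipped, so one run history can carry several concurrent experiments.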
