AutoTuner: The Only A/B Testing Framework That Isolates What Actually Matters
Every autocorrect system has two jobs: find candidate words, then pick the best one. Current testing compares entire pipelines end-to-end — so when one algorithm beats another, you can't tell whether it found better candidates or just ranked them better. You're measuring two variables at once and calling it one answer. AutoTuner fixes this. One independent variable. Clean results.
What Makes This Different
Shared Candidate Pool — For any misspelled input, the system retrieves all plausible corrections once. That identical set gets handed to every ranking algorithm under test. If Algorithm A picks "receive" and Algorithm B picks "relieve" for "recieve," the difference is purely in scoring — not candidate retrieval.
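The shared-pool idea can be sketched in a few lines. This is an illustrative sketch, not AutoTuner's actual API: the function names, the Levenshtein retrieval, and the toy lexicon are all assumptions. The key property is that candidates are retrieved exactly once per input, so every ranker scores the same set and any difference in output is attributable to ranking alone.

```python
# Illustrative sketch of a shared candidate pool (not AutoTuner's real API).
# Retrieval happens ONCE; every ranking algorithm then scores the same set.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def get_candidates(word, lexicon, max_distance=2):
    """Retrieve all plausible corrections -- done once per misspelled input."""
    return sorted(w for w in lexicon if edit_distance(word, w) <= max_distance)

def rank_by_frequency(candidates, freq):
    """One example ranker; others would receive the identical candidate list."""
    return max(candidates, key=lambda w: freq.get(w, 0))

lexicon = {"receive", "relieve", "believe", "recipe"}
freq = {"receive": 300, "relieve": 40, "believe": 120, "recipe": 90}

pool = get_candidates("recieve", lexicon)  # retrieved exactly once
best = rank_by_frequency(pool, freq)       # every ranker sees the same pool
```

Because "relieve" is only one edit away while "receive" is two, a retrieval-confounded test could never tell you whether a win came from the ranker or from the retriever; with a fixed pool, it can.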
Pre-Registered Experiments — Before a test runs, the system locks the hypothesis, significance threshold, stopping rule, and algorithm definitions into a tamper-evident audit ledger. You cannot change your hypothesis after seeing the results without detection.
Cryptographic Audit Trail — Every event is recorded in a tamper-evident, append-only ledger, and independent verification is available on demand.
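A minimal way to get tamper evidence is a hash chain: each ledger entry's hash covers both its payload and the previous entry's hash, so editing any past entry breaks every link after it. The sketch below is an assumption about how such a ledger could work, not AutoTuner's actual format.

```python
# Minimal hash-chain sketch of a tamper-evident, append-only ledger.
# (Illustrative only; AutoTuner's real ledger format is not shown here.)
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def append_event(ledger, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    ledger.append({"event": event, "prev": prev_hash, "hash": entry_hash})

def verify(ledger):
    """Recompute every link; any edited entry breaks the chain from there on."""
    prev_hash = GENESIS
    for entry in ledger:
        payload = json.dumps(entry["event"], sort_keys=True)
        if entry["prev"] != prev_hash:
            return False
        if hashlib.sha256((prev_hash + payload).encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
append_event(ledger, {"type": "preregistration", "alpha": 0.05})
append_event(ledger, {"type": "result", "accuracy": 0.236})
ok_before = verify(ledger)
ledger[0]["event"]["alpha"] = 0.10  # tamper with the pre-registered threshold
ok_after = verify(ledger)
```

This is exactly why moving the significance threshold after the fact is detectable: the rewritten entry no longer matches its recorded hash.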
Real Spelling Errors — Tests run on naturally occurring misspellings from established linguistic corpora — not synthetic typos.
Psycholinguistic Intelligence — Rankings can incorporate laboratory-measured cognitive data — word recognition speed, identification accuracy, visual competitor density.
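One way such measurements could enter a ranking is as weighted features in a combined score. Everything below, including the feature names, weights, and values, is a hypothetical illustration, not AutoTuner's model.

```python
# Hypothetical sketch of folding psycholinguistic measurements into a score.
# Feature names, weights, and values are illustrative assumptions.

def combined_score(word, features, w_freq=0.5, w_rt=0.3, w_acc=0.2):
    f = features[word]
    # Faster recognition is better, so recognition time enters negatively.
    return (w_freq * f["log_frequency"]
            - w_rt * f["recognition_time_ms"] / 1000.0
            + w_acc * f["identification_accuracy"])

features = {
    "receive": {"log_frequency": 4.1, "recognition_time_ms": 520,
                "identification_accuracy": 0.97},
    "relieve": {"log_frequency": 3.2, "recognition_time_ms": 610,
                "identification_accuracy": 0.94},
}

best = max(features, key=lambda w: combined_score(w, features))
```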
Multi-Metric Statistical Rigor — Accuracy, confidence, processing time, and intervention count measured simultaneously with paired-samples t-tests and effect sizes.
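The paired comparison reduces to a textbook computation: take per-stimulus differences between two conditions, then report the paired-samples t statistic and Cohen's d for paired designs (mean difference over the standard deviation of differences). The sketch below uses only the standard library and made-up per-stimulus accuracies; in practice something like `scipy.stats.ttest_rel` would also supply the p-value.

```python
# Sketch of a paired-samples t statistic and Cohen's d on per-stimulus
# accuracy differences. Data values are illustrative, not from the experiment.
import math
from statistics import mean, stdev

def paired_t_and_d(a, b):
    """Return (t, cohens_d) for paired samples a and b of equal length."""
    diffs = [x - y for x, y in zip(a, b)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)               # sample standard deviation (n - 1)
    n = len(diffs)
    t = d_mean / (d_sd / math.sqrt(n))
    cohens_d = d_mean / d_sd          # effect size for paired designs
    return t, cohens_d

# Hypothetical per-stimulus accuracies for two ranking conditions.
alg_a = [0.90, 0.80, 1.00, 0.70, 0.90, 0.85]
alg_b = [0.70, 0.75, 0.90, 0.60, 0.80, 0.80]
t, d = paired_t_and_d(alg_a, alg_b)
```

Because the same stimuli feed both conditions, the paired design cancels per-stimulus difficulty, which is the statistical mirror of the shared candidate pool.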
Empirical Validation: 500-Stimulus Experiment
Three ranking conditions were tested against the same shared candidate pool. Frequency-based ranking achieved 0.236 mean accuracy. Psycholinguistic-only ranking: 0.162 (p < 0.001, d = 0.197). The combined model: 0.190 (p = 0.001, d = 0.145). The frequency-based ranker won, and that's the point: a testing system that only produces positive results is worthless. This one tells the truth.
Who This Is For
Product teams building autocorrect — Isolate the ranking question when evaluating new models.
Researchers evaluating algorithms — Publish with pre-registered hypotheses and verifiable audit trails.
Compliance-sensitive organizations — Proof of experimental integrity with timestamped, independently verifiable records.
Anyone tired of confounded A/B tests — Measure one thing at a time.