The quality bar

Every question, peer-reviewed by two AI models.

Most SAT-prep apps let a single AI model write questions and then grade its own work. That's how you get "AI slop": mathematically wrong questions, broken distractors, off-style stems. We don't. Every question is generated by one model, verified by an independent second model, and only then tagged as ready.

Live distribution

The honest number, right now.

This chart updates from the live database. We refuse to round up. If a question hasn't been through review yet, it shows up as "Pending review" — not silently relabeled.
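
As a rough illustration of how a distribution like this could be computed, the sketch below counts questions per review tier and reports anything without a completed review as "Pending review". The `review_tier` field, the "Peer-reviewed" tier name, and the `tier_distribution` helper are assumptions made for the example, not the production schema.

```python
from collections import Counter

PENDING = "Pending review"

def tier_distribution(questions):
    """Count questions per review tier (hypothetical helper)."""
    counts = Counter()
    for q in questions:
        # A missing or empty tier is counted as pending, never relabeled.
        counts[q.get("review_tier") or PENDING] += 1
    return dict(counts)

# Hypothetical sample rows; real rows would come from the live database.
sample = [
    {"id": 1, "review_tier": "Peer-reviewed"},
    {"id": 2, "review_tier": None},
    {"id": 3},
]
distribution = tier_distribution(sample)
print(distribution)
```

The point of the sketch is that unreviewed questions get their own bucket by default, so nothing is counted as reviewed unless a tier was explicitly recorded.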

How the pipeline works.

1. Model A generates the question
A specialist model takes a College Board skill statement (e.g. 'Linear equations in 1 variable, medium difficulty') and produces a stem, distractors, correct answer, and step-by-step explanation. This is the same kind of model behind the AI tutors competitors ship; for us it is only step 1, not steps 1 and 2.
2. Model B verifies the answer
A second, independent model rechecks the math (for quantitative questions) or the grammar/logic (for R&W). If the verifier solves the question and disagrees with Model A's answer, the question is dropped. If it agrees, the question moves on.
3. A student-persona model evaluates
A third model plays the role of a student attempting the question. It scores the question from 1 to 10 on readability, distractor plausibility, fairness, and 'does this feel like a real SAT question.' Anything under 6/10 is rejected. The score and feedback are stored on the question.
4. CB metadata enrichment
Approved questions get the College Board's full domain/skill metadata attached: frequency on the real test, benchmark correct rate, and reference question count. This is the same data that powers the 'Authority badge' shown to students after they answer.
5. Audio narration synthesis
A separate cron generates step-by-step audio explanations using ElevenLabs voice. Every question gets a coach narration that's mathematically formatted (5x → 'five x', x² → 'x squared') so it sounds natural, not robotic.
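
To make the narration formatting concrete, here is a minimal sketch of how math notation might be turned into speakable text before it goes to a voice engine. The `speak_math` function and its two rules are illustrative assumptions covering only the examples above (a single-digit coefficient and a squared or cubed variable), not the actual narration code.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
POWERS = {"²": "squared", "³": "cubed"}

def speak_math(text):
    """Rewrite simple math notation as speakable words (hypothetical helper)."""
    # x² -> "x squared": a variable followed by a superscript power.
    text = re.sub(r"([a-z])([²³])",
                  lambda m: f"{m.group(1)} {POWERS[m.group(2)]}", text)
    # 5x -> "five x": a single digit directly followed by a one-letter variable.
    text = re.sub(r"\b([0-9])([a-z])\b",
                  lambda m: f"{DIGITS[int(m.group(1))]} {m.group(2)}", text)
    return text

print(speak_math("5x"))
print(speak_math("x²"))
```

Powers are rewritten first so that a term like `5x²` becomes "five x squared" rather than leaving the superscript attached to the variable.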

How this compares to OnePrep, Acely, AlphaTest.

Reviewers consistently flag those tools for AI slop. The pattern is the same: generate with one model, ship without independent verification. We've seen the failure mode and engineered the pipeline specifically to prevent it.

EVERYONE ELSE

One model writes and grades itself.

The quality bar is whatever the writing model agrees with. Hallucinations and bad math slip through routinely; see the 'AI slop' complaints in reviews of OnePrep, Acely, or AlphaTest.

FINISHSTRONG

Two independent reviewers, no self-grading.

Generation by Model A, verification by Model B, student-persona evaluation by Model C. All three must agree before the question reaches a student. The badge on every explanation tells you which review tier the question is at.
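
The gate described here can be sketched in a few lines. This is an assumption-laden illustration, not the production pipeline: the `Review` record, its field names, and `is_ready` are hypothetical. Only the two rules come from the text: the verifier must agree with the generator's answer, and the student-persona score must be at least 6 out of 10.

```python
from dataclasses import dataclass

MIN_PERSONA_SCORE = 6  # anything under 6/10 is rejected

@dataclass
class Review:
    generator_answer: str  # Model A's stated correct answer
    verifier_answer: str   # Model B's independently derived answer
    persona_score: int     # Model C's 1-10 student-persona score

def is_ready(review: Review) -> bool:
    """Return True only if the question passes both review gates."""
    # Gate 1: the verifier must reach the same answer as the generator.
    if review.verifier_answer != review.generator_answer:
        return False
    # Gate 2: the student-persona score must meet the threshold.
    return review.persona_score >= MIN_PERSONA_SCORE

print(is_ready(Review("B", "B", 7)))
print(is_ready(Review("B", "C", 9)))
```

A disagreement at the first gate drops the question regardless of how well it would have scored at the second, which mirrors the order of the steps in the pipeline.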

SEE THE FULL COMPARISON →

Common questions.

What does 'peer-reviewed by two AI models' actually mean?

Each question is generated by one AI model and then independently verified by a second. The verifier rechecks the answer mathematically (for math) or grammatically (for R&W) and grades the question against the College Board's published skill statements. If the second model disagrees, the question is dropped, not shipped.

How do you handle 'AI slop' (questions that feel off)?

The peer-review pipeline catches most of it. Beyond that, every question is also evaluated by a third model playing the role of a student, scoring readability, distractor plausibility, and 'feels like a real SAT question.' Anything below 6/10 is rejected. The 'Pending review' tier on the chart above covers questions the pipeline hasn't reached yet; we'd rather show that count honestly than label them as reviewed.

Why does the chart show some questions as 'pending review'?

The peer-review pipeline came online after the initial corpus was built. Older questions are working their way through retroactively. The number on the chart updates in real time as the queue drains. We refuse to mark them 'reviewed' until they actually are.

How does this compare to other SAT prep apps?

Reviews of OnePrep, Acely, and AlphaTest consistently flag generated questions that are mathematically wrong, factually misleading, or stylistically off, which students call 'AI slop.' Most of those tools let one model write questions and grade its own work. We don't. The peer-review badge on every explanation tells you exactly which review tier each question is at.

Will this scale as the corpus grows?

Yes, it's already running. New questions generated daily (by the 05:00 UTC cron) go through the same two-model peer review before they're tagged as ready. The pipeline produces 8 fresh peer-reviewed questions per day, every day.

See the quality for yourself.

Today's 8-question daily challenge is peer-reviewed and free. No signup. ~5 minutes.

PLAY TODAY'S CHALLENGE →

READ THE SCIENCE
