A three-arm randomized controlled trial isolating the effect of expert human tutoring dosage on standardized math achievement.
Bloom's 1984 finding is one of the most cited results in education: one-on-one tutoring moved students from the 50th to the 98th percentile. Forty years later, no one has properly replicated it.
The tutors got worse. Replication attempts operated under severe cost constraints. Saga Education, the replication most often cited for tutor quality, paid tutors $25,500 a year, trained them primarily through self-serve digital modules, and had each tutor work with two students at a time.
The assessments changed. Bloom introduced content novel to students. Later studies moved to standardized measures like MAP, requiring much longer interventions to show meaningful impact.
The designs got muddled. Subsequent replications introduced confounds — varying dosage and tutor quality simultaneously, swapping individuals for small groups, or using less-trained tutors with a fixed curriculum.
We have mapped the floor of tutoring effectiveness. The ceiling remains unknown.
A three-arm RCT that isolates dosage while holding tutor quality constant. All tutoring is virtual, one-on-one, delivered by expert instructors trained on the highest-effect interventions in the research literature.
| | Daily | Weekly | Control |
|---|---|---|---|
| Tutor | Expert instructor | Expert instructor | None |
| Dosage | Daily 1-on-1 (45 min) | Weekly 1-on-1 (45 min) | Standard instruction |
| Duration | Two semesters | Two semesters | Two semesters |
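Students are assigned to the three arms by lottery. As a sketch only (the proposal does not specify the randomization procedure, and a real implementation would likely stratify by grade), a shuffle-and-deal assignment yields exactly balanced arms:

```python
import random

def assign_arms(student_ids, arms=("daily", "weekly", "control"), seed=0):
    """Randomly assign students to arms (illustrative, not the study's protocol).

    Shuffles the roster, then deals students round-robin so that arm
    sizes differ by at most one.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible assignment
    ids = list(student_ids)
    rng.shuffle(ids)
    return {sid: arms[i % len(arms)] for i, sid in enumerate(ids)}
```

With N=150 and three arms, this produces exactly 50 students per arm.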
| Parameter | Detail |
|---|---|
| Outcome | MAP Growth (nationally normed, 11M+ students) |
| Sample | Grades 2–4, N=150, enrolled via lottery |
| Power | 82% to detect 4 RIT points (~0.5 SD) |
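The stated power figure can be roughly reproduced with a normal-approximation calculation, under assumptions the proposal does not spell out: 50 students per arm, a two-sided 5% test, and analysis adjusting for baseline MAP scores (ANCOVA) with an assumed pre/post correlation of about 0.5. Both the correlation and the analytic model are assumptions here, not facts from the proposal:

```python
from math import erf, sqrt

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_arm(d, n_per_arm, baseline_r=0.5, alpha_z=1.959964):
    """Approximate power for a two-arm comparison with a baseline covariate.

    Adjusting for a baseline covariate correlated r with the outcome
    shrinks the residual SD by sqrt(1 - r^2), inflating the detectable
    standardized effect accordingly. All parameter values are assumptions.
    """
    d_adj = d / sqrt(1.0 - baseline_r ** 2)
    ncp = d_adj * sqrt(n_per_arm / 2.0)   # noncentrality of the two-sample z test
    return normal_cdf(ncp - alpha_z)

# d = 0.5 SD (~4 RIT points), 50 per arm, assumed baseline r = 0.5
print(round(power_two_arm(0.5, 50), 2))  # → 0.82
```

Without the covariate adjustment (baseline_r=0), the same design yields roughly 70% power, so the 82% figure is consistent with an analysis that leverages baseline scores.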
A typical session might proceed: review prior work (5 min), extend a concept at the student's frontier (15 min), guided practice with feedback (15 min), assessment or reflection (10 min). But this is illustrative, not prescriptive. A tutor who decides a student needs 45 minutes on a single problem type is free to do that.
AI tutoring is the most active frontier in education technology. But the quality of an AI tutor can only be measured against a clear picture of what great human tutoring looks like. That picture does not exist. This study produces it: the first rigorous benchmark for what elite human tutoring achieves on a nationally normed assessment.
Tutors are experienced educators with at least three years of classroom or intensive tutoring experience in elementary mathematics and demonstrated records of student growth. We expect to hire 12–14 tutors for the daily arm, each serving 4–5 students on staggered schedules, and 3–4 for the weekly arm.
All tutors complete a two-week (40-hour) training program designed with leading education researchers, covering: the evidence base for each intervention strategy; diagnostic use of MAP data; the structured decision-capture protocol; and supervised practice sessions with feedback.
Three streams:

1. MAP Growth RIT scores. Nationally normed, computer-adaptive assessments administered at baseline, mid-year, and endline.
2. Session recordings and interaction logs. Browser-level data capturing every problem attempted, resource used, and tool opened.
3. Tutor decision journals. Structured entries completed after each session: the student's state, what the tutor chose to focus on, what worked, and what they plan to adjust.
Together these reconstruct not just whether expert tutoring works, but how — the micro-decisions that constitute expert instruction.
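The proposal names the journal categories but not a schema. One plausible shape for a structured entry, with hypothetical field names chosen for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DecisionJournalEntry:
    """One entry per session. Field names are illustrative,
    not the study's actual decision-capture protocol."""
    session_id: str
    student_state: str                # how the student presented today
    focus: str                        # what the tutor chose to work on, and why
    what_worked: List[str] = field(default_factory=list)
    planned_adjustments: List[str] = field(default_factory=list)
```

A structured record like this, joined against session logs and RIT scores, is what would let analysts trace tutor decisions to measured growth.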
Phase 1: Virtual study. This pilot is self-contained and answers its own research question — does expert tutoring produce large effects on standardized measures?
Phase 2: In-school RCT. If Phase 1 confirms large effects, the second phase compares tutored students directly against classroom instruction in physical schools.
Independent value. The pilot also produces tutor training protocols, data infrastructure, and a detailed corpus of expert tutoring behavior — a behavioral benchmark for training and evaluating AI tutoring systems.
| Phase | Duration | Activities |
|---|---|---|
| Preparation | Months 1–3 | IRB, researcher partnerships, tutor recruitment, training design |
| Training | Month 4 | Tutor training, lottery, consent, baseline MAP, randomization |
| Semester 1 | Months 5–10 | ~15 weeks tutoring; mid-year MAP |
| Semester 2 | Months 11–16 | ~15 weeks tutoring |
| Analysis | Months 16–18 | Endline MAP, data cleaning, analysis, manuscript |
We're looking for experienced math educators (grades 2–4) with at least 3 years of classroom or intensive tutoring experience and demonstrated records of student growth.