Memory Machines — can LLMs make flashcards that last?

Read the report: the full research writeup, covering evaluation, training, grounding, and arena results.

Memory systems make memory a choice, but only if you write practice prompts that effectively reinforce the ideas you want to keep. Writing those prompts often demands more effort or skill than users can muster, and prompts can’t easily evolve or deepen over time.

Could we make memory as effortless as using a highlighter? We explored whether LLMs could convert casual highlights into useful memory prompts. We found that models can usually identify the intent of highlights, but struggle to generate prompts that will survive long-term review.

The Problem

Unlike a typical textbook practice problem, a memory prompt must cue the same retrieval over time: today, then again in three months, then again in a year. Ambiguous or imprecise prompts produce frustration and forgetting, while stable prompts compound. To study this problem, we constructed a dataset of ~1,500 highlight-anchored prompts across 93 sources and labeled their quality with a four-tiered system.

The Boundary

Across every setup, models reliably reject prompts that are clearly off-target (T0). But they struggle with construction quality—specifically, they can’t reliably distinguish prompts that look plausible but degrade in review (T1) from those that need polish but hold up over time (T2). That’s a problem because T1 prompts are expensive failures. They often seem fine at a glance, only to produce frustration and confusion in later review.
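The tiers described above can be written down as a small data structure. This is only an illustrative sketch: the summary names T0, T1, and T2 within a four-tier system, so the name `T3` for the top tier, the `is_usable` helper, and the placement of the usable boundary between T1 and T2 are assumptions, not definitions taken from the report.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Quality tiers for a highlight-anchored memory prompt."""
    T0 = 0  # clearly off-target
    T1 = 1  # looks plausible but degrades in review
    T2 = 2  # needs polish but holds up over time
    T3 = 3  # assumed name for the top tier; not described in this summary


def is_usable(tier: Tier) -> bool:
    # Assumption: the boundary models struggle with (T1 vs. T2)
    # is also the boundary between unusable and usable prompts.
    return tier >= Tier.T2
```

Under this reading, a judge that confuses T1 with T2 is wrong exactly at the point where the usable/unusable decision is made.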

We saw the same pattern across elaborate prompting, rubric-based evaluation, preference learning, fine-tuning, and reinforcement learning: the targeting signal responds to optimization; the construction signal does not.

Grounding

Despite that inconsistent judgment, we found a way to approximately evaluate models. Instead of classifying a prompt in absolute terms, a grounded judge compares it to labeled prompts written for the same highlight. This relative approach doesn’t need the model to fully internalize our taste; the standard is fixed in the reference set instead. Under grounding, precision rises from 56% to 78%, while recall falls modestly from 86% to 77%. The judge becomes more conservative, but the prompts it approves are far more likely to be usable.
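A minimal sketch of the idea, under stated assumptions: the decision rule in `grounded_approve` (beat every unusable reference and match at least one usable one) and the `prefers` comparator are hypothetical stand-ins, since this summary does not spell out how the judge combines its comparisons. In practice `prefers` would be a model call; `precision_recall` is the standard definition applied to approve/reject decisions.

```python
from typing import Callable, Sequence, Tuple

# (prompt text, human label: True if the prompt is usable)
Reference = Tuple[str, bool]


def grounded_approve(
    candidate: str,
    references: Sequence[Reference],
    prefers: Callable[[str, str], bool],
) -> bool:
    """Approve a candidate by comparing it to labeled prompts for the same highlight.

    `prefers(a, b)` is a hypothetical comparator that returns True when
    prompt `a` is judged at least as good as prompt `b`.
    """
    usable = [text for text, ok in references if ok]
    unusable = [text for text, ok in references if not ok]
    # Assumed rule: the candidate must beat every known failure
    # and hold its own against at least one known success.
    beats_failures = all(prefers(candidate, ref) for ref in unusable)
    matches_a_success = any(prefers(candidate, ref) for ref in usable)
    return beats_failures and matches_a_success


def precision_recall(
    approved: Sequence[bool], usable: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of judge approvals against human usability labels."""
    tp = sum(1 for a, u in zip(approved, usable) if a and u)
    fp = sum(1 for a, u in zip(approved, usable) if a and not u)
    fn = sum(1 for a, u in zip(approved, usable) if u and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision is the share of approved prompts that are actually usable; recall is the share of usable prompts that get approved. A stricter judge trades some recall for precision, which is the direction of the shift reported above.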

Arena Results

The grounded judge rated every model’s output on highlights drawn from the dataset; those ratings aggregate into the Elo rankings below. GPT-5.2 ranks first, yet still produces unusable prompts ~36% of the time.

19 models scored by a grounded judge on 200 highlights, with cost-sensitive Elo ratings that penalize plausible-looking failures more heavily than obvious ones.

#	Model	Elo	Usable %	Win Rate
1	GPT-5.2	1630.1	64.3%	53.6%
2	Claude Sonnet 5 (post-report)	1592.5	58.0%	50.6%
3	OpenAI o3	1587.3	54.3%	47.3%
4	GPT-5.5 (post-report)	1578.9	55.2%	47.1%
5	Gemini 2.5 Pro	1565.7	52.1%	46.5%
6	Claude Opus 4.6	1541.0	52.8%	45.3%
7	Claude Opus 4.5	1536.9	49.8%	44.9%
8	Gemini 3.1 Pro	1532.8	51.5%	41.9%
9	Gemini 3 Pro	1525.9	54.6%	46.2%
10	Claude Fable 5 (post-report)	1498.1	47.4%	43.3%
11	GPT-4.1	1486.4	41.0%	34.9%
12	Claude Sonnet 4.5	1482.6	48.3%	41.8%
13	Claude Sonnet 3.7	1478.4	44.2%	37.7%
14	Claude Opus 4.8 (post-report)	1471.3	50.7%	40.0%
15	GPT-5.4	1460.6	46.3%	37.4%
16	Claude Opus 4.7	1454.5	47.4%	40.9%
17	Gemini 3 Flash	1443.3	45.6%	37.9%
18	GPT-4o	1424.5	29.6%	27.9%
19	Claude Haiku 3	1389.9	26.4%	27.1%
20	Gemini 2.5 Pro (QA) (control)	1319.1	24.6%	22.5%
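For intuition about how a cost-sensitive Elo rating like the one behind the table above might work, here is a hedged sketch. The expected-score formula is standard Elo; scaling the update by a `loser_cost` weight, the default `k`, and any specific weight values are assumptions for illustration, since this summary does not give the report's actual weighting.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def cost_sensitive_update(
    r_winner: float,
    r_loser: float,
    loser_cost: float = 1.0,
    k: float = 16.0,
) -> tuple:
    """Return updated (winner, loser) ratings after one pairwise comparison.

    `loser_cost` is a hypothetical multiplier: set it above 1.0 when the
    losing output is a plausible-looking failure, so that kind of loss
    moves ratings more than an obviously off-target one.
    """
    delta = k * loser_cost * (1.0 - expected_score(r_winner, r_loser))
    # The update stays zero-sum: what the winner gains, the loser gives up.
    return r_winner + delta, r_loser - delta
```

With two models at equal ratings, a loss weighted at `loser_cost=2.0` shifts each rating twice as far as a loss weighted at `1.0`, which is one simple way to make deceptive failures count for more.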
