Memory Machines

Can LLMs create lasting flashcards from readers' highlights?

Read the report: the full research writeup (evaluation, training, grounding, and arena results)

Memory systems make memory a choice, but only if you write practice prompts that effectively reinforce the ideas you want to keep. That often demands more effort or skill than users can muster, and prompts can’t easily evolve or deepen over time.

Could we make memory as effortless as using a highlighter? We explored whether LLMs could convert casual highlights into useful memory prompts. We found that models can usually identify the intent of highlights, but struggle to generate prompts that will survive long-term review.

The Problem

Unlike a typical textbook practice problem, a memory prompt must cue the same retrieval over time: today, then again in three months, then again in a year. Ambiguous or imprecise prompts produce frustration and forgetting, while stable prompts compound. To study this problem, we constructed a dataset of ~1,500 highlight-anchored prompts across 93 sources and labeled their quality with a four-tiered system.

The Boundary

Across every setup, models reliably reject prompts that are clearly off-target (T0). But they struggle with construction quality—specifically, they can’t reliably distinguish prompts that look plausible but degrade in review (T1) from those that need polish but hold up over time (T2). That’s a problem because T1 prompts are expensive failures. They often seem fine at a glance, only to produce frustration and confusion in later review.

We saw the same pattern across elaborate prompting, rubric-based evaluation, preference learning, fine-tuning, and reinforcement learning: the targeting signal responds to optimization; the construction signal does not.
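To make the tier boundary concrete, here is a minimal Python sketch of the labels as described above. The names `Tier`, `is_usable`, and `is_hard_boundary` are hypothetical, the description of T3 is an assumption (this summary only describes T0 through T2), and treating T2 and above as "usable" is likewise an assumption rather than the report's stated definition.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Four-tier quality label for a highlight-anchored prompt."""
    T0 = 0  # clearly off-target
    T1 = 1  # looks plausible, but degrades in review
    T2 = 2  # needs polish, but holds up over time
    T3 = 3  # ASSUMPTION: top tier; not described in this summary


def is_usable(tier: Tier) -> bool:
    # ASSUMPTION: "usable" means T2 or better.
    return tier >= Tier.T2


def is_hard_boundary(a: Tier, b: Tier) -> bool:
    # The pair that models reportedly fail to separate: T1 versus T2.
    return {a, b} == {Tier.T1, Tier.T2}
```

Under these assumptions, the targeting signal corresponds to separating T0 from everything else, while the construction signal corresponds to the T1/T2 split that `is_hard_boundary` picks out.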

Grounding

Despite that inconsistent judgment, we found a way to approximately evaluate models. Instead of classifying a prompt in absolute terms, a grounded judge compares it to labeled prompts from the same highlight. This relative approach doesn’t need the model to fully internalize our taste; the standard is fixed in the reference set instead. Under grounding, precision rises from 56% to 78%, while recall falls modestly from 86% to 77%. The judge becomes more conservative, but the prompts it approves are far more likely to be usable.
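For readers who want the precision and recall figures spelled out, the sketch below shows how they are conventionally computed for a judge that approves or rejects prompts. It assumes a binary usable/unusable human label per prompt; the `Judgment` type and the toy data are hypothetical and are not drawn from the report's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    approved: bool  # the judge marked the prompt as usable
    usable: bool    # the human label marks the prompt as usable


def precision_recall(judgments):
    """Return (precision, recall) of the judge's approvals."""
    tp = sum(1 for j in judgments if j.approved and j.usable)
    fp = sum(1 for j in judgments if j.approved and not j.usable)
    fn = sum(1 for j in judgments if not j.approved and j.usable)
    # Precision: of the prompts the judge approved, how many are usable.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the usable prompts, how many the judge approved.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 3 correct approvals, 1 wrong approval, 1 missed usable prompt.
toy = (
    [Judgment(True, True)] * 3
    + [Judgment(True, False)]
    + [Judgment(False, True)]
    + [Judgment(False, False)] * 2
)
precision, recall = precision_recall(toy)
```

A more conservative judge approves fewer prompts, which tends to raise precision and lower recall. That is the direction of the trade-off reported under grounding.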

Arena Results

Each model generated prompts for highlights drawn from the dataset, and the grounded judge rated every output; those ratings aggregate into the Elo rankings below. GPT-5.2 ranks first, yet still produces unusable prompts ~36% of the time.

16 models scored by a grounded judge on 200 highlights, with cost-sensitive Elo ratings that penalize plausible-looking failures more heavily than obvious ones.

#   Model                          Elo     Usable %  Win Rate
1   GPT-5.2                        1630.0  64.3%     53.3%
2   OpenAI o3                      1588.0  54.3%     47.1%
3   GPT-5.5 (post-report)          1579.3  55.2%     47.0%
4   Gemini 2.5 Pro                 1564.7  52.1%     46.0%
5   Claude Opus 4.6                1544.0  52.8%     45.4%
6   Claude Opus 4.5                1542.2  49.8%     44.8%
7   Gemini 3.1 Pro                 1532.9  51.5%     42.0%
8   Gemini 3 Pro                   1529.6  54.6%     46.4%
9   Claude Sonnet 4.5              1490.2  48.3%     42.0%
10  GPT-4.1                        1482.6  41.0%     34.4%
11  Claude Sonnet 3.7              1481.2  44.2%     37.1%
12  GPT-5.4                        1463.3  46.3%     37.7%
13  Claude Opus 4.7                1462.7  47.4%     41.7%
14  Gemini 3 Flash                 1451.7  45.6%     38.2%
15  GPT-4o                         1427.8  29.6%     27.1%
16  Claude Haiku 3                 1401.3  26.4%     26.9%
17  Gemini 2.5 Pro (QA) (control)  1328.4  24.6%     22.5%
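As a rough illustration of what a cost-sensitive Elo rating could look like, the sketch below scales the standard Elo update by a weight that depends on how the losing prompt failed. This is not the report's scheme: the weights in `LOSS_WEIGHT`, the K-factor, and the function names are all hypothetical, chosen only to show a plausible-looking failure (T1) costing more than an obvious one (T0).

```python
# ASSUMPTION: illustrative weights only; the report's actual costs are not given here.
LOSS_WEIGHT = {"T0": 1.0, "T1": 2.0}


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def cost_sensitive_update(r_winner, r_loser, loser_tier, k=16.0):
    """Return updated (winner, loser) ratings after one comparison.

    The usual Elo step k * (1 - expected) is multiplied by a weight
    that grows when the losing prompt is a plausible-looking failure.
    """
    weight = LOSS_WEIGHT.get(loser_tier, 1.0)
    delta = k * weight * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta


# Two equally rated models: losing with a T1 prompt moves ratings twice as far.
obvious = cost_sensitive_update(1500.0, 1500.0, "T0")
plausible = cost_sensitive_update(1500.0, 1500.0, "T1")
```

With these toy weights, a model that often loses with T1 prompts drops faster than one that loses with T0 prompts. That is one way a ranking could penalize the expensive failures described earlier.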