Appendix
Dataset and methodology details
A. The srs-prompts Dataset
The srs-prompts dataset consists of 1,499 highlight-anchored memory prompts drawn from 475 highlights across 93 sources, primarily technical explainers, blog posts, and opinion pieces. Every source in the dataset was chosen according to authentic reader interest.
Highlights were collected by the authors and a small community of experienced memory system users via an online deployment, with most of the dataset labeled by the authors themselves. For a single highlight, multiple variants of a flashcard were generated using a variety of pipeline strategies and then labeled according to the tier taxonomy. A companion dataset, srs-highlights, aggregates these rows back to the highlight level: a single row contains the highlight information and all the rated flashcards.
The dataset is split 80/20, with no source appearing in both train and test:
| Tier | Train | Test | Total | % |
|---|---|---|---|---|
| T0 (Off-target) | 395 | 71 | 466 | 31.1% |
| T1 (Needs-refactor) | 295 | 39 | 334 | 22.3% |
| T2 (Needs-polish) | 120 | 12 | 132 | 8.8% |
| T3 (Ready-to-review) | 485 | 82 | 567 | 37.8% |
T2 is notably underrepresented. In practice, we tended to either accept a prompt outright (T3) or find a structural flaw that pushed it back to T1; the “needs minor tweaks” judgment turned out to be uncommon. Any classification task that attempts to predict tiers should take this distribution into account. That said, the four tiers split naturally into two coarser, much more balanced groups: unusable (T0 and T1, 53.4% of the dataset) and usable (T2 and T3, 46.6%).
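The coarse grouping can be computed directly from the tier counts. A minimal sketch, with tier names matching the dataset's task_type labels and counts taken from the table above:

```python
# Collapse the four-tier taxonomy into the coarse pluckable split.
# Counts come from the distribution table above (total = 1,499).
TIER_COUNTS = {
    "off-target": 466,       # T0
    "needs-refactor": 334,   # T1
    "needs-polish": 132,     # T2
    "ready-to-review": 567,  # T3
}
PLUCKABLE_TIERS = {"needs-polish", "ready-to-review"}

def coarse_split(counts):
    total = sum(counts.values())
    usable = sum(n for tier, n in counts.items() if tier in PLUCKABLE_TIERS)
    return usable / total, (total - usable) / total

usable_frac, unusable_frac = coarse_split(TIER_COUNTS)
# usable_frac rounds to 0.466 (T2+T3), unusable_frac to 0.534 (T0+T1)
```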
Schema
Each row in the dataset represents a single memory prompt and contains the following fields:
| Field | Type | Description |
|---|---|---|
| id | int | Unique prompt identifier |
| task_type | string | Tier label: off-target, needs-refactor, needs-polish, or ready-to-review |
| source_url | string | URL of the source document |
| source_meta | dict | Metadata about the source: author, title, publication_date, summary |
| highlight | string | The highlighted text passage |
| highlight_interpretation | string | Model-produced summary of what the highlight indicates in context |
| content | string | The memory prompt text (question and answer together) |
| pluckable | bool | Binary signal: true for T2/T3, false for T0/T1 |
| tags | list[string] | Failure mode labels for T0/T1 prompts |
| highlight_id | int | Groups prompts that share the same highlight |
800 of the flashcards in the dataset record a failure: 466 due to targeting (T0) and 334 due to construction issues (T1). 630 of these failures carry at least one tag identifying the specific failure mode. Tags are not exclusive: a single prompt can be both off-target and wordy. They are also not exhaustive: the set we tagged is a useful subset of the failure modes that show up in practice, not a definitive ontology of every way a prompt can be unusable. The remaining 170 T0/T1 prompts were assigned a tier through human review but not annotated with specific failure tags; the absence of a tag does not imply the absence of the failure mode. All tier assignments in the dataset are human-reviewed.
The highlight_interpretation field. Including the full source text in every evaluation call would be prohibitively expensive, so each highlight ships with a two-to-four-sentence synthetic summary of what the highlight indicates relative to its source. This is the compressed context that downstream judges and classifiers see in place of the article itself. The summaries were generated by a frontier language model. The pipeline that produces them was validated on a small sample by the authors, but we did not systematically measure the fidelity of every interpretation against its source, so individual entries may contain hallucinations, confabulations, or subtle misrepresentations of the source material.
All experiments in Sections 2–3 rely on these interpretations; the arena in Section 6 is the exception. There, to eliminate the effect of interpretation errors, each flashcard is generated from the full source text and the associated highlight.
B. Highlights Study
To assess the feasibility of the interaction and the potential impact of our data-collection methodology, we conducted a small study (Grounding the Problem). The goal was to probe a deliberately strict version of the question: how much of a reader’s intent can an outside observer recover from their highlights alone? Highlighting is often part of sense-making, so highlights may capture what the reader processed rather than what struck them. If highlights are a noisy proxy for intent, then even a perfect downstream generator will produce prompts about the wrong things.
For srs-prompts (and the subsequent study), the exposure to this type of error is relatively limited, because the dataset was collected under a much softer condition: the authors of the highlights knew their highlights would be used to produce flashcards, so highlighting served as a deliberate channel of communication with the AI rather than pure sense-making.
From Highlights to Interests: Recovering What Readers Care About
For a more thorough analysis of this study — including the full per-article breakdown, the reader-type split in greater depth, and the implications for highlight-based personalization — see the workshop paper.
Participants and Materials
We recruited 51 participants from r/anki and X (formerly Twitter), then filtered for self-reported spaced repetition experience of at least three months, yielding 42 participants. Each was assigned one of three casual explainer articles based on their stated preferences in an entrance survey:
| Article | Interests | Example interest |
|---|---|---|
| Greening the Solar System | 12 | Silica aerogels could be manufactured on Mars from rock silica to create local greenhouse bubbles. |
| How the Bitcoin Protocol Actually Works | 10 | Attackers need >50% hash power to reliably rewrite history; with less, catch-up probability is tiny. |
| Three Kinds of Tacit Knowledge | 13 | Scientists couldn’t replicate TEA lasers from papers alone: success required tiny details the authors didn’t think to document. |
Before the study, we distilled each article into 10–13 discrete interests and then manually anchored each interest to one or more contiguous passages that best represent it. Participants read in a web interface equipped with a digital highlighter and were asked to mark whatever struck them. Afterward, they were shown the fixed interest list for their article and selected which ideas they would like turned into spaced repetition flashcards, knowing they would receive an Anki deck based on their selections. The selection task set a deliberately high bar: participants had to judge whether an idea was interesting enough to revisit over the coming weeks and months.
Readers differed substantially in what they found interesting. Pairwise Jaccard similarity, the fraction of interests two readers share out of all interests either selected, averaged just 45%. Removing the 19% of participants who selected every available interest (a group we call broad readers) drops average overlap further to 39%.
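The overlap metric can be sketched as follows; the reader selection sets here are hypothetical toy data, not values from the study:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Fraction of interests two readers share out of all either selected."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def mean_pairwise_jaccard(selections: list[set]) -> float:
    """Average Jaccard similarity over all pairs of readers."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy example: three hypothetical readers of the same article,
# each identified by the set of interest IDs they selected.
readers = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 4, 5}]
overlap = mean_pairwise_jaccard(readers)
```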
Prediction Strategies
We evaluated three strategies for predicting which interests a reader would select, given only their highlights. If highlights can predict a reader’s interests, they can be used to personalize the downstream experience (e.g., flashcard generation, summaries).
Baseline. The no-personalization default: surface every interest to every reader, the same deck for everyone. Since the average reader selects 7 of ~11 available interests, this achieves precision 0.63 with perfect recall.
Spatial overlap heuristic. Predict that an interest was selected whenever any highlight overlaps its anchored passage. This achieves precision 0.68 and recall 0.61 (F1 = 0.62), improving on the baseline’s precision while accepting a small recall loss: it filters out irrelevant interests without sacrificing much of what readers want. That said, 83% of readers have at least one selected interest whose anchor was never highlighted. As a coarse filter it works; it is by no means a seamless predictor.
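A minimal sketch of the heuristic, assuming interests are anchored to character ranges in the article and highlights are recorded the same way (the anchor spans and selections here are hypothetical):

```python
def overlaps(span_a, span_b):
    """True if two (start, end) character ranges intersect."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]

def spatial_heuristic(highlights, anchors, selected):
    """Predict an interest whenever any highlight overlaps its anchor,
    then score the prediction against the reader's actual selections."""
    predicted = {interest for interest, anchor in anchors.items()
                 if any(overlaps(h, anchor) for h in highlights)}
    tp = len(predicted & selected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(selected) if selected else 0.0
    return predicted, precision, recall

# Hypothetical anchors and one reader's highlights (character offsets):
anchors = {"A": (0, 120), "B": (200, 340), "C": (400, 520)}
highlights = [(50, 90), (210, 230)]          # touches A and B
predicted, p, r = spatial_heuristic(highlights, anchors, {"A", "C"})
# predicted == {"A", "B"}: one true positive, one false positive,
# and the never-highlighted selection "C" is missed.
```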
LLM inference. We tested three frontier models, GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro, each given the full article with the reader’s highlights marked inline and the candidate interest list. GPT-5.2 with medium reasoning performed best and raised the F1 to 0.67. The improvement over the spatial heuristic is modest (+0.05), and a failure pattern emerges: both methods systematically over-predict for readers who select few interests.
The Reader-Type Split
Grouping readers into terciles by the number of interests they selected reveals three qualitatively different regimes:
| Reader type | n | Avg selected | % Intent | Heuristic F1 | Best LLM F1 |
|---|---|---|---|---|---|
| Broad (9–12 selected) | 14 | 10.5 | 94% | 0.82 | 0.87 |
| Mid (7–8 selected) | 13 | 7.4 | 79% | 0.64 | 0.70 |
| Selective (1–6 selected) | 15 | 3.7 | 45% | 0.41 | 0.47 |
% Intent measures, among the interests a reader highlighted, what fraction they then selected as worth turning into flashcards. For broad readers, attention and intent are 94% aligned: nearly everything they highlight, they also select. Prediction is easy, and personalization adds little over simply surfacing everything.
For selective readers, only 45% of highlights that intersected an interest became selections. For every interest they both highlight and choose, there is roughly another they highlight but do not select. For this group, LLM inference helps only marginally (F1 = 0.41 → 0.47); the gap lives in the signal itself.
What makes this especially striking is that selective readers do not highlight less than mid-range readers. Both groups cover roughly 7% of their article by word count. Selective readers simply translate fewer of those highlights into selections at the end. For them, highlighting functions as exploration rather than commitment: it marks careful reading and sense-making, not a decision to preserve an idea. The aggregate F1 of 0.67 therefore overstates how well highlights work as a signal.
C. Evaluation Experiments
All evaluation experiments shared the same input structure: source metadata (title, author, URL), the highlighted text, a synthetic interpretation of the highlight in source context, and the candidate memory prompt. This section provides full methodology for the evaluation experiments described in Section 2.
Binary Classification
Each model classified a single prompt as pluckable (T2/T3) or unpluckable (T0/T1). We tested two instruction variants: zero-shot (task description only) and few-shot. The few-shot condition prepended 11 annotated examples (4 pluckable and 7 unpluckable), each with a short explanation of the rating. The pluckable examples illustrated techniques like using vivid source language as a recall cue or capturing an implied weakness. The unpluckable examples spanned both construction failures (missing contrast, overly broad framing) and targeting failures (testing labels instead of surprise, yes/no trivialization).
Because none of these models were trained on srs-prompts, the binary-classification experiments do not use the 80/20 split from Section A; they evaluate on the entire dataset, minus only those sources whose prompts appear in the 11-example few-shot set (to prevent leakage). After that filter the evaluation set contains 1,198 prompts: 485 pluckable and 713 unpluckable. The table below extends the per-tier breakdown in Section 2 (which covers Opus 4.5 and Sonnet 4.5) to include all models tested. Each value is the fraction of prompts at that tier correctly classified as pluckable or unpluckable under the few-shot condition.
| Model | T0 (n=328) | T1 (n=265) | T2 (n=120) | T3 (n=485) |
|---|---|---|---|---|
| Sonnet 4.5 | 0.988 | 0.955 | 0.875 | 0.112 |
| GPT 5.2 | 0.899 | 0.736 | 0.450 | 0.452 |
| Gemini 2.5 Pro | 0.735 | 0.415 | 0.217 | 0.748 |
| GPT-OSS 120B | 0.619 | 0.358 | 0.142 | 0.761 |
| Qwen3 32B | 0.509 | 0.249 | 0.150 | 0.819 |
The per-tier numbers separate the models into two failure modes. Sonnet 4.5 is conservative: it catches almost every unusable prompt but also rejects most good ones (T3 accuracy 11.2%). Qwen3 32B is permissive: it approves most prompts regardless of quality (T3 accuracy 81.9%, T1 accuracy 24.9%).
Rubric
For each of the five failure categories described in Section 2, we wrote dedicated classification instructions with synthetic examples. Models judged each criterion independently: given a highlight and prompt, does this prompt exhibit the failure? Positive examples were drawn from the tagged subset of srs-prompts with the corresponding failure tag; negatives were pluckable prompts. Sample sizes varied by category (25–96 positive examples, 96 negatives each).
Preference Selection
The binary and rubric tests ask for absolute judgments. Preference selection tests whether relative comparison is easier: given a small set of candidates for the same highlight, can a model identify the best one?
We assembled 90 highlights, each with 2–4 candidate prompts. Every set included exactly one T3 prompt; the others were T1 or T2. T0 prompts were excluded from this experiment to isolate LLMs’ ability to infer construction quality. Candidate prompts were given anonymous identifiers (randomized letters and digits) so that position and label could not bias the selection.
Models received the highlight, its source context, and the full candidate set, then selected the single best prompt under a fixed instruction. The setup is deliberately favorable: the human-preferred T3 prompt is always present. Even so, no model picked it reliably. Opus 4.5, the best performer, chose the T3 prompt about half the time and still selected a T1 prompt (the tier we most want to reject) in 32.6% of comparisons.
D. Training Experiments
Binary Classifiers
We trained binary cross-entropy classifiers to predict pluckability (T2/T3 vs. T0/T1) directly, sweeping three base models of increasing scale: ModernBERT-base (~145M parameters), ModernBERT-large (~395M), and Qwen3-0.6B (~600M). Each model received the full input context (source metadata, highlight, synthetic interpretation, and candidate prompt) concatenated into a single sequence of up to 2,048 tokens. Training used a cosine learning rate schedule with 5% warmup, sweeping learning rates (5e-6, 2e-5, 5e-5), batch sizes (8, 16), and epochs (1–3). Unlike the prompted-model experiments in Section C, classifier training requires source-disjoint train and test splits to avoid leakage, so the experiments here use the 80/20 split from Section A. To prevent any single source from dominating either side of the split, the loader caps each source at 12 prompts; this brings the held-out test split from 204 down to 130 examples. We optimized for ROC-AUC on that test set.
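The per-source cap can be sketched as a small preprocessing step. This is an illustrative reconstruction, not the project's actual loader; the source URLs are hypothetical, and only the field names follow the srs-prompts schema:

```python
import random
from collections import defaultdict

def cap_per_source(rows, cap=12, seed=0):
    """Cap each source at `cap` prompts so no single source dominates
    either side of the source-disjoint split."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source_url"]].append(row)
    capped = []
    for source_rows in by_source.values():
        if len(source_rows) > cap:
            source_rows = rng.sample(source_rows, cap)
        capped.extend(source_rows)
    return capped

# Hypothetical rows: 3 sources with 20 prompts each -> 36 after capping.
rows = [{"id": i, "source_url": f"https://example.com/{i % 3}"}
        for i in range(60)]
capped = cap_per_source(rows, cap=12)
```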
The best result came from Qwen3-0.6B at AUC 0.754. At a threshold of 0.425, this matches the precision–recall tradeoff of Gemini 2.5 Pro’s few-shot prompting, a model orders of magnitude larger. To check whether more capacity would help, we fine-tuned Qwen3-14B with LoRA (rank r=8), which produced an essentially identical AUC of 0.752. The two ROC curves track each other closely, suggesting that within this model family additional parameters are not where the gains will come from. Pushing AUC further likely requires more data rather than a larger model.
Reward Model
Our tier structure implies a natural preference ordering (T3 > T2 > T1 > T0). Since each highlight aimed to contain both good and bad examples of flashcard quality, we can turn each pair of differently-tiered prompts into a preference example (with the higher-tier prompt as “chosen”). A highlight that carries prompts at T0, T1, and T3 therefore yields three training pairs: (T1 > T0), (T3 > T0), and (T3 > T1). Across the dataset, this expansion produces 353 training pairs and 93 evaluation pairs. That is small by the standards of public preference datasets such as HH-RLHF and UltraFeedback, which contain on the order of 100k pairs each.
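The pair expansion is mechanical; a sketch, with tier names matching the dataset's task_type labels and hypothetical prompt contents:

```python
from itertools import combinations

TIER_ORDER = {"off-target": 0, "needs-refactor": 1,
              "needs-polish": 2, "ready-to-review": 3}

def preference_pairs(prompts):
    """Expand one highlight's rated prompts into (chosen, rejected)
    pairs, one per pair of differently-tiered prompts."""
    pairs = []
    for a, b in combinations(prompts, 2):
        ta, tb = TIER_ORDER[a["tier"]], TIER_ORDER[b["tier"]]
        if ta != tb:  # same-tier pairs carry no preference signal
            chosen, rejected = (a, b) if ta > tb else (b, a)
            pairs.append((chosen["content"], rejected["content"]))
    return pairs

# A highlight carrying prompts at T0, T1, and T3 yields three pairs:
prompts = [{"tier": "off-target", "content": "p0"},
           {"tier": "needs-refactor", "content": "p1"},
           {"tier": "ready-to-review", "content": "p3"}]
pairs = preference_pairs(prompts)  # (p1>p0), (p3>p0), (p3>p1)
```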
We trained reward models on a single base, Qwen3-0.6B, using TRL’s RewardTrainer on ChatML-formatted conversations. The hyperparameter sweep covered learning rates (5e-6, 2e-5, 5e-5), batch sizes (4, 8), and epochs (1–3), all with a cosine schedule and 5% warmup. The table below shows where the training signal concentrates:
| Preferred \ Rejected | T0 | T1 | T2 |
|---|---|---|---|
| T1 | 64 | — | — |
| T2 | 19 | 18 | — |
| T3 | 100 | 98 | 54 |
GRPO
To assess the effects of RL in this domain, we attempted to apply RL to an existing post-trained model to turn it into a competent tier classifier with GRPO (Group Relative Policy Optimization). GRPO samples a small group of completions per prompt, normalizes their rewards into advantages relative to the group mean, and applies a PPO-style update. The mechanism only produces a learning signal when rewards inside a group actually vary; if every completion in a group receives the same reward, the normalized advantages collapse to zero and the gradient vanishes. We used TRL’s GRPO implementation on Qwen3-32B and on the OpenPipe variant of Qwen3-14B-Instruct, training on 4×A100 and 2×H200 nodes and sweeping over batch sizes, LoRA ranks, and DeepSpeed stages. Across every configuration the trained models showed negligible improvement on tier accuracy over their starting checkpoints.
To understand why, we measured how much within-group variance the task produces in the first place. For each base model we sampled 8 completions per task at temperature 1.0 and recorded how often at least one received the correct tier label (pass@8). The complement — tasks where the model is unanimously right or unanimously wrong — is what determines how much of the dataset is wasted under GRPO.
| Model | Pass@8 | Zero-gradient tasks (of 212) |
|---|---|---|
| DeepSeek V3.1 | 60.4% | 116 |
| Qwen3-235B | 55.7% | 121 |
| Qwen3-32B | 50.9% | 110 |
| GPT-OSS 120B | 42.5% | 145 |
Even the strongest sampler (DeepSeek V3.1) leaves more than half the dataset unanimously right or wrong, and applying DeepSeek R1’s standard zero-gradient filtering strategy would leave only 96 of 212 tasks with any usable learning signal. Filtering helps in principle but runs straight into the second problem: the dataset is small, and aggressively pruning it shrinks an already thin training signal further. We never observed RL match SFT-quality behavior on the four-way task in any of the configurations we tried. See Section H for the SFT-then-RL direction we would try next.
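The zero-gradient collapse is visible in the advantage computation itself. A simplified sketch of group-relative advantages (TRL's actual implementation differs in details such as reward scaling and clipping):

```python
def group_advantages(rewards):
    """GRPO-style advantages: center each completion's reward on the
    group mean and scale by the group std when it is nonzero."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    variance = sum(c * c for c in centered) / len(rewards)
    std = variance ** 0.5
    if std == 0.0:
        # Unanimous group (all right or all wrong): advantages collapse
        # to zero and the policy gradient for this task vanishes.
        return [0.0] * len(rewards)
    return [c / std for c in centered]

mixed = group_advantages([1, 0, 1, 0, 0, 1, 0, 0])  # informative group
unanimous = group_advantages([0] * 8)                # no learning signal
```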
E. Arena Implementation
Pipeline
The arena evaluates generation pipelines end-to-end. It runs on a subset of the srs-highlights dataset that we call the grounded split. A highlight is grounded if:
- It has at least two rated ground-truth prompts.
- These annotated prompts include at least one pluckable prompt (T2/T3) and one non-pluckable prompt that isn’t off-target.
- It comes from a source whose full text we could programmatically access at evaluation time, so the generating model could be given the article rather than a summary.
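The grounding criteria above can be expressed as a filter predicate. A sketch, where the tier field follows the srs-prompts schema and source_text_available is a hypothetical flag standing in for the programmatic-access check:

```python
PLUCKABLE = {"needs-polish", "ready-to-review"}  # T2/T3

def is_grounded(highlight):
    """Apply the three grounding criteria to one highlight's rated
    ground-truth prompts."""
    tiers = [p["task_type"] for p in highlight["prompts"]]
    has_pluckable = any(t in PLUCKABLE for t in tiers)
    # A non-pluckable prompt that isn't off-target, i.e. needs-refactor (T1):
    has_flawed = "needs-refactor" in tiers
    return (len(tiers) >= 2 and has_pluckable and has_flawed
            and highlight["source_text_available"])

grounded = is_grounded({
    "prompts": [{"task_type": "ready-to-review"},
                {"task_type": "needs-refactor"}],
    "source_text_available": True,
})
```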
To prevent any single source from dominating, at most six highlights per source are included; sources with more valid highlights are randomly sampled down to six. This yields 212 highlights across 77 sources. For each highlight, the pipeline runs in three stages:
1. Generate. A model receives a system-level generation instruction and a user message containing the full source text, the highlighted passage, and basic metadata (title, author, URL).
2. Grade. Each generated prompt is individually scored by the grounded LLM judge. The judge assigns a tier (T0–T3) based on the rubric described in Section 2.
3. Score and rank. For each highlight, we compute a completion score from the set of graded prompts, then aggregate pairwise comparisons into Elo ratings. All models start at a base rating of 1500, ratings update with K=32, and the final ratings reported in the leaderboard are averaged over 10 randomized match orderings to wash out ordering effects.
All models receive identical inputs and the same generation instruction. The number of prompts per highlight is unbounded; models choose how many to generate. As a validation control, we also ran Gemini 2.5 Pro with a trivia-style instruction (described in Section 5). The control ranks last by a wide margin, confirming that the scoring pipeline penalizes structural failures as intended.
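The rating update in the score-and-rank stage is standard Elo; a sketch using the base rating and K-factor stated above:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise Elo update. `score_a` is 1.0 for a win by A,
    0.0 for a loss, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two pipelines start at the base rating of 1500; A wins one match.
# With equal ratings the expected score is 0.5, so A gains k/2 = 16.
a, b = elo_update(1500, 1500, 1.0)
```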
Grounded LLM-as-a-judge
All arena prompts are graded by Claude Sonnet 4.5 via the Anthropic API with extended thinking enabled (budget: 2,048 thinking tokens). The judge receives the full source context (title, URL, author, highlighted text, and highlight interpretation) along with all human-labeled reference prompts for the same highlight, sorted by tier so T0 examples appear first and T3 examples last. These reference prompts let the judge see the quality gradient for that particular highlight before it evaluates the candidate. The full instruction, including tier definitions, evaluation criteria, and explicit T1/T2 boundary checks, is in the project repository.
The classifier is imperfect, so for the arena to be meaningful we need to know how its errors are distributed. Calibration requires ground-truth tier labels that the judge has not seen during instruction development. The published srs-highlights grounded split contributes 212 of those. We reserve a private held-out set of 74 additional highlights drawn from three sources that appear in neither srs-prompts nor srs-highlights. Together they form a 286-highlight calibration set.
For the judging instruction, we iterated on several variants until we found the one that best aligned with the human labels, using the held-out set to guard against over-fitting. Other approaches to improving the judge (GEPA prompt optimization and supervised fine-tuning) are documented in Section F.
Confusion matrix
The confusion matrix below reflects the selected instruction evaluated against the 286 held-out samples. Rows are human ground-truth tiers; columns are judge predictions.
| | Judge T0 | Judge T1 | Judge T2 | Judge T3 |
|---|---|---|---|---|
| Human T0 | 44 | 28 | 1 | 1 |
| Human T1 | 9 | 53 | 20 | 5 |
| Human T2 | 2 | 9 | 34 | 28 |
| Human T3 | 5 | 13 | 25 | 9 |
The errors here cluster in predictable directions rather than scattering uniformly: T2 and T3 are the hardest pair (only 9 of the 52 human-T3 prompts are labeled T3 by the judge, with most pulled down to T2), and the T1/T2 boundary is the second-largest source of confusion. Rather than trusting individual tier predictions, the scoring pipeline uses this matrix to compute the posterior P(human tier ∣ judge tier), then weighs each posterior by the asymmetric cost assignments described in the report. The full scoring and Elo aggregation implementation lives in the project repository.
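The posterior computation amounts to normalizing the judge's column of the matrix. A sketch using the counts above:

```python
# Confusion counts from the table above: rows = human tier, cols = judge tier.
CONFUSION = [
    [44, 28,  1,  1],   # human T0
    [ 9, 53, 20,  5],   # human T1
    [ 2,  9, 34, 28],   # human T2
    [ 5, 13, 25,  9],   # human T3
]

def posterior(judge_tier):
    """P(human tier | judge tier): normalize the judge's column."""
    column = [row[judge_tier] for row in CONFUSION]
    total = sum(column)
    return [count / total for count in column]

p_given_judge_t3 = posterior(3)  # distribution over human tiers T0..T3
```

Under these counts, a judge-T3 verdict is actually most consistent with human T2 (28 of the 43 judge-T3 cases), which illustrates why the pipeline weights posteriors by asymmetric costs rather than trusting the judge's raw label.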
Judge consistency
Sonnet 4.5 is sampled at temperature 1 with a 2,048-token thinking budget. The thinking trace is the dominant source of run-to-run variation: small differences in the chain of reasoning occasionally flip a borderline tier rating. To analyze this effect, we ran three independent evaluation passes over a sample of the validation set and measured agreement at two levels of granularity:
| Run | n | Tier κ | Tier % | Pluck κ | Pluck % |
|---|---|---|---|---|---|
| Run A | 101 | 0.325 | 51.5% | 0.636 | 84.2% |
| Run B | 101 | 0.446 | 60.4% | 0.582 | 82.2% |
| Run C | 101 | 0.363 | 54.5% | 0.616 | 83.2% |
| Mean ± SD | — | 0.378 ± 0.062 | 55.4% ± 4.5% | 0.611 ± 0.027 | 83.2% ± 1.0% |
Pluckability is stable run-to-run (κ between 0.582 and 0.636), while tier-level agreement is noticeably noisier. Because pluckability holds across runs, most of the flipping has to happen within a pluckability class: at the T0/T1 and T2/T3 boundaries rather than across T1/T2. The T2/T3 boundary is also where humans and the judge disagree most in the confusion matrix above, so a single evaluation pass cannot separate judge stochasticity from judge-vs-human bias; the calibration folds both into the same posterior.
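The κ columns in the table are Cohen's kappa, agreement corrected for chance. A self-contained sketch on toy binary pluckability verdicts (not study data):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Toy run-to-run comparison on ten binary pluckability verdicts:
run1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
run2 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(run1, run2)
# 8/10 observed agreement against 0.52 chance agreement
```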
F. Improving the Judge
The hand-crafted instruction described in Section E is the baseline this section tries to beat. We attempted two approaches: automated prompt optimization with GEPA, and supervised fine-tuning of an open-weights model.
Prompt Optimization
We used DSPy’s GEPA (Genetic-Pareto) to optimize the grounded tier-classification instruction automatically. GEPA generates candidate instructions, evaluates them against training examples, and iteratively refines them based on failure analysis. The program LM was the same model as our hand-crafted baseline, Claude Sonnet 4.5, so that any effect could be attributed to the optimizer. GPT-5.2 served as the reflection model, with auto="light" mode, a reflection minibatch size of 6, and 32 threads.
Both the GEPA-optimized instruction and the hand-crafted baseline were then evaluated on the same 286 grounded highlights from the held-out calibration set, with Sonnet 4.5 as the underlying model for both:
| Instruction | Tier κ | Tier % | Pluck κ | Pluck % |
|---|---|---|---|---|
| Hand-crafted | 0.308 | 49.0% | 0.601 | 80.4% |
| GEPA-optimized | 0.271 | 44.1% | 0.519 | 75.5% |
The GEPA instruction performed worse on both tier accuracy (44.1% vs. 49.0%) and pluckability (75.5% vs. 80.4%). It produced a longer, more specific instruction with extra criteria that read like improvements on paper, but none of the additions sharpened the T1/T2 boundary.
Supervised Fine-Tuning
Rather than optimizing the instruction, we tried teaching the judgment directly. We fine-tuned Qwen3-14B-Instruct with LoRA on 180 grounded examples (45 per tier, balanced by design). Each example was formatted as a ChatML conversation: a system instruction, a user message containing source context and reference prompts with tier labels, and an assistant response with the tier classification. Training used assistant-only loss, computing gradients on the assistant’s tier prediction but not on the system or user turns. This makes the task effectively classification fine-tuning: the model learns to map a fixed input format to a tier label. We also tried full-sequence loss in early experiments and found it underperformed the assistant-only setup.
The hyperparameter sweep covered LoRA rank r over {1, 8, 16, 32, 64}, learning rates {5e-6, 2e-5, 5e-5}, batch sizes {4, 8}, and epochs {1–3}, with a cosine schedule, 5% warmup, LoRA alpha=32, dropout=0.05, all-linear target modules, and a max sequence length of 4,096. The best configuration was r=8, lr=2e-5, 2 epochs. We then re-evaluated this configuration against the same 286-highlight calibration set used in the GEPA comparison above, so the hand-crafted baseline numbers are identical across the two tables:
| Configuration | Tier κ | Tier % | Pluck κ | Pluck % |
|---|---|---|---|---|
| Hand-crafted (Sonnet 4.5) | 0.308 | 49.0% | 0.601 | 80.4% |
| Qwen3-14B + LoRA r=8 | 0.396 | 56.3% | 0.455 | 74.1% |
The fine-tuned model beats Sonnet 4.5 on tier accuracy (+7.3 points) but loses 6.3 points on pluckability. The sweep over LoRA rank r ∈ {1, 8, 16, 32, 64} found that rank had essentially no effect: every configuration landed within a few points of the others on both metrics. That points to a data ceiling rather than a model ceiling: the 14B backbone is already extracting whatever signal exists and is not sensitive to additional parameters.
For the arena, the gain in tier accuracy is encouraging. The arena’s calibration step works directly from the four-way confusion matrix, so gains in tier classification accuracy translate directly into a higher-fidelity arena.
The final arena seen in this report still uses Sonnet 4.5 with the hand-crafted instruction: calling a hosted API is operationally simpler than serving a local 14B model, and the accuracy gap isn’t yet wide enough to justify the extra infrastructure. Future work on fine-tuned judges for this task remains exciting, though. Distillation from a Sonnet-class teacher and choosing a stronger base model both seem like compelling directions.
A cheap, locally-runnable judge that matches the hand-crafted Sonnet baseline would unlock cheaper arena runs, easy re-runs as new generators ship, deployment in settings where a per-highlight frontier API call isn’t viable, and (most importantly) the inner loops a local judge enables — test-time scaling, synthetic data augmentation, and reward-shaped training.
G. Known Limitations
A few caveats apply to the dataset and the methods built on it.
- Synthetic interpretations. The highlight_interpretation field is generated by a language model and may contain hallucinations or subtle reframings of the reader’s intent.
- Source bias. Sources skew toward technical explainers, blog posts, and opinion pieces. The dataset contains no textbooks, limited narrative nonfiction, and no fiction. Results may not generalize to these content types.
- Small annotator pool. Most prompts were labeled by the authors, with contributions from a small community of experienced memory system users. The taxonomy may reflect idiosyncratic preferences that would not replicate with a larger, more varied pool. The flip side is that the dataset captures a relatively coherent notion of prompt quality, which can be useful for taste induction or personalized model alignment, where a consistent evaluative perspective is more valuable than annotator diversity.
- T2 underrepresentation. Only 8.8% of prompts are rated T2 (see the distribution table in Section A).
- No longitudinal validation. Tier assignments were made at labeling time and never validated against the long-horizon reviews these prompts are designed to support. What stands in for that validation is the labelers’ own experience: each annotator had been running a personal SRS practice for several years, and a T3 rating reflects their compressed taste for what survives months of review (see the report for more on why we treat that taste as a compressed signal).
H. Future Research
A few directions we did not have time to pursue but think would be high-leverage extensions of this work:
- Which grounded references carry the signal? The grounded judge sees every reference prompt for a highlight, sorted by tier. We never measured which references most affect judgment quality. Future work should examine whether ratings from other highlights can be repurposed to anchor the judge dynamically at runtime (enabling a RAG-like workflow).
- Reference budget vs. judge accuracy. Related to the above, but framed as a budget question: what is the smallest reference set per highlight that preserves the current calibration quality? The answer determines how cheaply the judge can be extended to incorporate ratings from new highlights and sources.
- Local judge distillation. As Section F argues, a local judge that matches the hand-crafted Sonnet baseline is the single change that would unlock most of the report’s downstream applications. Distillation from a Sonnet-class teacher, SFT followed by preference optimization, or a stronger open base are all candidate paths.
- SFT-then-RL for the four-way classifier. SFT to install format and a usable prior, then GRPO to sharpen the T1/T2 boundary, is the most natural next experiment.
In academic work, please cite this as:
Ozzie Kirkby and Andy Matuschak, “Memory Machines: Can LLMs create lasting flashcards from readers’ highlights?”, https://memory-machines.com/report, San Francisco (2026).