Prioritization and the Lost-in-the-Middle Effect
The architecture of forgetting
The previous page mentioned the lost-in-the-middle effect. Now for the uncomfortable details.
In 2024, Stanford-led researchers published "Lost in the Middle: How Language Models Use Long Contexts." The findings confirmed what practitioners had suspected: language models process information at different positions with measurably different accuracy. When relevant information appeared in the middle of long contexts, performance dropped substantially compared to the same information at the beginning or end.
Plot accuracy against information position and you get a distinctive U-shaped curve. This is not subtle. In some configurations, GPT-3.5-Turbo performed worse with documents containing the answer than with no documents at all when that answer appeared mid-context. Providing more context actively hurt performance.
How bad is it?
The research provides specific numbers.
In multi-document question answering with 20-30 documents, performance dropped more than 20% when the relevant document moved from the beginning or end to the middle. The degradation was consistent across tested models including GPT-3.5-Turbo, Claude, MPT-30B-Instruct, and LongChat-13B.
The NoLiMa benchmark from 2025 reinforced these findings at larger scales. At 32,000 tokens, 11 of 12 models dropped below 50% of their short-context baseline. The degradation followed the predictable U-shaped pattern where boundary positions outperformed middle positions.
Research from Chroma in 2025 added a counterintuitive twist. Models performed better on shuffled document collections than on logically structured ones. Why? Logical structure placed certain documents predictably in the middle, where they received less attention. Random shuffling distributed relevant content across positions, increasing the probability that some would fall at favorable boundaries.
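A minimal sketch of the position sweep these studies run, assuming a hypothetical `ask_model(prompt)` function that wraps whatever model is being evaluated; the documents, question, and scoring here are placeholders:

```python
def build_context(relevant_doc: str, distractors: list[str], position: int) -> str:
    """Insert the answer-bearing document at a chosen index among distractors."""
    docs = distractors[:position] + [relevant_doc] + distractors[position:]
    return "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))

def position_sweep(question, answer, relevant_doc, distractors, ask_model):
    """Check whether the model answers correctly as the relevant document moves."""
    results = {}
    for position in range(len(distractors) + 1):
        context = build_context(relevant_doc, distractors, position)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        reply = ask_model(prompt)                      # hypothetical model call
        results[position] = answer.lower() in reply.lower()
    return results
```

Averaging the per-position results over many questions produces the U-shaped curve the papers report.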
Why attention distributes unevenly
The U-shaped curve emerges from how transformers work, not from any particular model's training.
Causal masking
Autoregressive language models use causal attention masks that prevent tokens from attending to future positions. When processing position 1000, the model can attend to positions 1 through 1000 but not 1001 onward. Combined with softmax normalization in attention, this creates a structural bias toward earlier tokens: they are visible to every later position, so they participate in more attention operations and accumulate influence.
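A minimal NumPy sketch of causal masking in scaled dot-product attention; the token at position 0 is visible in every row of the weight matrix, while the final token appears only in the last row:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head, no batching)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                           # (seq, seq) raw scores
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)              # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over allowed positions only
    return weights @ V

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))    # 6 tokens, 8-dimensional embeddings
out = causal_attention(x, x, x)    # row i mixes only tokens 0..i
```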
MIT research in 2025 traced the primacy effect to this mechanism. The analysis found that "causal-style processing combined with uniform long-term retrieval demands creates primacy." The effect appeared in 73 of 104 test instances across models and tasks.
Rotary position embedding decay
Most modern language models (LLaMA, Qwen, DeepSeek, Mistral, and the models underlying Claude and Codex) use Rotary Position Embedding (RoPE) to encode position information. RoPE applies rotation matrices to query and key vectors, with the rotation angle determined by token position.
When computing attention between two tokens, the result depends on their relative distance. As distance increases, the angular misalignment between rotated vectors grows, systematically reducing their dot product. Nearby tokens receive higher attention weights than distant ones. This is recency bias, built into the math.
The decay is measurable. Research on RoPE dimension efficiency found that early embedding dimensions (those with high rotation frequencies) become functionally inactive for long-context retrieval. When positions span large ranges, these dimensions rotate through such wide angles that they cannot maintain consistent contributions to attention scores. Models learn to assign near-zero weights to these dimensions, effectively wasting computational capacity.
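A rough numerical sketch of that decay, using the standard RoPE rotation and a key that shares the query's content so the positional effect is isolated; the dimension, distances, and trial count are arbitrary:

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Apply rotary position embedding to a vector with an even number of dimensions."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)   # one rotation frequency per 2-D pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
dim, trials = 128, 200
for distance in [0, 4, 64, 1024, 16384]:
    scores = []
    for _ in range(trials):
        q = rng.standard_normal(dim)
        # A key with the same content as the query, placed `distance` tokens away:
        scores.append(rope_rotate(q, 0) @ rope_rotate(q, distance))
    print(distance, round(float(np.mean(scores)), 1))
```

The averaged score shrinks as the offset grows: a matching key earns less raw attention the further away it sits.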
Training data distribution
Models learn position awareness from training data, and training data is not uniformly distributed. Research found that position indices at or below 1,024 account for more than 80% of training positions. Position indices at or above 1,536 constitute less than 5%.
Models develop strong intuitions about nearby positions through extensive exposure. Distant positions remain underrepresented in training, leading to unreliable generalization when contexts exceed the well-trained range.
How these factors combine
Primacy bias from causal masking favors early tokens. Recency bias from RoPE decay favors recent tokens. Training bias provides stronger gradients for boundary positions.
The result is the U-shape. Information at the beginning benefits from accumulated attention operations. Information at the end benefits from minimal rotational misalignment. Information in the middle suffers from both distance decay and diluted attention.
The degradation is not uniform across the middle. Research shows a gradual decline from the start, reaching a minimum around 50-70% of the context position, then gradually improving toward the end. The worst-performing region is not the exact center but a broad zone spanning roughly the middle third.
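A purely illustrative toy, with made-up decay constants rather than anything measured, showing how a slowly fading primacy term plus a sharper recency term leaves the lowest combined weight past the midpoint:

```python
import numpy as np

n = 1000                                  # context length in tokens
pos = np.arange(n)

# Made-up decay constants, chosen only to illustrate the shape:
primacy = np.exp(-pos / 300)              # influence accumulated by early tokens, fading slowly
recency = np.exp(-(n - 1 - pos) / 150)    # attention to recent tokens, falling off faster
effective = primacy + recency

worst = int(np.argmin(effective))
print(f"lowest combined weight near position {worst} of {n}")   # lands past the midpoint
```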
What to do about it
These mechanics translate to specific patterns for structuring context.
Place critical information at boundaries
The most reliable positions are the beginning and end. For AI coding agents:
- CLAUDE.md and AGENTS.md files load first, occupying privileged early positions
- System prompts appear at the start of every interaction
- The current query appears at the end, where recency bias provides maximum attention
- Instructions repeated at context end receive more attention than the same instructions mid-context
Claude's documentation recommends placing long documents at the top of prompts, above queries and instructions, with the final query at the bottom. This arrangement can improve response quality by up to 30% for complex multi-document inputs.
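A minimal sketch of that ordering as a prompt assembler; the function and tag names are illustrative, not any framework's API:

```python
def assemble_prompt(system: str, documents: list[str], instructions: str, query: str) -> str:
    """Order context so the critical pieces sit at the positional boundaries."""
    doc_block = "\n\n".join(
        f'<document index="{i + 1}">\n{doc}\n</document>' for i, doc in enumerate(documents)
    )
    return "\n\n".join([
        system,        # start of context: benefits from primacy
        doc_block,     # long reference material above the query
        instructions,  # task instructions after the documents
        query,         # end of context: benefits from recency
    ])
```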
Use structural markers
The lost-in-the-middle effect applies most to unmarked content. Structural elements like headers, XML tags, and explicit markers can partially mitigate the effect by creating attention anchors within the middle zone.
<critical_requirement>
All API endpoints must validate authentication tokens before processing requests.
</critical_requirement>

Explicit marking does not eliminate positional bias, but it creates stronger signals that compete with the architectural tendency to deprioritize middle content.
Restate key points strategically
For long sessions where critical information necessarily falls in the middle, strategic restatement moves it toward the end. State the requirement, provide supporting context, then restate the requirement before concluding.
For multi-turn conversations with agents:
- Early turns: establish core constraints and architecture decisions
- Mid-session: explore and iterate; details introduced here are the most vulnerable to being forgotten
- Before completion: restate critical requirements and validate against original intent
The final restatement leverages recency bias to ensure key constraints remain active when generating output.
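One way to mechanize that final restatement, sketched as a hypothetical helper that appends the session's original constraints immediately before the closing request:

```python
def closing_prompt(constraints: list[str], request: str) -> str:
    """Restate critical constraints at the very end, where recency bias favors them."""
    restated = "\n".join(f"- {c}" for c in constraints)
    return f"Before finishing, re-check the original requirements:\n{restated}\n\n{request}"

# Example usage at the end of a long agent session:
print(closing_prompt(
    ["All endpoints validate auth tokens", "No breaking changes to the public API"],
    "Now produce the final implementation plan.",
))
```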
Prefer targeted context over comprehensive context
The lost-in-the-middle effect intensifies with context length. More documents mean more middle positions where information can be lost.
The Chroma research finding is worth remembering: models performed better on shuffled documents because shuffling disrupted the predictable middle-positioning of certain content. Curating a smaller set of relevant documents often outperforms including a comprehensive set where relevant content becomes diluted.
For coding agents:
- Read specific functions rather than entire files when possible (see the sketch after this list)
- Include only the documentation sections relevant to the current task
- Skip the "just in case" context
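A sketch of the first item above, using Python's standard ast module to pull a single function's source out of a file instead of pasting the whole file into context; the example path and function name are invented:

```python
import ast
from pathlib import Path

def extract_function(path: str, name: str) -> str | None:
    """Return the source of one function or method by name, not the entire file."""
    source = Path(path).read_text()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None

# Include only what the task needs, e.g.:
# context = extract_function("billing/service.py", "apply_discount")
```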
Working with the architecture
The lost-in-the-middle effect cannot be eliminated through prompting. It emerges from transformer architecture, positional encoding, and training data distributions. Future model generations may reduce its severity, but current models exhibit the pattern consistently.
Treat positional bias as a constraint to work within. Place critical information at boundaries. Accept that middle-positioned content receives less attention. Restate key requirements at the end of long prompts.
The next page examines context compression: how compaction works, what survives the process, and how to influence what information persists when context is condensed.