Exercise: Multi-File Refactoring with Context Management
Overview
The Gilded Rose Kata is a refactoring exercise that will push your session past 30 turns. That is the point. A single function with nested conditionals needs to become a multi-file polymorphic design, and you cannot get there without managing context along the way.
The refactoring itself is well-documented elsewhere. Here, the focus is on what happens to your agent session as it stretches across many turns: when to compact, how scratchpads compensate for memory loss, and whether plan mode actually prevents wasted work.
The task
The Gilded Rose is an inn that sells items whose quality and sellIn values change daily according to rules that have accumulated into an unreadable conditional.
The legacy code handles all item types in a single function.
The refactoring objectives:
- Extract each item type into its own class or module (Aged Brie, Backstage Passes, Sulfuras, normal items)
- Add support for a new "Conjured" item type that degrades twice as fast as normal items
- Keep the test suite passing throughout
- Reach a design where adding new item types means adding code, not modifying existing code
The complex conditional logic and the requirement to add a new feature make this an ideal context management test. You will be tempted to rush the extraction. Resist. The exercise works best when you slow down enough to observe what is happening to your session.
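For orientation, the legacy function has roughly this shape (an abridged Python sketch, not the kata's exact code — the real function nests further, and the deeper branches are elided here):

```python
# Abridged sketch of the legacy shape: every item type's rules are
# interleaved in one function, guarded by name comparisons.
class Item:
    def __init__(self, name, sell_in, quality):
        self.name, self.sell_in, self.quality = name, sell_in, quality

def update_quality(items):
    for item in items:
        if item.name != "Aged Brie" and item.name != "Backstage passes to a TAFKAL80ETC concert":
            if item.quality > 0 and item.name != "Sulfuras, Hand of Ragnaros":
                item.quality -= 1
        else:
            if item.quality < 50:
                item.quality += 1
                # ...further nested backstage-pass special cases elided...
        if item.name != "Sulfuras, Hand of Ragnaros":
            item.sell_in -= 1
        # ...post-sell-date adjustments elided...
```

Every new rule means another branch threaded through this function, which is exactly what the refactoring objectives above are designed to eliminate.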
Setup
Clone the repository and choose your preferred language:
git clone https://github.com/emilybache/GildedRose-Refactoring-Kata.git
cd GildedRose-Refactoring-Kata
The kata is available in 30+ languages. TypeScript or Python work well for this exercise:
TypeScript setup:
cd TypeScript
npm install
npm test
Python setup:
cd python
pip install pytest
pytest
Verify that the existing tests pass.
Look at the updateQuality() function and appreciate what you are about to untangle.
Spend 5-10 minutes reading before starting the agent session.
The experiment
This refactoring will take many turns.
The exercise structures the work into phases that practice specific context management techniques.
Use Claude Code; the session commands (/compact, /clear, plan mode) are what you are practicing.
Phase 1: Discovery with plan mode (turns 1-5)
Enter plan mode before touching any code. The agent will use read-only tools to analyze the codebase without being tempted to start implementing.
Your prompt:
Analyze this Gilded Rose codebase. I need to:
1. Refactor the nested conditionals in updateQuality()
2. Extract each item type into its own class
3. Add a new "Conjured" item type
Before making any changes, analyze the existing code and create a refactoring plan.
Identify the distinct item behaviors and propose a target architecture.
What to practice:
- Enter plan mode with Shift+Tab twice
- Watch how plan mode keeps the agent from jumping into code
- Exit plan mode only when the plan is complete
What to notice:
- How many files did the agent read?
- Did it find all the item types?
- Does the proposed architecture make sense before any code exists?
Phase 2: Characterization testing (turns 6-12)
Before refactoring, add tests that capture current behavior. These tests are your safety net and become context the agent can reference later.
Your prompt:
Add characterization tests for the existing updateQuality() behavior.
Cover each item type:
- Normal items (quality decreases by 1, by 2 after sellIn passes)
- Aged Brie (quality increases)
- Backstage passes (complex quality rules)
- Sulfuras (never changes)
Include edge cases: quality boundaries (0, 50), sellIn boundary (0), negative sellIn.
Start a scratchpad file.
Create refactoring-notes.md to track progress:
# Gilded Rose Refactoring Progress
## Characterization Tests
- [ ] Normal items: basic degradation
- [ ] Normal items: double degradation after sellIn
- [ ] Normal items: quality floor (0)
- [ ] Aged Brie: quality increase
- [ ] Aged Brie: quality ceiling (50)
- [ ] Backstage passes: +1 when sellIn > 10
- [ ] Backstage passes: +2 when sellIn 6-10
- [ ] Backstage passes: +3 when sellIn 1-5
- [ ] Backstage passes: drops to 0 after concert
- [ ] Sulfuras: no change ever
## Refactoring Steps
- [ ] Extract base Item updater
- [ ] Extract AgedBrie class
- [ ] Extract BackstagePass class
- [ ] Extract Sulfuras class
- [ ] Add Conjured class
## Decisions Made
(record architectural choices here)
Update this file after each test group. The scratchpad survives compaction; conversation memory does not.
What to notice:
- Does the scratchpad reduce how often you re-explain context?
- Did you need to remind the agent of previous work?
- Are all tests passing before you proceed?
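Characterization tests pin current behavior down before any code moves. Here is a pytest-style sketch for the Python port; the tiny `GildedRose` stand-in below exists only to make the example self-contained (in the kata you would import the real class and the tests would document whatever it actually does):

```python
# Stand-in for the kata's Item and GildedRose, covering only the cases
# the example tests exercise. In the real exercise, import the legacy
# code instead and let its behavior drive the assertions.
class Item:
    def __init__(self, name, sell_in, quality):
        self.name, self.sell_in, self.quality = name, sell_in, quality

class GildedRose:
    def __init__(self, items):
        self.items = items

    def update_quality(self):
        for item in self.items:
            if item.name == "Sulfuras, Hand of Ragnaros":
                continue  # legendary: never changes
            item.sell_in -= 1
            if item.name == "Aged Brie":
                item.quality = min(50, item.quality + 1)
            else:
                step = 2 if item.sell_in < 0 else 1
                item.quality = max(0, item.quality - step)

def test_normal_item_degrades_by_one():
    item = Item("Elixir of the Mongoose", 5, 7)
    GildedRose([item]).update_quality()
    assert (item.sell_in, item.quality) == (4, 6)

def test_normal_item_degrades_twice_after_sell_date():
    item = Item("Elixir of the Mongoose", 0, 7)
    GildedRose([item]).update_quality()
    assert item.quality == 5

def test_quality_never_negative():
    item = Item("Elixir of the Mongoose", 5, 0)
    GildedRose([item]).update_quality()
    assert item.quality == 0

def test_sulfuras_never_changes():
    item = Item("Sulfuras, Hand of Ragnaros", 0, 80)
    GildedRose([item]).update_quality()
    assert (item.sell_in, item.quality) == (0, 80)
```

Each test name doubles as a scratchpad entry: the checklist items above map one-to-one onto test functions like these.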
Phase 3: Incremental extraction (turns 13-25)
This is the long phase. Extract one item type at a time, running tests after each extraction.
The sequence:
1. Extract base pattern (turns 13-15)
   Extract a base Item updater pattern. Start with normal items. Create a class or function that handles the default degradation behavior. Keep the original function working by delegating to the new code. Run tests after each change.
2. Extract Aged Brie (turns 16-18)
   Extract Aged Brie handling into its own class. It should implement the same interface as the base item updater. The main updateQuality() should delegate to this class for Aged Brie items.
3. Extract Backstage Passes (turns 19-21)
   Extract Backstage Passes handling. This has the most complex logic:
   - Quality +1 when sellIn > 10
   - Quality +2 when sellIn 6-10
   - Quality +3 when sellIn 1-5
   - Quality drops to 0 after concert (sellIn < 0)
   Verify each rule with existing tests.
4. Extract Sulfuras (turns 22-24)
   Extract Sulfuras handling. This is the simplest: do nothing. Sulfuras never changes quality or sellIn.
5. Consolidate (turn 25)
   Review the extracted classes. The original updateQuality() should now be a dispatcher that routes to the appropriate handler. Ensure all tests pass. Update the scratchpad with completed steps.
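By the end of the consolidation step, the code might look something like this sketch. The class and registry names are illustrative (your agent may choose others); only the item-name strings match the kata:

```python
# One updater class per item type; updateQuality() becomes a dispatcher.
class Item:
    def __init__(self, name, sell_in, quality):
        self.name, self.sell_in, self.quality = name, sell_in, quality

class NormalUpdater:
    def update(self, item):
        item.sell_in -= 1
        step = 2 if item.sell_in < 0 else 1  # double decay past sell date
        item.quality = max(0, item.quality - step)

class AgedBrieUpdater:
    def update(self, item):
        item.sell_in -= 1
        # your characterization tests decide whether the increase changes
        # after the sell date; this sketch keeps it simple
        item.quality = min(50, item.quality + 1)

class BackstagePassUpdater:
    def update(self, item):
        item.sell_in -= 1
        if item.sell_in < 0:
            item.quality = 0  # concert over
        elif item.sell_in < 5:
            item.quality = min(50, item.quality + 3)
        elif item.sell_in < 10:
            item.quality = min(50, item.quality + 2)
        else:
            item.quality = min(50, item.quality + 1)

class SulfurasUpdater:
    def update(self, item):
        pass  # legendary: never changes

UPDATERS = {
    "Aged Brie": AgedBrieUpdater(),
    "Backstage passes to a TAFKAL80ETC concert": BackstagePassUpdater(),
    "Sulfuras, Hand of Ragnaros": SulfurasUpdater(),
}

def update_quality(items):
    # the former tangle reduces to a one-line dispatch per item
    for item in items:
        UPDATERS.get(item.name, NormalUpdater()).update(item)
```

The registry is what makes Phase 4 cheap: a new item type is a new class plus one dictionary entry.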
What to practice during Phase 3:
- Compact at turn 18-20. The conversation has accumulated implementation details from earlier extractions. Compact before context degradation affects quality, not after.
- Check the scratchpad before each extraction. Ask the agent to read it and confirm progress before continuing.
- Commit after each successful extraction. Git commits are checkpoints. If later work breaks something, you can roll back.
- Watch for degradation symptoms:
  - Agent asks about decisions you already made
  - Agent proposes changes that contradict earlier work
  - Agent forgets test results from earlier in the session
If symptoms appear:
- Check context utilization
- Run /compact with a focus prompt
- Or start fresh with the scratchpad providing continuity
What to notice:
- At what turn did you first compact?
- Did you see degradation symptoms?
- How did the scratchpad help maintain continuity?
- Did you use /rewind for any failed attempts?
Phase 4: Add the new feature (turns 26-30+)
The refactoring is done. Adding Conjured items should be straightforward if your new architecture actually supports extension.
Your prompt:
Add support for Conjured items. According to the requirements:
- Conjured items degrade in quality twice as fast as normal items
Create a new Conjured item handler following the pattern established
for other item types. Add tests for Conjured behavior.
The existing code should not need modification beyond the dispatcher.
What to consider:
- The session is long at this point
- Should you continue or start fresh with the scratchpad?
- A fresh session with good file context might outperform a degraded long session
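Assuming the updater/dispatcher shape from Phase 3 (class and registry names illustrative), the Conjured extension can be as small as this sketch:

```python
# Conjured items degrade twice as fast as normal items, in both the
# normal and the past-sell-date regimes.
class Item:
    def __init__(self, name, sell_in, quality):
        self.name, self.sell_in, self.quality = name, sell_in, quality

class ConjuredUpdater:
    def update(self, item):
        item.sell_in -= 1
        step = 4 if item.sell_in < 0 else 2  # twice the normal rate
        item.quality = max(0, item.quality - step)

# The only touch point in existing code is one registration line, e.g.:
# UPDATERS["Conjured Mana Cake"] = ConjuredUpdater()
```

If your agent instead has to edit several existing handlers to fit this in, that is a signal the Phase 3 extraction did not finish the job.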
What to notice:
- How many lines of existing code needed modification?
- Did the architecture support clean extension?
- Was the session still coherent, or had quality degraded?
Analysis
After completing the refactoring, record your session metrics:
| Metric | Value |
|---|---|
| Total turns | |
| Compaction events | |
| Session resets | |
| Rewind uses | |
| Tests added | |
| Classes/modules created | |
Then answer these questions:
- Scratchpad effectiveness: How much did the scratchpad reduce re-explaining? Did you find yourself reading it back to the agent?
- Compaction timing: Did you compact proactively or reactively? What triggered your compaction decisions?
- Plan mode value: Did starting in plan mode prevent false starts? How did the analysis phase affect later work?
- Checkpoint discipline: Did you commit after each extraction? If you used /rewind, what prompted it?
- Fresh start threshold: At what point would starting a fresh session have been more effective? What signals indicated that threshold?
What you will probably observe
Around turn 25-30, even well-managed sessions show degradation. The agent asks about decisions recorded in the scratchpad, or proposes changes that conflict with earlier extractions. This is expected. The module's guidance about context limits is based on observations like these.
The scratchpad becomes essential. Reading it back to the agent at phase transitions restores context more reliably than trusting conversation history.
Compacting at turns 18-20, before symptoms appear, maintains higher quality than waiting. Once degradation is visible, context is already lost.
Sessions that skip plan mode often waste early turns on implementation that gets discarded. The analysis phase pays off in reduced total turns.
If Conjured items require modifying existing handlers, the refactoring is incomplete. Clean extension with minimal modification is how you know the architecture works.
Variations
Variation A: Parallel sessions
Instead of one long session, try three:
- Session 1: Extract Aged Brie and Sulfuras
- Session 2: Extract Backstage Passes
- Session 3: Add Conjured and integrate
Use git worktrees or branches to isolate the work. Compare total turns and quality to the single-session approach.
Variation B: No scratchpad
Attempt the refactoring without an external scratchpad. Rely only on conversation context and the codebase. Note when and how context loss manifests. This variation makes the scratchpad's value obvious.
Variation C: Different language
Repeat in a language with different idioms: Go, Ruby, or Java. Observe whether context management needs change based on language verbosity and file organization.
What this exercise teaches
Long sessions require active management. A 30+ turn refactoring cannot rely on conversation memory alone. External context (scratchpads, CLAUDE.md, committed code) compensates for compaction losses.
Plan mode separates analysis from implementation. Starting in read-only mode prevents wasted work on approaches the codebase cannot support.
Incremental progress with verification works. Running tests after each extraction catches errors early. Small verified steps accumulate into large changes more reliably than ambitious single attempts.
Proactive compaction preserves quality. Waiting for symptoms means context is already lost. Compact at natural breakpoints rather than when problems appear.
Scratchpads bridge compaction gaps. What survives compaction is unpredictable. What survives in a file is deterministic. Decisions, progress, and architectural choices belong in files.
Completion
The exercise is complete when:
- All original tests pass
- Each item type has its own handler
- Conjured items are implemented and tested
- The original updateQuality() is a clean dispatcher
- The context management analysis is documented
- The scratchpad reflects the complete refactoring journey
Working code is the minimum bar. The real goal is demonstrating that context management enables coherent work across sessions that would otherwise fall apart.