Exercise: Context Engineering Comparison
Overview
This exercise demonstrates how context engineering directly affects code generation quality. The same feature request produces measurably different results depending on how context is structured.
The goal is not to produce perfect code on every attempt. The goal is to observe, measure, and internalize how context quality correlates with output quality.
The task
Add a --dry-run flag to tiged, a CLI tool for scaffolding projects by downloading Git repositories without history.
The flag should:
- Preview which files would be downloaded without actually downloading them
- Display file names in a readable format
- Exit without modifying the filesystem
This feature has been requested by users but not yet implemented, making it a realistic contribution opportunity.
Setup
Clone the repository and install dependencies:
git clone https://github.com/tiged/tiged.git
cd tiged
npm install

Verify the setup works:

npx tiged --help

The output should display available options including --verbose, --cache, and --mode.
Take five minutes to explore the codebase structure. Identify the CLI entry point, the core cloning logic, and how existing flags are implemented. This exploration is part of the exercise: discovering codebase structure is essential context engineering work.
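To know what to look for, here is a sketch of the kind of flag-handling code common in small Node CLIs that use a minimal argument parser such as mri. The parser choice, aliases, and flag list are assumptions to verify against the actual entry point, not a description of tiged's source.

```js
// Illustrative only: a typical mri-style CLI entry point. The parser,
// aliases, and flag names are assumptions -- compare them against the
// repository's real bin file while you explore.
import mri from 'mri';

const args = mri(process.argv.slice(2), {
  alias: { f: 'force', c: 'cache', v: 'verbose' },
  boolean: ['force', 'cache', 'verbose'], // a new --dry-run flag would likely be declared here
});

console.log(args); // e.g. { _: ['user/repo'], verbose: true, ... }
```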
The experiment
Run the same task three times using different context approaches. Use either Claude Code or Codex. Record observations for each approach before moving to the next.
Important: Start a fresh session for each approach. Context pollution from previous attempts invalidates the comparison.
Round 1: Minimal context
Provide only the feature request with no additional context.
Prompt:
Add a --dry-run flag to this CLI tool that previews which files
would be downloaded without actually downloading them.

Before proceeding, record:
- How many iterations were needed?
- Did the agent ask clarifying questions or make assumptions?
- Does the implementation follow existing patterns in the codebase?
- Does it integrate with the existing --verbose flag?
- Are there any obvious bugs or edge cases missed?
Round 2: Front-loaded context
Provide comprehensive context about the codebase before the request. Use what you learned during setup exploration to populate the context accurately.
Prompt template:
I'm working on tiged, a CLI scaffolding tool. Key context:
Repository structure:
- [CLI entry point file]: CLI entry point using [argument parsing library]
- [core module]: Main degit() function that handles cloning
- [utils location]: Helper functions
Existing patterns:
- Flags are defined using [describe how flags are added]
- The degit() function accepts an options object
- Verbose output uses [describe the verbose pattern]
- The tool extracts tarballs from GitHub/GitLab/etc into target directories
Add a --dry-run flag that:
- Lists files that would be extracted without downloading
- Uses the existing verbose output pattern
- Integrates with current option handling
- Works with all supported hosts (GitHub, GitLab, Bitbucket, Sourcehut)

Note: Fill in the bracketed sections with actual details from your codebase exploration. The accuracy of this context directly affects the quality of generated code.
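To ground the "options object" pattern mentioned above, here is a hedged sketch of tiged's programmatic API with verbose info events, plus a comment marking where a hypothetical dryRun option would slot in. Option names and event shapes should be verified against the repository; the dryRun line is the proposed feature, not something that exists today.

```js
// Sketch under assumptions: verify option names and the emitter API
// against the repository before relying on this.
import tiged from 'tiged';

const emitter = tiged('tiged/tiged', {
  cache: false,
  force: true,
  verbose: true,
  // dryRun: true, // proposed option -- the feature this exercise adds
});

// Verbose mode surfaces info events; a dry-run implementation could reuse
// this channel to list files instead of extracting them.
emitter.on('info', (info) => console.log(info.message));

await emitter.clone('/tmp/tiged-demo');
```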
Before proceeding, record:
- How did the number of iterations change?
- Did the implementation match existing code style?
- Were edge cases (different hosts, subdirectory clones) handled?
- How does code quality compare to Round 1?
Round 3: Test-driven context
Provide failing tests as the primary context, letting tests specify requirements.
Prompt:
Add a --dry-run flag to tiged. Here are the tests it should pass:
test('dry-run lists files without downloading', async () => {
const result = await runCLI('--dry-run tiged/tiged');
// Should list files
expect(result.stdout).toContain('package.json');
expect(result.stdout).toContain('README');
// Should not create any files
expect(fs.existsSync('tiged')).toBe(false);
});
test('dry-run works with subdirectory targets', async () => {
const result = await runCLI('--dry-run tiged/tiged/src');
expect(result.stdout).toContain('index');
expect(result.stdout).not.toContain('package.json');
});
test('dry-run shows file count summary', async () => {
const result = await runCLI('--dry-run tiged/tiged');
expect(result.stdout).toMatch(/\d+ files? would be extracted/);
});
Implement the feature to pass these tests. Follow existing patterns in the codebase.

Note: These tests are illustrative, showing the test-driven approach. Adapt the syntax and helpers to match the repository's actual test framework (vitest, Jest, etc.).
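The tests call a runCLI helper that the repository may not provide out of the box. Below is a minimal sketch of one possible helper, assuming a Node test environment; the bin path and argument handling are placeholders to adapt.

```js
// Minimal sketch of the runCLI helper the example tests assume.
// The bin path and working-directory handling are assumptions; adapt
// them to the repository's actual layout and test setup.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

export async function runCLI(argString, cwd = process.cwd()) {
  const args = argString.split(/\s+/);
  // Assumes the CLI entry point lives at ./src/bin.js -- adjust as needed.
  const { stdout, stderr } = await execFileAsync('node', ['./src/bin.js', ...args], { cwd });
  return { stdout, stderr };
}
```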
Record your observations:
- Did the tests clarify requirements that were ambiguous in previous rounds?
- How did first-attempt success rate compare?
- Were edge cases (subdirectory targets) handled correctly?
- How does implementation quality compare?
Analysis
After completing all three rounds, compare your observations.
Quantitative comparison
| Metric | Minimal | Front-loaded | Test-driven |
|---|---|---|---|
| Iterations to working code | | | |
| Clarifying questions asked | | | |
| Edge cases handled | | | |
| Followed existing patterns | | | |
| First-attempt success | | | |
Qualitative observations
Consider these questions:
- Context hierarchy impact: How did project context (codebase understanding) vs prompt context (immediate instructions) affect results?
- Signal density: Which approach provided the highest signal-to-noise ratio? Did any approach include context that confused rather than helped?
- Pattern recognition: Did the agent discover existing patterns (verbose output, option handling) on its own? When did explicit pattern documentation help?
- Failure modes: What failure patterns from the module appeared? Memory failures? Confusion from mixed signals? Confabulation about APIs that do not exist?
- The 95% first-attempt rule: Research shows successful implementations typically happen on the first attempt. How did initial prompt quality correlate with final success?
Expected observations
Most practitioners find:
- Minimal context produces working code but often violates codebase conventions. The agent makes reasonable assumptions that happen to be wrong for this specific project. Integration with existing features (verbose mode, host support) is frequently missed.
- Front-loaded context produces more idiomatic code with fewer iterations. Explicit pattern documentation prevents common mistakes. However, excessive context can introduce noise that degrades quality.
- Test-driven context often produces the most reliable results for edge cases. Tests remove ambiguity about expected behavior. However, tests must be well designed; poor tests produce poor code.
The optimal approach is typically a hybrid: front-load essential patterns and constraints, then use tests to specify precise behavior.
Variations
For additional practice:
Variation A: Context pollution recovery
Start with minimal context. When the first attempt fails, add context incrementally. Observe how accumulated conversation history affects later attempts. Compare to starting fresh with front-loaded context.
Variation B: Different repositories
Repeat the exercise with a different repository of your choice. Note whether context sensitivity varies by codebase complexity.
Variation C: Measure semantic similarity
If using Claude Code, ask the agent to summarize its understanding after each context approach. Compare how accurately the summaries reflect the actual codebase. This approximates the semantic similarity measurement discussed in the module.
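If you want to go beyond eyeballing the summaries, one rough approach is to obtain embedding vectors for each summary and for a reference description of the codebase (from any embedding model you have access to) and compare them with cosine similarity. Only the comparison step is sketched here; the embedding source is left to you.

```js
// Cosine similarity between two embedding vectors. How you obtain the
// vectors (which embedding model or API) is up to you; this only sketches
// the comparison step for Variation C.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vectors must have equal length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Example: compare an agent summary's embedding against a reference
// embedding of the actual codebase (vectors are placeholders here).
// const score = cosineSimilarity(summaryEmbedding, referenceEmbedding);
```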
Key takeaways
This exercise demonstrates several principles from the module:
- Context quality determines output quality. The same model with different context produces dramatically different results. Model capability is table stakes; context engineering is the differentiator.
- Front-loading essential patterns reduces iterations. Telling the agent about existing conventions prevents it from inventing incompatible ones. This aligns with the research finding that explicit specifications reduce back-and-forth by 68%.
- Tests remove ambiguity. Narrative descriptions leave room for interpretation. Tests provide concrete success criteria. This explains why test-as-specification outperforms prose descriptions in code generation studies.
- The first prompt matters most. Successful implementations typically happen on the first attempt. Iterating from a poor starting point rarely recovers to optimal quality. Invest time in prompt construction rather than correction cycles.
- Context is not just prompts. Project structure, file organization, and existing patterns are all context. The agent reads the codebase. Well-organized codebases provide implicit context that reduces explicit prompting needs.
Completion
The exercise is complete when:
- All three approaches have been attempted with fresh sessions
- Observations have been recorded for each approach
- The comparison analysis has been completed
- At least one variation has been explored (optional but recommended)
The goal is not perfect code. The goal is developing intuition for how context engineering affects outcomes, an intuition that improves with every subsequent agent interaction.