L2: Few-Shot Prompting for Consistency

In Lesson 4.1, we established that explicit instructions are the foundation of a reliable prompt. However, even the most meticulous rules can sometimes result in varied outputs because LLMs are inherently probabilistic. When your application's downstream code strictly depends on a specific tone, format, or JSON structure, instructions alone are often not enough.

This lesson covers Few-Shot Prompting, the architectural technique of "showing" rather than just "telling."

1. Zero-Shot vs. Few-Shot Execution

To understand the architectural shift, you must differentiate between the two primary prompting paradigms:

Zero-Shot Prompting: You provide instructions and the data, and Claude must generate the answer without having seen a prior example of success. (e.g., "Translate this text to French.")
Few-Shot Prompting: You provide instructions, data, and a few "Golden Examples" of what a perfect input-to-output mapping looks like. (e.g., "Translate to French. Example 1: Hello -> Bonjour. Example 2: Apple -> Pomme. Now translate: Car.")
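The difference between the two paradigms can be sketched as two small prompt builders. This is a minimal illustration, not a prescribed implementation: the function names `zero_shot_prompt` and `few_shot_prompt` are hypothetical, and the strings simply mirror the translation example above.

```python
def zero_shot_prompt(text: str) -> str:
    """Zero-shot: instructions plus the data, with no worked example."""
    return f"Translate to French: {text}"


def few_shot_prompt(text: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: instructions, a few input -> output pairs, then the new input."""
    lines = ["Translate to French."]
    for number, (source, target) in enumerate(examples, start=1):
        lines.append(f"Example {number}: {source} -> {target}")
    lines.append(f"Now translate: {text}")
    return "\n".join(lines)


# The two Golden Examples from the text above.
EXAMPLES = [("Hello", "Bonjour"), ("Apple", "Pomme")]

if __name__ == "__main__":
    print(zero_shot_prompt("Car"))
    print(few_shot_prompt("Car", EXAMPLES))
```

Only the prompt text differs between the two calls; the model and the request are otherwise the same.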

2. Why Consistency Requires Examples

In enterprise architecture, you are often piping Claude's output directly into a database or another API. If Claude adds a conversational prefix like "Here is the data you requested:" before outputting the data, your parser will break.
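To make the failure concrete, here is a small sketch of a strict downstream parser. The JSON payload and the helper names `parse_strict` and `breaks_parser` are assumptions made for illustration; your own pipeline may expect a different format.

```python
import json


def parse_strict(raw: str) -> dict:
    """Parse model output that must be a JSON object and nothing else."""
    return json.loads(raw)


def breaks_parser(raw: str) -> bool:
    """Return True when the raw model output cannot be parsed."""
    try:
        parse_strict(raw)
    except json.JSONDecodeError:
        return True
    return False


# Hypothetical outputs: the same data, with and without a conversational prefix.
clean = '{"company": "Apple Inc", "ticker": "AAPL"}'
chatty = "Here is the data you requested:\n" + clean

if __name__ == "__main__":
    print(breaks_parser(clean))
    print(breaks_parser(chatty))
```

The data is identical in both strings; the single extra sentence is what makes the second one unparseable.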

Few-shot prompting solves three critical consistency problems:

Format Adherence: It strongly steers the LLM toward your exact desired output structure (e.g., specific markdown tables, nested JSON, or specific date formats like YYYY-MM-DD). It does not guarantee that structure, so downstream code should still validate the output.
Tone and Persona Calibration: It is incredibly difficult to define "brand voice" via rules. Showing Claude three examples of how your customer service bot should politely decline a refund is usually far more effective than writing ten paragraphs of instructions describing the tone.
Pattern Recognition Acceleration: LLMs are pattern-matching engines. Examples remove ambiguity about how your rules should be applied, which tends to reduce misinterpretations and off-format answers. They do not make inference faster; the extra tokens add some latency (see Section 5).

3. Structuring a Production Few-Shot Prompt

Just like instructions, examples must be structurally isolated using Anthropic's recommended XML tags. Do not casually mix examples into the main text.

The Architectural Template:

<instructions>
Extract the company name and stock ticker from the user query. Output strictly in the format: [Company]: [TICKER].
</instructions>

<examples>
  <example>
    <input>How is Apple doing today?</input>
    <output>Apple Inc: AAPL</output>
  </example>
  <example>
    <input>Did you see the news about Tesla's new factory?</input>
    <output>Tesla Inc: TSLA</output>
  </example>
</examples>

<user_query>
[Dynamic User Input Injected Here]
</user_query>

Architectural Note: By using the <example> wrapper with internal <input> and <output> tags, you create a clear boundary that helps Claude treat the examples as static references rather than active commands to execute.
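In application code, the template is usually assembled from a list of example pairs rather than written by hand. The sketch below shows one way to do that. The function name `build_prompt` is hypothetical, and escaping the dynamic text with `xml.sax.saxutils.escape` is an assumption of this sketch (so that user input cannot close the `<user_query>` tag early), not something the template itself requires.

```python
from xml.sax.saxutils import escape

INSTRUCTIONS = (
    "Extract the company name and stock ticker from the user query. "
    "Output strictly in the format: [Company]: [TICKER]."
)


def build_prompt(examples: list[tuple[str, str]], user_query: str) -> str:
    """Render instructions, examples, and the user query as separate tagged blocks."""
    parts = ["<instructions>", INSTRUCTIONS, "</instructions>", "", "<examples>"]
    for example_input, example_output in examples:
        parts += [
            "  <example>",
            f"    <input>{escape(example_input)}</input>",
            f"    <output>{escape(example_output)}</output>",
            "  </example>",
        ]
    # Assumption: escape the dynamic input so it cannot break the tag structure.
    parts += ["</examples>", "", "<user_query>", escape(user_query), "</user_query>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt([("How is Apple doing today?", "Apple Inc: AAPL")],
                       "Any news on Apple?"))
```

Keeping the examples in a data structure also makes it easy to version them, test them, and swap them out without touching the instructions.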

4. Designing "Golden Examples" (Coverage and Diversity)

The most common mistake developers make with few-shot prompting is providing three near-identical examples of the "happy path." If you only show Claude what to do when everything goes right, it has no pattern to follow when it meets an edge case and is likely to improvise an answer your parser does not expect.

An architect curates a diverse set of examples:

The Happy Path: Standard input, standard output.
The Edge Case: Input with typos, weird formatting, or secondary languages.
The Failure State (Crucial): Show Claude exactly how to respond when the data is missing or the task is impossible.
- Example: <input>What is the weather?</input> <output>ERROR: Missing target company.</output>

5. The "Overfitting" Risk and Token Trade-offs

Few-shot prompting is powerful, but it comes with distinct architectural trade-offs:

Token Bloat: Examples consume context window tokens. If you include ten massive JSON examples in your system prompt, those tokens are sent with every request, which raises both your cost-per-call and your latency.
Prompt Overfitting: If your examples are too similar, the model might "overfit" to the examples and ignore the actual rules. For instance, if all your examples happen to result in a positive sentiment analysis, Claude might become biased toward classifying new inputs as positive, regardless of what the text actually says.
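Both risks can be checked mechanically before a prompt ships. The sketch below is a rough guard, not a measurement tool: the helper names are hypothetical, the 75% skew threshold is an arbitrary assumption, and the four-characters-per-token figure is only a crude rule of thumb for English text. For real budgeting, count tokens with the tokenizer or token-counting facility of the model you actually use.

```python
from collections import Counter


def label_shares(examples: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of examples carrying each output label."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_skewed(examples: list[tuple[str, str]], max_share: float = 0.75) -> bool:
    """True when a single label dominates the example set (assumed threshold)."""
    if not examples:
        return False
    return max(label_shares(examples).values()) > max_share


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size estimate; an assumption, not a real tokenizer."""
    return round(len(text) / chars_per_token)


if __name__ == "__main__":
    all_positive = [("Love it", "positive"), ("Great value", "positive"),
                    ("Works perfectly", "positive")]
    print(is_skewed(all_positive))
    print(rough_token_estimate("Love it -> positive"))
```

A check like this fits naturally into a unit test for the prompt, so a skewed or oversized example set is caught at review time rather than in production.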

The Architect's Balance: Use 2 to 4 highly diverse, concise examples. If you find yourself needing 20 examples to get Claude to behave, your underlying <instructions> are probably flawed and should be rewritten first.