Human Review, Provenance, and Stratified Metrics

In the previous lessons, we covered how to build multi-agent systems, handle errors, and manage massive context windows. However, an enterprise system is only as valuable as its auditability. If an agent approves a loan or denies a warranty claim, the architecture must be able to show why that decision was made. This lesson covers how to architect for Provenance, integrate Human Review, and measure success using Stratified Metrics.

1. The "Black Box" Problem and Provenance

A major barrier to deploying LLMs in regulated industries (like healthcare, finance, or legal) is the "Black Box" effect. If an agent outputs a summary, it is hard to tell whether it drew the information from the provided documents or hallucinated it from its training data.

Provenance is the architectural guarantee of data lineage. It is the ability to trace every factual claim made by the agent directly back to a specific row in a database, a specific API response, or a specific paragraph in a provided document.

2. Architecting for Provenance

Provenance is not a default LLM behavior; it must be strictly engineered into the system prompt and the JSON schema.

Architects enforce provenance using two primary prompt engineering techniques:

Direct Quoting Fences: Force the agent to extract and output exact string matches from the source text into a designated <quotes> XML block before it is allowed to write its final synthesized response. This anchors the response in text that actually appears in the source, and it lets your application code verify each quote with an exact string match.
Reference Tagging: Require the agent to append the exact source ID (e.g., [Doc_4_Page_12]) to the end of any factual claim it generates. If the source ID is missing, your application code should automatically flag the sentence as a potential hallucination.
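
The two techniques above only help if application code checks them. The following is a minimal sketch of those checks, not a definitive implementation. It assumes one quote per line inside the <quotes> block, source IDs shaped like [Doc_4_Page_12], and a naive sentence splitter; the function names are hypothetical.

```python
import re

# Assumed conventions: one quote per line inside <quotes>, IDs like [Doc_4_Page_12].
QUOTES_RE = re.compile(r"<quotes>(.*?)</quotes>", re.DOTALL)
TAG_RE = re.compile(r"\[Doc_\d+_Page_\d+\]")


def unverified_quotes(agent_output: str, source_text: str) -> list[str]:
    """Return quoted lines that are not exact substrings of the source text."""
    match = QUOTES_RE.search(agent_output)
    if match is None:
        # A missing fence is itself a provenance failure.
        raise ValueError("no <quotes> block found in agent output")
    quotes = [line.strip() for line in match.group(1).splitlines() if line.strip()]
    return [quote for quote in quotes if quote not in source_text]


def untagged_sentences(answer: str) -> list[str]:
    """Return sentences that carry no source ID and should be flagged for review."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and not TAG_RE.search(s)]


source = "The warranty lasts 24 months. Batteries are excluded."
output = (
    "<quotes>\n"
    "The warranty lasts 24 months.\n"
    "Batteries are included.\n"
    "</quotes>\n"
    "The warranty lasts 24 months [Doc_4_Page_12]. It covers batteries."
)
answer = output.split("</quotes>")[1]

print(unverified_quotes(output, source))  # quotes that do not appear in the source
print(untagged_sentences(answer))         # sentences with no source ID
```

A flagged sentence is only a potential hallucination: a tag proves the agent cited something, not that the cited passage supports the claim, so the quote check and the tag check are best used together.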

3. Human Review Patterns (HITL)

Even with strict provenance, high-stakes workflows require a Human-in-the-Loop (HITL). Architects must decide where to place the human in the workflow stream.

Synchronous Review (The Gatekeeper): The agent's workflow halts completely. It drafts a response or prepares a database mutation, but the application refuses to execute the action until a human clicks "Approve" in the UI. Used for high-risk write actions (e.g., executing infrastructure code, sending external customer emails).
Asynchronous Review (The Auditor): The agent completes the task autonomously, but the entire session log (the inputs, the tool calls, the provenance tags, and the output) is packaged and sent to a review queue. Human auditors sample 5% of these logs weekly to score the agent's performance. Used for low-risk, high-volume tasks (e.g., tagging internal support tickets).
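
The two review patterns can be sketched as a single dispatch function. This is an illustrative sketch under stated assumptions: the action names in HIGH_RISK_ACTIONS, the approve callback (standing in for the "Approve" button), and the in-memory review_queue list are all hypothetical placeholders for whatever your UI and queueing system provide.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical action names; a real system would define its own risk tiers.
HIGH_RISK_ACTIONS = {"send_external_email", "run_infrastructure_code"}


@dataclass
class ProposedAction:
    name: str
    payload: dict = field(default_factory=dict)


def execute_with_review(
    action: ProposedAction,
    run: Callable[[ProposedAction], str],
    approve: Callable[[ProposedAction], bool],
    review_queue: list,
    sample_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> str:
    """Gate high-risk actions on approval; sample low-risk ones for audit."""
    rng = rng or random.Random()
    if action.name in HIGH_RISK_ACTIONS:
        # Synchronous review: nothing executes until a human approves.
        if not approve(action):
            return "rejected"
        return run(action)
    # Asynchronous review: execute first, then sample the record for auditors.
    result = run(action)
    if rng.random() < sample_rate:
        review_queue.append({"action": action, "result": result})
    return result


queue: list = []
executed: list = []


def run_action(action: ProposedAction) -> str:
    executed.append(action.name)
    return "done"


# The human declines, so the high-risk action never runs.
print(execute_with_review(ProposedAction("send_external_email"), run_action, lambda a: False, queue))
# The low-risk action runs immediately; about 5% of such runs land in the queue.
print(execute_with_review(ProposedAction("tag_ticket"), run_action, lambda a: False, queue))
```

In a production system the queued record would carry the full session log described above (inputs, tool calls, provenance tags, and output), not just the action and its result.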

4. The Failure of Binary Metrics

When designing feedback systems for LLMs, developers often default to a simple "Thumbs Up / Thumbs Down" button for users. For an AI Architect, this telemetry is close to useless on its own, because it records that something failed without recording what failed.

If a user clicks "Thumbs Down," you do not know what broke. Did the agent hallucinate a fact? Did it use the wrong brand tone? Did it format the JSON incorrectly? Did it fail to use a tool?

5. Stratified Metrics

To successfully iterate on an agentic system, you must implement Stratified Metrics. This means breaking down the concept of "quality" into distinct, measurable categories.

When human reviewers audit an agent's output (or when you build automated evaluation pipelines), they should score the output across multiple independent axes:

Factual Accuracy (Groundedness): Did the agent invent any information not present in the provided context?
Instruction Adherence: Did the agent follow every instruction in the prompt, including negative constraints (e.g., "Do not use exclamation points")?
Tool Selection Accuracy: Given the user's prompt, did the agent choose the most efficient tool, or did it take an unnecessary path?
Formatting/Schema Compliance: Did the final output perfectly match the requested JSON or Markdown structure?
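
One way to record these axes is a small score record per reviewed output, aggregated into a failure rate per axis. The sketch below assumes a simple pass/fail judgment on each axis; the class and field names are hypothetical, and a real rubric might use graded scores instead of booleans.

```python
from collections import Counter
from dataclasses import dataclass, fields


@dataclass
class StratifiedScore:
    """One reviewer's pass/fail judgment per axis for a single agent output."""
    factual_accuracy: bool
    instruction_adherence: bool
    tool_selection: bool
    schema_compliance: bool


def failure_breakdown(scores: list[StratifiedScore]) -> dict[str, float]:
    """Return the fraction of reviewed outputs that failed each axis."""
    if not scores:
        return {}
    failures: Counter = Counter()
    for score in scores:
        for axis in fields(score):
            if not getattr(score, axis.name):
                failures[axis.name] += 1
    return {axis.name: failures[axis.name] / len(scores) for axis in fields(StratifiedScore)}


reviews = [
    StratifiedScore(True, True, True, True),
    StratifiedScore(False, True, True, True),
    StratifiedScore(False, True, True, False),
    StratifiedScore(True, True, True, True),
]
print(failure_breakdown(reviews))
```

Unlike a single thumbs-down count, the breakdown for this sample data points at groundedness as the dominant failure, which tells you to work on provenance rather than on tone or formatting.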

6. Automated Evaluations (LLM-as-a-Judge)

Relying entirely on manual human review is impractical at enterprise scale. Architects build automated Eval Pipelines to test their systems.

Before deploying a change to a system prompt or a tool description, you run thousands of historical prompts through the new architecture. Instead of humans grading the output, you spin up a separate, highly capable LLM (like Claude 3.5 Sonnet or Opus) configured as a "Judge."

The Judge LLM is given the Stratified Metrics rubric and the agent's output. It scores the output on each axis, giving you a quantitative pass rate per axis for your new architecture before it ever touches a production user.
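
The pipeline described above can be sketched as follows. This is a minimal sketch, not a production harness: the agent and judge are passed in as plain callables so that no particular model API is assumed, the rubric wording and axis names are hypothetical, and the judge is assumed to reply with a JSON object containing one boolean per axis.

```python
import json
from typing import Callable

# Hypothetical axis names mirroring the Stratified Metrics rubric.
AXES = ("factual_accuracy", "instruction_adherence", "tool_selection", "schema_compliance")

RUBRIC = (
    "You are grading an AI agent. For each axis, answer true (pass) or false (fail). "
    "Reply with a JSON object only, using exactly these keys: " + ", ".join(AXES) + "."
)


def build_judge_prompt(context: str, agent_output: str) -> str:
    """Combine the rubric, the source context, and the output to be graded."""
    return f"{RUBRIC}\n\nContext:\n{context}\n\nAgent output:\n{agent_output}"


def parse_verdict(raw: str) -> dict[str, bool]:
    """Parse the judge's reply, rejecting verdicts that skip an axis."""
    data = json.loads(raw)
    missing = [axis for axis in AXES if axis not in data]
    if missing:
        raise ValueError(f"judge verdict is missing axes: {missing}")
    return {axis: bool(data[axis]) for axis in AXES}


def run_eval(
    cases: list[dict],
    agent: Callable[[str], str],
    judge: Callable[[str], str],
) -> dict[str, float]:
    """Replay historical prompts through the agent and return pass rate per axis."""
    if not cases:
        raise ValueError("no evaluation cases provided")
    passes = {axis: 0 for axis in AXES}
    for case in cases:
        output = agent(case["prompt"])
        verdict = parse_verdict(judge(build_judge_prompt(case["context"], output)))
        for axis in AXES:
            passes[axis] += verdict[axis]
    return {axis: passes[axis] / len(cases) for axis in AXES}


# Stubs standing in for the candidate architecture and the Judge LLM.
def stub_agent(prompt: str) -> str:
    return "made-up claim" if prompt == "q2" else "grounded answer"


def stub_judge(judge_prompt: str) -> str:
    verdict = {axis: True for axis in AXES}
    verdict["factual_accuracy"] = "made-up claim" not in judge_prompt
    return json.dumps(verdict)


history = [{"prompt": "q1", "context": "doc"}, {"prompt": "q2", "context": "doc"}]
print(run_eval(history, stub_agent, stub_judge))
```

Keeping the judge behind a callable means the same harness can compare the old and new system prompts on identical historical cases, and the strict verdict parsing surfaces a malformed judge reply as an error instead of a silent pass.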