Pydantic with llm.with_structured_output — pitfalls and safeguards

with_structured_output binds a Pydantic BaseModel to a chat model so the model returns a validated object instead of free text. It works well until your own system prompt starts competing with the schema instructions the method already sends on your behalf. Here is what goes wrong in practice and how to keep it boring.

1. Mandate raw-JSON output in the system prompt

Make it explicit that the reply must be a single JSON object. Without this, the model may wrap the payload in prose or markdown fences, which breaks parsing whenever the reply is read as raw JSON (for example in JSON mode, as opposed to tool calling).
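To make the failure concrete, here is a minimal sketch using only the standard library. The reply strings are made-up examples, not captured model output, and `try_parse` is a hypothetical helper rather than anything LangChain provides.

```python
import json

FENCE = "`" * 3  # built programmatically so the example is easy to embed

raw_reply = '{"binary_score": "yes"}'
fenced_reply = f"{FENCE}json\n{raw_reply}\n{FENCE}"
chatty_reply = f"Sure! Here is the grade: {raw_reply}"


def try_parse(reply: str):
    """Return the parsed dict, or None if the reply is not pure JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None


parsed_raw = try_parse(raw_reply)        # parses cleanly
parsed_fenced = try_parse(fenced_reply)  # fails: markdown fence around the JSON
parsed_chatty = try_parse(chatty_reply)  # fails: prose before the JSON
print(parsed_raw, parsed_fenced, parsed_chatty)
```

Only the bare object survives a strict parse, which is why the prompt should ask for exactly that.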

2. Don’t restate the schema in your own prompt

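with_structured_output already hands the supplied BaseModel to the model for you. Depending on the provider and method, the schema typically travels as a tool (function) definition or a JSON-schema response format rather than as visible prompt text. If you also restate every field constraint in your own prompt, the two rule sets can drift apart and clash, leading to validation errors or hallucinated keys.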

Keep your prompt about the format. Let the method own the schema.
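One way to let the schema carry the constraints is to put them on the model itself. The sketch below assumes Pydantic v2 and uses a stricter `Literal` field than the plain `str` used elsewhere in this post; that choice, the field description, and the `is_valid` helper are illustrative assumptions.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Grade(BaseModel):
    """Binary grade returned by the grader."""

    model_config = ConfigDict(extra="forbid")  # Pydantic v2 syntax
    binary_score: Literal["yes", "no"] = Field(
        description="'yes' if the answer passes, otherwise 'no'."
    )


# The constraint now lives in the generated JSON schema, not in prompt text.
schema = Grade.model_json_schema()
allowed = schema["properties"]["binary_score"]["enum"]


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True if the payload validates against Grade."""
    try:
        Grade.model_validate(payload)
        return True
    except ValidationError:
        return False
```

Passing `Grade` to `llm.with_structured_output(Grade)` is then the only place the schema needs to appear. Note the trade-off: with `Literal`, a reply of "Yes" fails validation instead of reaching your comparison, so keep `str` if you would rather normalise the value yourself.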

3. Watch evaluation loops that wait on a `binary_score`

A common grader pattern polls until the score flips:

result = grader.invoke({...})
while result.binary_score.lower() != "yes":
    result = grader.invoke({...})   # keep polling

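Two things can go wrong here. If the model returns malformed JSON or a payload that fails validation, the call raises (typically a parsing or Pydantic validation error); as written, that exception crashes the loop, and if the call is wrapped in a catch-all retry it is silently swallowed instead. If the reply validates but is never exactly "yes" (a steady "no", or a variant such as "Yes."), the condition never flips and the workflow loops forever.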
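The fragility of that exit condition is easy to reproduce without a real model. In this sketch, `FakeGrader`, `FakeResult`, and `calls_until_exit` are hypothetical stand-ins, and the `safety_cap` exists only so the demonstration terminates; the loop above has no such cap.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class FakeResult:
    binary_score: str


class FakeGrader:
    """Hypothetical stand-in for the grader; replays canned replies forever."""

    def __init__(self, replies):
        self._replies = cycle(replies)
        self.calls = 0

    def invoke(self, _inputs):
        self.calls += 1
        return FakeResult(next(self._replies))


def calls_until_exit(grader, safety_cap=50):
    """Apply the same exit condition as the loop above, stopping at safety_cap."""
    result = grader.invoke({})
    while result.binary_score.lower() != "yes" and grader.calls < safety_cap:
        result = grader.invoke({})
    return grader.calls


flips_eventually = calls_until_exit(FakeGrader(["no", "no", "Yes"]))
never_flips = calls_until_exit(FakeGrader(["Yes."]))  # trailing period never matches
print(flips_eventually, never_flips)
```

The second grader only stops because of the cap, which is the behaviour the safeguards in the next section are meant to guarantee.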

4. Practical safeguards

Risk	Mitigation
JSON parse failure	Wrap the `invoke` call in `try/except` and break after N retries.
Unexpected fields	Set `extra = "forbid"` on the model so issues surface immediately.
Non-terminating loop	Add a `max_attempts` or timeout in the LangGraph node; return a fallback if exceeded.
Model drift (`"Yes"` vs `"yes"`)	Normalise with `.strip().lower()` before comparison.

A bounded version of the loop above:

from pydantic import BaseModel, ConfigDict

class Grade(BaseModel):
    model_config = ConfigDict(extra="forbid")  # Pydantic v2; use `class Config` on v1
    binary_score: str

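# `grader` is assumed to be the chain built with llm.with_structured_output(Grade).
max_attempts = 5
last_error = None
for attempt in range(max_attempts):
    try:
        result = grader.invoke({...})
    except Exception as exc:    # JSON / validation failure; narrow this if you can
        last_error = exc        # keep it for logging instead of swallowing it
        continue
    if result.binary_score.strip().lower() == "yes":
        break
else:
    # Loop exhausted without a "yes": don't hang.
    # fallback_response() is a placeholder you define for your workflow.
    result = fallback_response()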
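The table also lists a timeout as an alternative to `max_attempts`. A deadline-based variant might look like the sketch below; `grade_with_deadline`, its parameters, and the dataclass stand-in for the Pydantic model are all assumptions made to keep the example dependency-free.

```python
import time
from dataclasses import dataclass


@dataclass
class Grade:
    """Plain stand-in for the Pydantic model used in this post."""

    binary_score: str


def grade_with_deadline(invoke, timeout_s=10.0, pause_s=0.0, fallback="no"):
    """Call invoke() until it yields "yes" or the deadline passes.

    Returns (result, errors); errors keeps every exception instead of hiding it.
    """
    deadline = time.monotonic() + timeout_s
    errors = []
    while time.monotonic() < deadline:
        try:
            result = invoke()
        except Exception as exc:  # ideally narrowed to parse/validation errors
            errors.append(exc)
        else:
            if result.binary_score.strip().lower() == "yes":
                return result, errors
        time.sleep(pause_s)
    return Grade(binary_score=fallback), errors


# A grader that never says "yes" still returns after roughly 50 ms.
timed_out, seen_errors = grade_with_deadline(lambda: Grade("no"), timeout_s=0.05)
```

The deadline is only checked between calls, so it does not interrupt a request that is already in flight; a per-request timeout on the model client would be needed for that.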

5. Recommended minimal system prompt

You are a JSON-only assistant. Respond with a JSON object that matches the schema
exactly—no commentary, no extra keys, no markdown fencing.

That’s the whole job of your prompt: enforce the format. with_structured_output handles the schema specifics.
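As a rough illustration of how small the prompt side can stay, the sketch below builds the message list in plain Python. `build_messages` and its `question` / `document` inputs are hypothetical; the only thing taken from this post is the system prompt wording.

```python
SYSTEM_PROMPT = (
    "You are a JSON-only assistant. Respond with a JSON object that matches "
    "the schema exactly: no commentary, no extra keys, no markdown fencing."
)


def build_messages(question: str, document: str) -> list[tuple[str, str]]:
    """Return (role, content) pairs; the schema itself is deliberately absent."""
    user_content = f"Question: {question}\n\nDocument: {document}"
    return [("system", SYSTEM_PROMPT), ("human", user_content)]


messages = build_messages("What is Pydantic?", "Pydantic is a validation library.")
```

LangChain chat models generally accept a list of (role, content) tuples like this, so the result could be passed to the structured model's `invoke`; no field names or constraints appear anywhere in the prompt.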

Bottom line

Keep the system prompt succinct and delegate schema enforcement to with_structured_output. You avoid prompt collisions, surface validation issues early instead of swallowing them, and never ship a loop that can run forever.