Zero-shot and few-shot are the two basic ways to prompt a model. Zero-shot is the default: you describe the task in words and the model attempts it. Few-shot adds examples: you show the model what good input/output pairs look like, then give it the real input.

The trade-off is simple. Zero-shot is shorter and cheaper. Few-shot is more reliable, especially when the output shape matters.

When zero-shot is enough

Modern frontier models are strong zero-shot learners. For most tasks that a competent human could do from a one-paragraph description, zero-shot works. Examples:

  • "Rewrite this paragraph at a 7th-grade reading level."
  • "Translate this into German, keeping the register formal."
  • "What are the three weakest claims in this draft?"

These tasks have a natural shape and a wide tolerance for variation. The model has seen a thousand examples of rewriting and translation in training; you do not need to remind it what those look like.
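
In code, zero-shot is nothing more than concatenating the task description and the input. A minimal sketch (the helper name is illustrative, not any particular SDK):

```python
# Zero-shot: the prompt is just the task description plus the input.
# No examples, no scaffolding.

def zero_shot_prompt(instruction: str, text: str) -> str:
    """Concatenate a one-paragraph task description and the real input."""
    return f"{instruction}\n\n{text}"

prompt = zero_shot_prompt(
    "Translate this into German, keeping the register formal.",
    "We regret to inform you that the shipment has been delayed.",
)
```

Whatever string this produces is what you would send as the user message to your provider of choice.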

When few-shot starts paying off

Few-shot earns its tokens when the task has a precise, narrow shape and zero-shot produces inconsistent results. The classic cases:

  • Extraction. Pulling specific fields from messy text. The output structure is usually a JSON object or a list, and one example pins down exactly which fields and how to handle missing values.
  • Classification. Especially when the categories are non-obvious or domain-specific. "Is this customer email a refund request, a feature request, or a bug report?" is much more reliable with two examples per class.
  • Formatting. When you want a particular Markdown layout, a particular sentence cadence, or a particular vocabulary, examples carry that information far better than adjectives.
  • Style transfer. "Write this in our voice" is unworkable without examples. "Write this in our voice — here are three short examples" works.
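
For the extraction case, a single worked example does most of the heavy lifting. A sketch, with illustrative field names (`name`, `order_id`, `issue`) and one example that also pins down the missing-value convention:

```python
import json

# One worked example fixes the field set and shows how to encode a
# missing value (null). The fields and example text are illustrative.
EXAMPLE_IN = "I'm Dana, order 1182 arrived with a cracked screen."
EXAMPLE_OUT = {"name": "Dana", "order_id": "1182", "issue": "cracked screen"}

def extraction_prompt(ticket: str) -> str:
    """Build a one-shot extraction prompt for a support ticket."""
    return (
        "Extract name, order_id, and issue from the ticket as JSON. "
        "Use null for anything not mentioned.\n\n"
        f"Ticket: {EXAMPLE_IN}\n"
        f"JSON: {json.dumps(EXAMPLE_OUT)}\n\n"
        f"Ticket: {ticket}\n"
        "JSON:"
    )
```

Ending the prompt with `JSON:` nudges the model to emit the object immediately, with no preamble to strip.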

What few-shot actually does

A common misconception is that few-shot prompting "teaches" the model a new skill. It does not. The weights are frozen during inference. What few-shot does is condition the next-token distribution to look more like the examples. The model is doing the same thing it always does (predicting the next token), just with very strong recent priors.

Practical implication: the examples have to look like what you want, including style and length. If your three examples are all one sentence long, the model will produce one-sentence answers even if the real input deserves more. If your examples all end with a confidence note, the model will append a confidence note to the real answer. This is sometimes a feature, sometimes a bug.

A worked example

You want to classify support tickets into three buckets: refund, feature request, bug report.

Zero-shot attempt:

Classify the following support ticket as one of: refund, feature request, bug report.

Ticket: "Hi, the export button does nothing on Safari but works fine in Chrome. Build 4.3."
Category:

This often works. It also occasionally outputs "bug" or "Bug Report" or "It sounds like a bug report, possibly…", and now your downstream parser breaks.
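
One defensive option before reaching for few-shot is to normalize the output instead of parsing it literally. A sketch of what that patch-up code tends to look like:

```python
from typing import Optional

VALID = {"refund", "feature request", "bug report"}

def normalize_label(raw: str) -> Optional[str]:
    """Map noisy model output ('Bug Report', 'bug', 'It sounds like a
    bug report, possibly...') onto a canonical label, or None."""
    cleaned = raw.strip().lower()
    if cleaned in VALID:
        return cleaned
    # Fall back to substring matching for prose-wrapped answers.
    for label in VALID:
        if label in cleaned:
            return label
    # 'bug' alone still signals a bug report.
    if "bug" in cleaned:
        return "bug report"
    return None
```

This works, but it is exactly the kind of brittle cleanup code that a few-shot prompt makes unnecessary.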

Few-shot version:

Classify each support ticket as exactly one of: refund, feature_request, bug_report.

Examples:
Ticket: "Can you add a dark mode option in the settings?"
Category: feature_request

Ticket: "I was charged twice for May, please reverse one payment."
Category: refund

Ticket: "The export button does nothing on Safari but works fine in Chrome. Build 4.3."
Category: bug_report

Now classify this ticket. Return ONLY the category, lowercase, no prose.

Ticket: "..."
Category:

The examples lock in the exact label vocabulary (feature_request with the underscore), the format (one token, lowercase), and the level of judgment expected.
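
In practice you would assemble that prompt from data rather than hardcode the string, and validate the response strictly against the label vocabulary. A sketch:

```python
LABELS = ("refund", "feature_request", "bug_report")

# The same three examples as above, as (text, label) pairs.
EXAMPLES = [
    ("Can you add a dark mode option in the settings?", "feature_request"),
    ("I was charged twice for May, please reverse one payment.", "refund"),
    ("The export button does nothing on Safari but works fine in Chrome. "
     "Build 4.3.", "bug_report"),
]

def few_shot_prompt(ticket: str) -> str:
    """Assemble the few-shot classification prompt from the example pairs."""
    lines = [
        f"Classify each support ticket as exactly one of: {', '.join(LABELS)}.",
        "",
        "Examples:",
    ]
    for text, label in EXAMPLES:
        lines += [f'Ticket: "{text}"', f"Category: {label}", ""]
    lines += [
        "Now classify this ticket. Return ONLY the category, lowercase, no prose.",
        "",
        f'Ticket: "{ticket}"',
        "Category:",
    ]
    return "\n".join(lines)

def parse_label(raw: str) -> str:
    """Accept only an exact label; anything else is a hard error."""
    label = raw.strip()
    if label not in LABELS:
        raise ValueError(f"unexpected label: {raw!r}")
    return label
```

The strict parser is the other half of the contract: because the examples lock in the vocabulary, anything outside it is a genuine failure worth surfacing, not something to paper over.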

When you have outgrown few-shot

If you find yourself maintaining a long list of few-shot examples that you constantly tweak, you have probably outgrown prompting. Three escape hatches:

  1. JSON mode or schema-constrained output. Most providers now accept a JSON schema; the model is forced to emit valid output. Use this for any extraction task at scale.
  2. Retrieval augmentation. If the few-shot examples are really domain knowledge ("this is how our company classifies tickets"), put them in a database and retrieve the most relevant examples per request.
  3. Fine-tuning. If you have hundreds of stable examples and the task is critical, a fine-tune will usually outperform any prompt — at the cost of needing to manage a custom model.
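
Escape hatch 2 can be sketched in a few lines: instead of a fixed example list, score a pool of labeled examples against the incoming request and include only the closest matches. Word overlap stands in here for embedding similarity; the function and pool are illustrative:

```python
def retrieve_examples(query, pool, k=2):
    """Return the k (text, label) pairs from `pool` with the highest
    word overlap with the query. A real system would use embedding
    similarity; word overlap keeps the sketch dependency-free."""
    query_words = set(query.lower().split())

    def overlap(example):
        return len(query_words & set(example[0].lower().split()))

    return sorted(pool, key=overlap, reverse=True)[:k]
```

The retrieved pairs then slot into the `Examples:` section of the prompt exactly as fixed examples would, so per-request retrieval changes nothing else about the template.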

Most projects never need to escape. Zero-shot for the easy stuff, three to five examples when you need precision, and that is the entire fundamentals layer.