Zero-shot and few-shot are the two basic ways to prompt a model. Zero-shot is the default: you describe the task in words and the model attempts it. Few-shot adds examples: you show the model what good input/output pairs look like, then give it the real input.

The trade-off is simple. Zero-shot is shorter and cheaper. Few-shot is more reliable, especially when the output shape matters.

When zero-shot is enough

Modern frontier models are strong zero-shot learners. For most tasks that a competent human could do from a one-paragraph description, zero-shot works. Examples:

  • "Rewrite this paragraph at a 7th-grade reading level."
  • "Translate this into German, keeping the register formal."
  • "What are the three weakest claims in this draft?"

These tasks have a natural shape and a wide tolerance for variation. The model has seen a thousand examples of rewriting and translation in training; you do not need to remind it what those look like.
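
In code, zero-shot is nothing more than concatenating the task description and the input. A minimal sketch (the helper name is illustrative, not any particular SDK):

```python
# Zero-shot: the prompt is just the task description plus the input.
# No examples, no scaffolding.

def zero_shot_prompt(instruction: str, text: str) -> str:
    """Concatenate a one-paragraph task description and the real input."""
    return f"{instruction}\n\n{text}"

prompt = zero_shot_prompt(
    "Translate this into German, keeping the register formal.",
    "We regret to inform you that the shipment has been delayed.",
)
```

Whatever string this produces is what you would send as the user message to your provider of choice.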

When few-shot starts paying off

Few-shot earns its tokens when the task has a precise, narrow shape and zero-shot produces inconsistent results. The classic cases:

  • Extraction. Pulling specific fields from messy text. The output structure is usually a JSON object or a list, and one example pins down exactly which fields and how to handle missing values.
  • Classification. Especially when the categories are non-obvious or domain-specific. "Is this customer email a refund request, a feature request, or a bug report?" is much more reliable with two examples per class.
  • Formatting. When you want a particular Markdown layout, a particular sentence cadence, or a particular vocabulary, examples carry that information far better than adjectives.
  • Style transfer. "Write this in our voice" is unworkable without examples. "Write this in our voice — here are three short examples" works.
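
For the extraction case, a single worked example does most of the heavy lifting. A sketch, with illustrative field names (`name`, `order_id`, `issue`) and one example that also pins down the missing-value convention:

```python
import json

# One worked example fixes the field set and shows how to encode a
# missing value (null). The fields and example text are illustrative.
EXAMPLE_IN = "I'm Dana, order 1182 arrived with a cracked screen."
EXAMPLE_OUT = {"name": "Dana", "order_id": "1182", "issue": "cracked screen"}

def extraction_prompt(ticket: str) -> str:
    """Build a one-shot extraction prompt for a support ticket."""
    return (
        "Extract name, order_id, and issue from the ticket as JSON. "
        "Use null for anything not mentioned.\n\n"
        f"Ticket: {EXAMPLE_IN}\n"
        f"JSON: {json.dumps(EXAMPLE_OUT)}\n\n"
        f"Ticket: {ticket}\n"
        "JSON:"
    )
```

Ending the prompt with `JSON:` nudges the model to emit the object immediately, with no preamble to strip.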

What few-shot actually does

A common misconception is that few-shot prompting "teaches" the model a new skill. It does not. The weights are frozen during inference. What few-shot does is condition the next-token distribution to look more like the examples. The model is doing the same thing it always does (predicting the next token), just with very strong recent priors.

Practical implication: the examples have to look like what you want, including style and length. If your three examples are all one sentence long, the model will produce one-sentence answers even if the real input deserves more. If your examples all end with a confidence note, the model will append a confidence note to the real answer. This is sometimes a feature, sometimes a bug.

A worked example

You want to classify support tickets into three buckets: refund, feature request, bug report.

Zero-shot attempt:

Classify the following support ticket as one of: refund, feature request, bug report.

Ticket: "Hi, the export button does nothing on Safari but works fine in Chrome. Build 4.3."
Category:

This often works. It also occasionally outputs "bug" or "Bug Report" or "It sounds like a bug report, possibly…", and now your downstream parser breaks.
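
One defensive option before reaching for few-shot is to normalize the output instead of parsing it literally. A sketch of what that patch-up code tends to look like:

```python
from typing import Optional

VALID = {"refund", "feature request", "bug report"}

def normalize_label(raw: str) -> Optional[str]:
    """Map noisy model output ('Bug Report', 'bug', 'It sounds like a
    bug report, possibly...') onto a canonical label, or None."""
    cleaned = raw.strip().lower()
    if cleaned in VALID:
        return cleaned
    # Fall back to substring matching for prose-wrapped answers.
    for label in VALID:
        if label in cleaned:
            return label
    # 'bug' alone still signals a bug report.
    if "bug" in cleaned:
        return "bug report"
    return None
```

This works, but it is exactly the kind of brittle cleanup code that a few-shot prompt makes unnecessary.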

Few-shot version:

Classify each support ticket as exactly one of: refund, feature_request, bug_report.

Examples:
Ticket: "Can you add a dark mode option in the settings?"
Category: feature_request

Ticket: "I was charged twice for May, please reverse one payment."
Category: refund

Ticket: "The export button does nothing on Safari but works fine in Chrome. Build 4.3."
Category: bug_report

Now classify this ticket. Return ONLY the category, lowercase, no prose.

Ticket: "..."
Category:

The examples lock in the exact label vocabulary (feature_request with the underscore), the format (one token, lowercase), and the level of judgment expected.
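
In practice you would assemble that prompt from data rather than hardcode the string, and validate the response strictly against the label vocabulary. A sketch:

```python
LABELS = ("refund", "feature_request", "bug_report")

# The same three examples as above, as (text, label) pairs.
EXAMPLES = [
    ("Can you add a dark mode option in the settings?", "feature_request"),
    ("I was charged twice for May, please reverse one payment.", "refund"),
    ("The export button does nothing on Safari but works fine in Chrome. "
     "Build 4.3.", "bug_report"),
]

def few_shot_prompt(ticket: str) -> str:
    """Assemble the few-shot classification prompt from the example pairs."""
    lines = [
        f"Classify each support ticket as exactly one of: {', '.join(LABELS)}.",
        "",
        "Examples:",
    ]
    for text, label in EXAMPLES:
        lines += [f'Ticket: "{text}"', f"Category: {label}", ""]
    lines += [
        "Now classify this ticket. Return ONLY the category, lowercase, no prose.",
        "",
        f'Ticket: "{ticket}"',
        "Category:",
    ]
    return "\n".join(lines)

def parse_label(raw: str) -> str:
    """Accept only an exact label; anything else is a hard error."""
    label = raw.strip()
    if label not in LABELS:
        raise ValueError(f"unexpected label: {raw!r}")
    return label
```

The strict parser is the other half of the contract: because the examples lock in the vocabulary, anything outside it is a genuine failure worth surfacing, not something to paper over.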

When you have outgrown few-shot

If you find yourself maintaining a long list of few-shot examples that you constantly tweak, you have probably outgrown prompting. Three escape hatches:

  1. JSON mode or schema-constrained output. Most providers now accept a JSON schema; the model is forced to emit valid output. Use this for any extraction task at scale.
  2. Retrieval augmentation. If the few-shot examples are really domain knowledge ("this is how our company classifies tickets"), put them in a database and retrieve the most relevant examples per request.
  3. Fine-tuning. If you have hundreds of stable examples and the task is critical, a fine-tune will usually outperform any prompt — at the cost of needing to manage a custom model.
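
Escape hatch 2 can be sketched in a few lines: instead of a fixed example list, score a pool of labeled examples against the incoming request and include only the closest matches. Word overlap stands in here for embedding similarity; the function and pool are illustrative:

```python
def retrieve_examples(query, pool, k=2):
    """Return the k (text, label) pairs from `pool` with the highest
    word overlap with the query. A real system would use embedding
    similarity; word overlap keeps the sketch dependency-free."""
    query_words = set(query.lower().split())

    def overlap(example):
        return len(query_words & set(example[0].lower().split()))

    return sorted(pool, key=overlap, reverse=True)[:k]
```

The retrieved pairs then slot into the `Examples:` section of the prompt exactly as fixed examples would, so per-request retrieval changes nothing else about the template.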

Most projects never need to escape. Zero-shot for the easy stuff, three to five examples when you need precision, and that is the entire fundamentals layer.