Context is the part of a prompt that gives the model the information it needs to answer. It is also the part people get most consistently wrong: too little context produces hallucinations, too much produces drift, and the wrong order produces a model that answers a question you did not ask.

A few practical rules cover most cases.

Rule 1: provide what is necessary, not what is available

The temptation, especially with long-context models, is to paste the whole document and ask a narrow question. Resist it. Long-context evaluation work over the past few years consistently finds the same thing: models do worse the more irrelevant material they have to wade through. This is true even for models advertised with million-token windows.

The pattern to aim for: do the relevance filtering yourself or with a retrieval system, then hand the model a focused context. If you are answering "what does our contract say about indemnification?" you want the contract's indemnification section, not the whole contract.
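The filtering step can be as simple as splitting the document into sections and keeping the ones most related to the question. A minimal sketch, with word overlap standing in for a real relevance scorer (everything here, including the sample contract, is illustrative):

```python
import re

def relevant_sections(document: str, query: str, top_k: int = 2) -> str:
    """Keep only the sections that share the most words with the query."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    sections = [s.strip() for s in document.split("\n\n") if s.strip()]
    query_words = words(query)
    # Rank sections by word overlap with the query; a real system would
    # use embeddings or a retrieval index instead.
    best = sorted(sections, key=lambda s: len(query_words & words(s)), reverse=True)
    return "\n\n".join(best[:top_k])

contract = (
    "1. Payment. Fees are due within 30 days of invoice.\n\n"
    "2. Indemnification. Each party shall indemnify the other "
    "against third-party claims.\n\n"
    "3. Termination. Either party may terminate with 30 days notice."
)
focused = relevant_sections(
    contract, "what does our contract say about indemnification?", top_k=1
)
```

The model then sees only the indemnification section, not the whole contract.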

Rule 2: the position matters

Attention is not uniform over the context window. Tokens at the start of the context (right after the system prompt) and at the end (right before the model's turn to speak) get disproportionate attention. Tokens in the middle of a long context can get genuinely overlooked, which is the "lost in the middle" effect.

In practice:

  • Put critical context at the top.
  • Put the specific question at the bottom, just before the model answers.
  • Avoid burying the one paragraph that contains the answer in the middle of a 50-page document if you can help it.

For multi-document tasks, give the model a clear separator between documents and a short index at the top of the prompt that lists what each document is.
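One way to sketch that layout: an index up top, a consistent separator per document, and the question last. The separator style and function name here are illustrative, not a standard:

```python
def build_multidoc_prompt(docs: dict[str, str], question: str) -> str:
    """Assemble a prompt with a document index, clear separators, and the
    question at the very end, where attention is strongest."""
    index = "\n".join(f"- DOC {i}: {title}" for i, title in enumerate(docs, 1))
    bodies = "\n\n".join(
        f"=== DOC {i}: {title} ===\n{body}"
        for i, (title, body) in enumerate(docs.items(), 1)
    )
    return (
        f"You will be given {len(docs)} documents.\n"
        f"Index:\n{index}\n\n"
        f"{bodies}\n\n"
        f"QUESTION: {question}"
    )

prompt = build_multidoc_prompt(
    {"Q3 report": "Revenue was flat.", "Q4 report": "Revenue grew 12%."},
    "How did revenue change from Q3 to Q4?",
)
```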

Rule 3: separate context from instructions

Mixing instructions and context in one wall of text confuses the model. Use clear structural markers:

INSTRUCTION:
Summarize the following meeting transcript in 5 bullets, focused on decisions.

CONTEXT:
---
[meeting transcript here]
---

OUTPUT:

The triple-dash fence is not magic; any consistent separator works. The point is to make it unambiguous to the model where its task ends and its data begins. This also makes prompt-injection attacks (where the context itself contains instructions) easier to detect and less likely to be acted on.
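The template above is easy to turn into a small helper so the separator stays consistent across every request. A minimal sketch:

```python
def framed_prompt(instruction: str, context: str) -> str:
    """Build a prompt with instruction first, data fenced by a consistent
    separator, and the output cue last."""
    return (
        "INSTRUCTION:\n"
        f"{instruction}\n\n"
        "CONTEXT:\n"
        "---\n"
        f"{context}\n"
        "---\n\n"
        "OUTPUT:\n"
    )

p = framed_prompt(
    "Summarize the following meeting transcript in 5 bullets, focused on decisions.",
    "[meeting transcript here]",
)
```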

Rule 4: tell the model what is NOT in context

If the model might need information you have not provided, say so up front:

The document below is the only source of truth. If the answer is not in the document, say "not stated in this document" rather than guessing.

This single sentence substantially reduces hallucination on Q&A tasks. The model is otherwise trained to be helpful, which often means inventing a plausible answer.
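In code, the grounding clause is just a constant you prepend to every grounded Q&A prompt. A sketch, with illustrative names and layout:

```python
# Grounding clause: tells the model to decline rather than guess.
GROUNDING = (
    "The document below is the only source of truth. If the answer is not "
    'in the document, say "not stated in this document" rather than guessing.'
)

def grounded_prompt(document: str, question: str) -> str:
    """Prepend the grounding clause, then document, then the question last."""
    return f"{GROUNDING}\n\nDOCUMENT:\n{document}\n\nQUESTION: {question}"

out = grounded_prompt("Fees are due within 30 days.", "When are fees due?")
```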

Rule 5: keep examples consistent with the real context

If you are using few-shot examples and they have a different style or structure than the real context, you are sending mixed signals. Either rewrite the examples to match the real input, or write the prompt to handle both naturally.
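The simplest way to keep examples and real input consistent is to render both from the same template. A sketch with a made-up ticket-labeling task:

```python
# One template renders both the few-shot examples and the real input,
# so the model never sees two competing formats.
TICKET_FORMAT = "SUBJECT: {subject}\nBODY: {body}\nLABEL: {label}"

examples = [
    TICKET_FORMAT.format(subject="Can't log in", body="Password reset loop.", label="auth"),
    TICKET_FORMAT.format(subject="Double charge", body="Billed twice in March.", label="billing"),
]

# The real input uses the same shape, with the label left blank to complete.
real_input = "SUBJECT: App crashes\nBODY: Crashes on startup.\nLABEL:"
prompt = "\n\n".join(examples + [real_input])
```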

When to use retrieval instead of stuffing

If you find yourself doing any of these, you have outgrown direct context-stuffing and want retrieval (RAG):

  • Pasting the same long reference document into every request.
  • Maintaining a "context" file that grows over time.
  • Truncating documents to fit the window.
  • Watching the model miss facts that are clearly in the prompt.

A small retrieval setup is not heavy: a vector database, a chunking strategy, and a top-k lookup before the prompt. The cost of setting it up is repaid the first time a user asks something specific about your large corpus and the model actually gets it right.
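The chunk-then-top-k core of that setup fits in a few lines. A sketch where naive paragraph splitting stands in for a real chunking strategy and word overlap stands in for the embedding similarity a vector database would provide:

```python
import re

def chunk(document: str) -> list[str]:
    """Naive chunking: split on blank lines. Real systems use smarter splits."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def top_k(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k chunks with the most word overlap with the query."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    def score(c: str) -> int:
        return len(q & set(re.findall(r"[a-z]+", c.lower())))
    return sorted(chunks, key=score, reverse=True)[:k]

corpus = chunk(
    "Shipping takes 3-5 business days.\n\n"
    "Refunds are issued within 14 days of return.\n\n"
    "Support is available by email."
)
hits = top_k(corpus, "how long do refunds take?", k=1)
# hits go into the prompt in place of the full corpus.
```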

When to use long context instead of retrieval

There are real cases where long context beats retrieval:

  • The relevant information is genuinely spread across the whole document and can't be chunked cleanly (e.g., understanding the style of a 30-page document to match it).
  • The task is whole-document analysis: "summarize," "rewrite," "translate."
  • The document is small enough (under ~20k tokens) that retrieval would be over-engineering.

Even in those cases, the rules above still apply: put the instruction at the top, the question at the bottom, and tell the model what to do if the document does not cover it.

Context is the prompt's bandwidth. Use it like bandwidth: with intent, with structure, and with awareness of what you are spending it on.