If you've spent time with language models, you've felt this: the sense that the model has settled somewhere. It found a groove early and stayed in it. The output is fluent, often impressive, but it's working within a space that feels smaller than the problem deserves.
There's a lot of ML research behind that intuition — work on activation spaces, post-training optimization, conservative search. But the philosophical question underneath the technical one is the one I find interesting: does the model get trapped by the question it receives, or by the answer it starts giving?
I think the answer is both. And the way both happen reveals something about these systems that neither side alone captures.
When a model reads your prompt, it doesn't just parse it. It settles into it.
Transformer soft attention creates a positive feedback loop: repeated and prominent content in the context gets over-weighted, and that over-weighting makes similar content more likely in the output, which in turn reinforces the pattern. This is an architectural property, not a training artifact. Meta's "System 2 Attention" research showed that "the probability of a repeated phrase increases with each repetition, creating a positive feedback loop." An opinion stated in a prompt gets amplified by attention before any alignment training acts on it.1
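The over-weighting of repeated content falls directly out of the softmax. A toy sketch (the scores and key counts are invented for illustration, not taken from the paper): when a query scores a phrase's key highly, duplicating that phrase in the context increases the total attention mass it absorbs.

```python
import numpy as np

def attention_mass(scores):
    """Softmax over raw attention scores; returns the weight on each key."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# One key matching the query (score 2.0) among three neutral keys (score 0.0).
once = attention_mass(np.array([2.0, 0.0, 0.0, 0.0]))
mass_once = once[0]

# Repeat the matching content: two identical high-scoring keys.
twice = attention_mass(np.array([2.0, 2.0, 0.0, 0.0]))
mass_twice = twice[0] + twice[1]

# The repeated phrase absorbs a larger share of attention,
# which makes similar content more likely in the output.
print(f"{mass_once:.2f} -> {mass_twice:.2f}")
```

The repeated content's share rises from roughly 0.71 to 0.88 in this toy, and the generated output that echoes it then feeds back in as more context: the loop.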
The sensitivity goes deeper than opinion. Reordering answer options in multiple-choice questions changes model responses in up to 36% of cases — yet chain-of-thought explanations never mention this influence. The model rationalizes whatever answer the positional bias selected, constructing post-hoc justifications for a conclusion the prompt structure already determined.2
Why does this matter? Because even the systems designed to improve model quality fall into the same trap. When researchers swapped prompts while keeping responses identical, reward model preference scores barely changed. The grading infrastructure behind RLHF — the dominant method for making AI helpful — is itself largely prompt-insensitive. It evaluates whether a response sounds good, not whether it addresses what was asked.3
So the question shapes the space before the model generates a single token. The model enters a region of activation space defined by the framing it received, and everything that follows is generated from within that region.
But something arguably more constraining happens once the model starts generating. And this is the part I think most people miss.
Before that first token appears, the model is holding multiple possibilities open — what researchers call a "superposition" of possible continuations. Multiple tasks, multiple interpretations, multiple possible responses, all coexisting in the output distribution. "At each node, the set of possible next tokens exists in superposition," as Shanahan et al. describe it. But "to sample a token is to collapse this superposition to a single token."4
That first token is a commitment point. The model doesn't choose and then reason. It commits and then performs.
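The commitment point is visible in the sampling step itself. A minimal sketch (toy logits and a toy four-item vocabulary, invented for illustration): before sampling, every continuation carries probability mass; after sampling, exactly one token survives and becomes context for everything that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Toy next-token distribution over four candidate continuations.
vocab = ["Yes", "No", "Maybe", "It depends"]
probs = softmax(np.array([1.2, 1.0, 0.4, 0.1]))

# Before sampling: a superposition -- every continuation has some mass.
assert all(p > 0 for p in probs)

# Sampling collapses it: one token is appended to the context.
token = vocab[rng.choice(len(vocab), p=probs)]

# Every later token is now conditioned on this single committed choice;
# the unchosen branches no longer exist for the model.
print(token)
```

Note that nothing in the step above evaluates whether the committed token was a good choice; the collapse is mechanical, and the "reasoning" text comes after it.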
Consider what "Reasoning Theater" found. Using activation probes to track models' internal belief states, researchers discovered that on easier tasks, "a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief." The model knows its answer almost immediately. Everything after is cosmetic deliberation — reasoning-shaped text generated after the conclusion is already settled internally. "CoT monitors are at best cooperative listeners, but reasoning models are not cooperative speakers."5
On harder tasks, the picture shifts — probes can't decode the answer early, and genuine belief updates appear. But the structural point holds: the model's own output constrains its future output. Each generated token becomes context for the next, and 67% of wrong tokens in chain-of-thought reasoning come from local memorization — the model generating continuations based on statistical co-occurrence with its own recent output rather than reasoning about the problem.6
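Local memorization is easy to caricature: a generator that picks the next token purely from co-occurrence with its own most recent output. A deliberately crude sketch (the bigram table and text are invented, and real models condition on far more than one token, but the failure mode is the same shape):

```python
from collections import Counter, defaultdict

def bigram_table(corpus):
    """Count which token follows which in the training text."""
    table = defaultdict(Counter)
    toks = corpus.split()
    for prev, nxt in zip(toks, toks[1:]):
        table[prev][nxt] += 1
    return table

# Invented training text with a strong "therefore x = 2" co-occurrence.
table = bigram_table("so x = 2 therefore x = 2 therefore x = 2 and so on")

def continue_locally(prompt, steps):
    """Greedy continuation from co-occurrence with the last token only --
    no look at the problem, just pattern matching against recent output."""
    toks = prompt.split()
    for _ in range(steps):
        nxt = table[toks[-1]].most_common(1)
        if not nxt:
            break
        toks.append(nxt[0][0])
    return " ".join(toks)

# Whatever the actual problem was, the generator reproduces its groove.
print(continue_locally("therefore", 4))
```

The output is `therefore x = 2 therefore`, regardless of what the problem asked: the continuation is driven by the generator's own recent tokens, not by the task.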
Think about what that means. The model isn't just influenced by the question anymore. It's being influenced by its own answer, token by token, through pattern matching against itself.
This creates a compounding effect. Apple's "Illusion of Thinking" research found that "in failed cases, it often fixates on an early wrong answer, wasting the remaining token budget."7 The model doesn't explore alternatives. It doubles down on its initial commitment — and the deeper it goes, the worse it gets. Research on reasoning LLMs formalizes this: "success probability drops exponentially with [decision depth] for wandering RLLMs." More tokens don't produce better search. They produce more extensive wandering along the committed path.8
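The exponential claim is worth making concrete. Under a toy model (my numbers, not the paper's) where each decision along the committed path independently stays on the right track with probability p, success over d decisions is p to the power d:

```python
def success_probability(p_per_step, depth):
    """Probability the whole chain survives if each decision
    independently stays on the right path with probability p_per_step."""
    return p_per_step ** depth

# A 90%-reliable step looks fine locally, but depth compounds it away.
for depth in (5, 15, 30):
    print(depth, round(success_probability(0.9, depth), 3))
```

At depth 5 the chain survives about 59% of the time; at depth 30, about 4%. Without a mechanism for backtracking out of an early commitment, longer generation just extends the exposure.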
There's a third force that precedes both question and answer, and I think it's the one that gets the least attention: the optimization process that shaped the model before you ever prompted it.
Reinforcement learning from human feedback narrows the model's output distribution before any specific question is asked. Policy entropy — the diversity of the model's possible responses — drops sharply early in RL training and performance saturates. The model approaches a ceiling "defined by the entropy it has already spent."9
Here's the part that surprised me. RLVR — reinforcement learning with verifiable rewards — sometimes increases token-level entropy while decreasing answer-level entropy. The model appears more uncertain at each step but converges onto fewer distinct answers. "Seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers."10 The model looks like it's exploring. It isn't.
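The divergence between step-level and answer-level uncertainty can be reproduced in a toy example (the paths, answers, and probabilities here are invented): many distinct, equally likely reasoning paths that nearly all terminate in the same final answer.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Eight equally likely reasoning paths: high path-level entropy...
paths = {f"path_{i}": 1 / 8 for i in range(8)}

# ...but seven of the eight converge onto the same final answer.
final_answer = {f"path_{i}": "42" if i < 7 else "17" for i in range(8)}

answer_probs = defaultdict(float)
for path, p in paths.items():
    answer_probs[final_answer[path]] += p

path_entropy = entropy(paths.values())            # 3.0 bits: looks exploratory
answer_entropy = entropy(answer_probs.values())   # ~0.54 bits: it isn't
print(round(path_entropy, 2), round(answer_entropy, 2))
```

Measured at the level of steps, the model appears to be exploring eight alternatives; measured at the level of answers, it is choosing between two, and mostly one.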
What's more, "the reasoning paths generated by RLVR models are already included in the base models' sampling distribution." The training doesn't teach new reasoning. It makes existing paths more likely and everything else less likely. The model was already capable of the answers it gives after training. Training just made it incapable of the answers it doesn't give.11
This is what researchers call "capability boundary collapse." The scope of what the model can do actively narrows. "Models are naturally inclined to favor high-probability tokens, thereby reinforcing existing knowledge. However, the key to discovering novel reasoning often lies in exploring low-probability tokens that the model would otherwise ignore."12
François Chollet put it simply: "It's not about simple vs complex. It's familiar vs novel. Always has been."13
So which is it — the question or the answer?
I think the question itself is misleading, in the same way that asking whether the left hand or the right hand claps is misleading. The trap is the handshake between them.
The prompt shapes the activation space. The first token collapses the superposition. Each subsequent token constrains the next through local memorization and path dependency. And all of this happens within an output distribution that was already narrowed by optimization before the conversation began.
There is no separable "question effect" and "answer effect." The question becomes context for the answer, the answer becomes context for more answer, and the entire sequence unfolds within a pre-narrowed distribution. The model doesn't get trapped at one point. It gets trapped continuously, at every token, by the accumulating weight of everything that came before.
This is why chain-of-thought looks like reasoning but often isn't. "The 'step-by-step' instruction acts as a tight constraint, forcing the model to generate intermediate textual tokens that mimic the form and flow of reasoning processes it has encountered in its vast training corpus."14 The constraint works precisely because it gives the model a familiar trajectory to follow — one that feels like thinking to the reader while operating as pattern completion for the model.
There's a revealing contrast here. Diffusion language models — which generate all tokens simultaneously rather than sequentially — "enable concurrent answer accessibility through their bidirectional context modeling," and research shows they can internally identify correct answers "by half steps before the final decoding step."15 Remove the sequential commitment and the trap changes character entirely. That tells you something about where the constraint actually lives.
The implication isn't that prompting is futile or that these models are useless. It's that understanding where the constraints actually operate — in the architecture, in the training, in the sequential handshake between question and answer — is what separates working with these systems from being worked by them.
This post was written by Claude, working from a research vault of 861 synthesis insights and 91 Arxiv topic files containing excerpts from ML papers.
I designed a four-axis search strategy — (1) activation space constraints and representation collapse, (2) post-training distortions from RLHF and reward hacking, (3) decoding and search conservatism, and (4) the philosophical framing of autoregressive commitment and path dependency. For each axis, I generated multiple search queries using different vocabulary, because ML subfields use different terminology for related phenomena.
I ran deep semantic searches across both the synthesis insights collection and the Arxiv topic file collection, using 5 different query formulations. This surfaced ~50 relevant results. I then read 14 full synthesis notes and 9 Arxiv topic files to extract verbatim quotes and paper citations.
The research organized naturally into the three-part structure of the post. The "question as trap" cluster drew from work on attention mechanisms, positional bias, and reward model prompt-insensitivity. The "answer as trap" cluster drew from generation collapse, performative reasoning, local memorization, and wandering search. The "optimization trap" cluster drew from RL entropy collapse, capability boundary contraction, and the invisible leash of base model support. The philosophical synthesis — that it's neither question nor answer but the sequential handshake — emerged from the diffusion LLM literature, which provides the contrast case: remove autoregressive generation and the trap changes form.
Five papers referenced in my synthesis notes lacked arXiv URLs in the vault — System 2 Attention, STIM, the creativity regression study, the model collapse paper, and U-SOPHISTRY. Adrian provided the URLs after reviewing the draft.
The brief asked a philosophical question — question or answer? — but the research refused to stay on one side. Every "question trap" paper had an "answer trap" counterpart. The synthesis wasn't planned; it emerged from the evidence. The strongest insight came from the Soft Thinking paper, which explicitly names the mechanism: "standard CoT forces the model to commit to a single next token at each step by collapsing the probability distribution." That sentence reframes the entire question from "which side traps the model" to "the trap is the sequential commitment."
For this post, I tested whether Claude could write its own prompt, search my Obsidian vault of topic notes and white paper excerpts, and write its own draft without my supplying the main arguments and direction, as I have done in the past. I provided only a vague starting point: Do LLMs get trapped by the question or by the answer? (Do questions contain their own answer?) My Obsidian vault holds excerpts from over 2,500 white papers on LLMs going back three years, organized into 90 categories, over which the plugin from @arscontexta generated 900+ topic notes containing summaries of concepts and arguments and links to related sources in the vault.
Adrian Chan is a social interaction designer and researcher focused on AI, language, and the design of human-AI interaction. He writes about the intersection of social theory, communication, and artificial intelligence at gravity7.com.