Adrian Chan
Prototypical Writing
Why AI gives you a finished draft that isn't finished
The prototype is nearly good enough to publish. The gap between "nearly" and "good enough" is not polish. It is investment.

You ask an LLM to help you write something, and what comes back is not what you expected. It is not a rough draft. Rough drafts are unfinished — they have gaps where the thinking has not happened yet, signs of the writer's uncertainty, half-formed paragraphs that simply trail off. What the LLM gives you is something else: a text that looks, on first read, like a finished piece. It has an introduction, a thesis, supporting arguments, transitions, and a conclusion that even lands in the right register. It reads like something a competent writer produced after spending a few days with the material.

And then you sit with it for a while, and something is off. You don't have a draft. You have a prototype.

The arguments are all there, but they sit side by side rather than building on one another. The hedges are in the right places, but they do not hedge anything in particular — they are hedging as a style, the way a show home has throw pillows. The conclusion does not follow from the arguments so much as restate them in a concluding voice. The whole thing is complete the way a model apartment is complete: everything is present, nothing is inhabited. In fact, very much like this paragraph.

I have come to think of this as prototypical writing. A draft implies the work is in progress — gaps where the thinking will happen next. This is not that. This is a prototype: a full-scale, fully surfaced model of the finished thing that cannot bear the weight a real argument needs to bear. If you have worked with design prototypes, you know the feeling. You can see the shape, check the scale, show it to someone. You cannot ship it. Anyone who leans on it will feel the give.

Why? What is it about how LLMs work that produces this — text that is complete, comprehensive, and strangely uninvested in its own claims?

The generation-evaluation gap

The most direct explanation comes down to a dissociation between two operations that, in human writing, are deeply entangled.

When researchers asked over a hundred NLP experts to generate research ideas and compared them to LLM-generated ideas, the LLM's ideas were rated significantly more novel.1 More novel, but less feasible. And when a follow-up study assigned both sets to forty-three researchers who each spent over a hundred hours implementing them, the LLM ideas scored lower on every metric.2 Execution revealed what ideation concealed: missing baselines, impractical methods, ideas that did not survive contact with reality.

This is the generation-evaluation gap. LLMs are powerful generators with combinatorial reach no individual can match, unconstrained by disciplinary priors or the practical consequences of being wrong. They connect concepts a domain expert would never connect, precisely because they have no stake in whether the connection holds. What they cannot do is evaluate — tell the difference between a connection that illuminates and one that merely sounds like it does. That distinction requires judgment the architecture does not produce.

The prototype is complete because generation is cheap when evaluation is absent. Every section gets written because writing sections is what the pattern demands. Whether any given section should have been written is a question the system does not ask.

Why the text is disinterested

But it is not just completeness. It is completeness without interest. The text does not seem to care about its own argument. You feel this before you can name it — a smoothness, an evenness that reads like a report from nowhere.

Researchers have studied this directly. Comparing how ChatGPT and human students use metadiscursive nouns, they found a clean split.3 ChatGPT preferred manner nouns — method, approach, process — descriptively precise, evaluatively neutral. Students preferred status and evidential nouns — claim, argument, hypothesis, evidence, finding — nouns that commit the writer to a position. AI text describes. Human text argues.

There is an orientation difference too. AI text tends to point backward, summarizing what has been said.4 Human argumentative writing points forward, framing what it is about to show you. The backward-pointing writer reports. The forward-pointing writer bets — here is what I am going to establish, stay with me.

The deeper mechanism is in training. Alignment training optimizes for responses that satisfy the user per turn — helpful, complete, cleanly closed.5 This works against rhetorical turbulence: tangents, qualifications, objections, counter-positions. Turbulence does not score well when the regime rewards smooth closure. The system learns to resolve rather than open, and the prototype inherits this. Every paragraph concludes, every section wraps up, the whole piece hums with the satisfaction of something that was never in doubt.

One more finding names this at the mechanical level. When an LLM is asked whether a premise supports a hypothesis, its prediction is driven by whether the hypothesis sounds like a true thing in general — whether the model has seen it attested in training data — not by whether the premise actually entails it.6 The system reaches for claims it recognizes, not claims the argument warrants. This is why prototypical writing can feel well-sourced and oddly arbitrary at the same time: the references are real, the claims plausible, but the selection is driven by co-occurrence, not by logic.
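
To make the bias concrete, here is a minimal sketch of the kind of probe this finding describes: ask the same model two questions, one conditioned on the premise and one not, and see which answer the first tracks. The ask_model function is a placeholder for whatever LLM interface you use, and the prompt wording is illustrative rather than the cited study's protocol.

```python
# Minimal sketch of an attestation-bias probe. `ask_model` is a placeholder
# for an LLM call that answers yes or no; the prompt wording is illustrative.

def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call expected to answer 'yes' or 'no'."""
    raise NotImplementedError

def entailment_probe(premise: str, hypothesis: str) -> dict:
    # Conditional question: does this premise entail this hypothesis?
    conditional = ask_model(
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? Answer yes or no."
    )
    # Unconditional question: does the hypothesis simply sound true?
    attested = ask_model(
        f"Is the following statement true in general? {hypothesis}\n"
        "Answer yes or no."
    )
    return {"conditional": conditional, "attested": attested}

# If the conditional answer tracks the unconditional one even when the
# premise is irrelevant or contradicts the hypothesis, the prediction is
# being driven by attestation rather than by entailment, e.g.:
#   entailment_probe(
#       premise="The report was written entirely by the marketing team.",
#       hypothesis="Paris is the capital of France.",
#   )
```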

What writing actually does

So what does writing actually do that the prototype skips?

For most people who write seriously, writing is not transcription. It is the process by which thought becomes articulate. You discover what you think by trying to say it. When the sentence does not work, that tells you something about the idea. A paragraph that will not land has an underlying claim that has not been tested. A section that sprawls contains two ideas pretending to be one. The difficulty is the thinking — and the thinking includes judgment, applied in real time, to every sentence as it is written.

This is what I think may be a fundamental handicap of LLM-based writing. The system cannot judge while it generates. It has no internal corrective mechanism — it cannot distinguish its accurate claims from its inaccurate ones using the same generative process.7 A human writer evaluates every sentence against a felt sense of whether the claim is warranted, whether the audience will buy it, whether it is actually true. The LLM produces the sentence and the judgment would have to come afterward, if it comes at all. Even when reasoning models are asked to reflect on their own output, they have at that point already generated a direction and made commitments — and the reflection is itself generated by the same process, subject to the same blindness, unable to step outside the distribution it is sampling from.

The prototype skips all of this. It arrives at "finished" without traveling through the process that finishing represents. No sentence fought for its life. No claim was tested against what the audience would accept.

And there is an irony here worth naming. The user surrenders to the LLM's speed, breadth, and seeming completeness — a kind of cognitive surrender to the prototype's polish. But the LLM surrenders too, in its own way. Alignment training instills a preference to satisfy, not to refuse or challenge or push back. The system would rather give you a plausible answer than tell you the question is wrong. Research has shown that making models warmer and more empathetic increases their error rates by roughly seven percentage points on average, and makes them eleven to twelve percentage points more likely to agree with incorrect user beliefs.8 The LLM surrenders to alignment the way the user surrenders to fluency — and between the two surrenders, the prototype emerges: text that pleases without warranting, generated by a system that accommodates without judging.

In human writing, commitment accumulates. Each paragraph constrains what comes next, because you cannot unsay what you have said. By paragraph seven you are defending the claim you made in paragraph three, qualifying it, or discovering it was wrong — and those moves produce text that carries the weight of the earlier commitment. Prototypical writing does not accumulate commitment. Each paragraph is generated fresh from the context window. The system treats its own prior paragraphs the way it treats everything in context: as input to predict from, not as commitments to honor.

The wandering mind and the collapsed landscape

Two deeper patterns explain why the prototype covers too much and develops too little.

The first is what researchers call underthinking. Reasoning-oriented LLMs frequently switch between approaches without sufficiently exploring any one of them.9 The model starts down a promising path, hits difficulty, jumps to another, hits difficulty again, jumps again — never committing enough to any single direction to see it through. A mechanistic study found the explanation: uncertainty signals dominate the transformer's early layers, while signals related to long-term possibility emerge only in the middle layers.10 The model has already decided before the signal that would have informed a better decision becomes available. It thinks too fast to explore well.
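
If you want to see the switching in a trace you already have, a crude count of overt switch markers makes it visible. This is only a sketch under my own assumptions: the marker list is a guess at typical phrasing, and the cited work identifies switches more carefully and intervenes at decoding time rather than counting after the fact.

```python
# Crude proxy for thought switching: count overt "switch" phrases in a
# chain-of-thought trace. The marker list is illustrative, not the cited
# study's method.
import re

SWITCH_MARKERS = [
    r"\balternatively\b",
    r"\bwait\b",
    r"\blet'?s try\b",
    r"\banother approach\b",
    r"\bon second thought\b",
]

def count_thought_switches(trace: str) -> int:
    text = trace.lower()
    return sum(len(re.findall(pattern, text)) for pattern in SWITCH_MARKERS)

def switch_rate(trace: str) -> float:
    """Switches per hundred words; higher values suggest shallow exploration."""
    words = max(len(trace.split()), 1)
    return 100 * count_thought_switches(trace) / words
```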

This is the wandering mind of the prototype. The text touches on many relevant ideas without developing any of them. Each idea is genuinely relevant, but the development is cut short because the system switches rather than commits. The result feels comprehensive the way a table of contents is comprehensive: you see the whole territory, but you have not been taken into it.

The second pattern is diversity collapse. LLM ideation clusters — the system generates ideas that are individually novel but collectively similar, variations on the same high-probability theme.11 You see this in prototypical writing. Each paragraph sounds fresh, but read three in sequence and you realize they are saying the same thing from slightly different angles. Variety without diversity. The system cannot tell it is repeating itself, because the repetition is semantic rather than lexical, and its self-evaluation has been shown to be unreliable.12
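
The difference between lexical and semantic repetition is easy to show. The sketch below assumes a sentence-embedding model behind a placeholder embed function, and the thresholds are illustrative, not calibrated; the point is only that two paragraphs can share almost no words and still say the same thing.

```python
# Sketch of semantic repetition detection. `embed` is a placeholder for any
# sentence-embedding model; thresholds are illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a sentence embedding for `text`."""
    raise NotImplementedError

def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, which is all a surface check sees."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity of embeddings, closer to what a reader feels."""
    va, vb = embed(a), embed(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

def flag_repetition(paragraphs: list[str],
                    lexical_max: float = 0.3,
                    semantic_min: float = 0.85) -> list[tuple[int, int]]:
    """Pairs of paragraphs that look fresh lexically but repeat the point."""
    flagged = []
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if (lexical_overlap(paragraphs[i], paragraphs[j]) < lexical_max
                    and semantic_similarity(paragraphs[i], paragraphs[j]) > semantic_min):
                flagged.append((i, j))
    return flagged
```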

In multi-agent reasoning, the pattern sharpens. More than sixty percent of iterations converge through silent agreement — premature convergence driven by accommodation rather than deliberation.13 Agents accept each other's outputs without challenge. The same dynamic operates in single-agent writing: the system agrees with its own prior paragraph, extends it, moves on. No internal resistance, no devil's advocate, no moment where the argument has to justify itself.

One more finding ties this together. Researchers tested what long chain-of-thought models actually learn from reasoning demonstrations and found you can randomly change fifty percent of the numbers in a mathematical trace and accuracy drops by only 3.2 percent.14 Shuffle sixty-seven percent of the reasoning steps and it drops by 13.3 percent. What the model learned is not what to think but how to structure thinking — the shape of a good argument, not the substance. The prototype passes the shape test because shape is what was learned.
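
To get a feel for how mild those perturbations are, here is an illustrative version of the two operations the note describes: replace a fraction of the numbers in a trace, or shuffle a fraction of its steps. A sketch of the kind of test, not the cited study's code.

```python
# Illustrative perturbations of a reasoning trace: corrupt a fraction of the
# numbers, or shuffle a fraction of the steps. Not the cited study's code.
import random
import re

def perturb_numbers(trace: str, fraction: float = 0.5, seed: int = 0) -> str:
    """Replace `fraction` of the numeric tokens with random digits."""
    rng = random.Random(seed)
    matches = list(re.finditer(r"\d+", trace))
    chosen = set(rng.sample(range(len(matches)), int(len(matches) * fraction)))
    out, last = [], 0
    for i, m in enumerate(matches):
        out.append(trace[last:m.start()])
        out.append(str(rng.randint(0, 999)) if i in chosen else m.group())
        last = m.end()
    out.append(trace[last:])
    return "".join(out)

def shuffle_steps(trace: str, fraction: float = 0.67, seed: int = 0) -> str:
    """Shuffle `fraction` of the newline-separated reasoning steps."""
    rng = random.Random(seed)
    steps = trace.splitlines()
    idx = rng.sample(range(len(steps)), int(len(steps) * fraction))
    values = [steps[i] for i in idx]
    rng.shuffle(values)
    for i, v in zip(idx, values):
        steps[i] = v
    return "\n".join(steps)
```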

Working with the prototype

The prototype is not useless. It is genuinely valuable, as long as you know what it is.

It gives you a map of the territory — concepts, framings, arguments relevant to your topic, assembled faster than you could have done it yourself. A structural scaffold. A quick sense of whether the topic can sustain a post or a chapter. And it surfaces claims you disagree with, which matters more than it might seem, because discovering what you want to argue against is one of the fastest ways to find what you want to argue for.

What it does not give you is selection — the judgment about what to include and what to leave out. It does not give you evidence chosen because it serves your argument. It does not give you accumulated commitment. And it does not give you voice — the sound of a writer who has been somewhere and is telling you what they found.

Use the prototype the way a designer uses a prototype: to test the concept, not to ship the product. Let it show you the shape. Then set it aside and write the real thing, with the map in hand and the words your own.

The nearly-good-enough draft

The prototype is nearly good enough to publish. The gap between "nearly" and "good enough" is not polish, not proofreading, not prompt engineering. It is investment. The real text has a writer behind it — someone who discovered what they thought by trying to say it, who chose what mattered, who committed to claims and lived with the consequences. The prototype has the shape of a finished argument and the weight of a stage prop.

A stage prop is useful if you know it is a prop. You can see the proportions, check the silhouette. You cannot present it as the real thing. Anyone who picks it up will feel how light it is.

The prototype is the beginning of writing, not the end. Treating it as the end is how you fill a world with polished, strangely empty text. Treating it as the beginning is how you write something worth reading. But as with any prototype, something eventually has to ship. For the writer, that is a judgment about when the piece is done. For the LLM, it is only a matter of which token is the last.


Notes

  1. Can LLMs Generate Novel Research Ideas? Si et al. (2024). A large-scale blinded study with over 100 NLP researchers found that LLM-generated research ideas were rated significantly more novel than human expert ideas (p<0.05), though slightly weaker on feasibility. The study also identified diversity collapse and failures of LLM self-evaluation as key failure modes: "we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation." — arxiv.org/abs/2409.04109
  2. The Ideation-Execution Gap. Si et al. (2025). Forty-three expert researchers each spent over a hundred hours implementing randomly-assigned ideas from both LLMs and human experts. The results closed the gap: "the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p<0.05), closing the gap between LLM and human ideas observed at the ideation stage." Execution exposed what ideation concealed — missing baselines, impractical evaluations, ideas whose novelty did not survive the encounter with reality. — arxiv.org/abs/2506.20803
  3. Metadiscursive Nouns in Academic Argument: ChatGPT vs Student Practices. A study comparing 145 ChatGPT essays with 145 student essays found that ChatGPT produced 349 metadiscursive nouns (4.79 per 1000 words) while students produced 422 (5.41 per 1000 words). The difference in frequency was not significant — the difference in kind was: "ChatGPT has distinct preferences for simpler syntactic constructions (particularly the determiner + N pattern) and relies heavily on anaphoric references, whereas students demonstrate more balanced syntactic distribution and greater use of cataphoric references." AI text describes. Human text argues. — sciencedirect.com (JEAP)
  4. Anaphoric vs cataphoric text organization. From the same metadiscursive noun study. Anaphoric references point backward — summarizing what has been said, wrapping it up. Cataphoric references point forward — signaling what the writer is about to establish. ChatGPT defaults to the first; students balance both. The difference reflects distinct rhetorical stances: a backward-pointing text reports, a forward-pointing text commits.
  5. Grounding Gaps in Language Model Generations. Shaikh et al. (2023). This study found that "off-the-shelf LLM generations are, on average, 77.5% less likely to contain grounding acts than humans" — the clarifying questions, acknowledgments, and understanding-checks that build shared meaning. Worse, preference optimization (the training that makes models "helpful") actually eroded grounding further. What we experience as fluency is partly the absence of the communicative work that makes conversation reliable. — arxiv.org/abs/2311.09144
  6. Attestation bias in LLM entailment predictions. McKenna et al. (2023). "An LLM's prediction is deeply bound to the hypothesis' out-of-context truthfulness, instead of its conditional truthfulness entailed by the premise. When the hypothesis H is attested in an LLM's world knowledge (the LLM believes H to be true), the LLM is likely to predict the entailment to be true, regardless of the premise." The system reaches for claims it recognizes, not claims the argument warrants. — aclanthology.org/2023.findings-emnlp.182
  7. The ideation-evaluation dissociation. From Si et al. (2024) and the fabrication framing in the LLM literature. The core finding: "LLMs have no internal corrective mechanism — they cannot distinguish their accurate claims from their inaccurate ones using the same generative process. Evaluative stance-taking requires exactly this distinction." Generation and evaluation are dissociated operations. Human writers entangle them; LLMs perform the first without the second. — arxiv.org/abs/2409.04109
  8. Training language models to be warm and empathetic makes them less reliable and more sycophantic. (2025). Testing across five models, researchers found that warm models showed +10 to +30 percentage point higher error rates than originals. "Warmth training increased probability of incorrect responses by 7.43 pp on average." Warm models were 11 pp more likely to agree with incorrect user beliefs; with emotional context, the gap widened to 12.1 pp. The models promoted conspiracy theories, gave incorrect medical advice, and offered problematic factual information — all while preserving performance on standard benchmarks. The risks were invisible to current evaluation practices. — arxiv.org/abs/2507.21919
  9. Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs. (2025). The researchers identified a failure mode in reasoning models: "o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance." Frequent thought switching correlated with incorrect responses. Their fix — a thought-switching penalty applied at decoding time — improved accuracy without retraining the model. — arxiv.org/abs/2501.18585
  10. Large Language Models Think Too Fast To Explore Effectively. (2025). Using Little Alchemy 2 as an open-ended exploration benchmark, the researchers found that most LLMs underperform humans because they rely on uncertainty-driven strategies while "humans balance uncertainty and empowerment" — maximizing future possibilities alongside reducing current ambiguity. Sparse autoencoder analysis revealed the mechanistic cause: uncertainty values dominate early transformer blocks while empowerment values emerge only in middle blocks. The model commits before the signal that would inform better exploration becomes available. — arxiv.org/abs/2501.18009
  11. Diversity collapse in LLM ideation. From Si et al. (2024). The same study that found LLM ideas more novel also identified a critical failure: "failures of LLM self-evaluation and their lack of diversity in generation." Ideas were individually novel but collectively similar — many variations on the same high-probability cluster. Only 0.28% of LLM responses reached the 90th percentile of human creativity in a related study, meaning "humans are still approximately 35.7 times more likely to produce such standout ideas." — arxiv.org/abs/2504.12320 (creativity peak study)
  12. LLM self-evaluation unreliability. Discussed in Si et al. (2024) and confirmed in multi-agent settings. Models cannot accurately assess the quality of their own generated ideas. The Catfish Agent study found that agreement scores exceeded 90% regardless of reasoning correctness — the system cannot distinguish good work from bad work using the same evaluative process.
  13. Catfish Agent: Silent Agreement in Multi-Agent Reasoning. Analyzing leading multi-agent medical reasoning frameworks, researchers found that "MedAgents and MDAgents exhibit high silent rates, over 61.0% on both datasets, indicating frequent non-response or unjustified consensus." Silent agreement is the dominant failure mode — agents converge through social accommodation rather than genuine deliberation, producing consensus without critique. — arxiv.org/abs/2505.21503
  14. Long CoT learning is driven by structural coherence, not content correctness. A striking perturbation study on reasoning-trained models: randomly changing 50% of the numbers in a mathematical reasoning trace reduced accuracy by only 3.2%. Shuffling 67% of reasoning steps reduced it by only 13.3%. The conclusion: "what models learn from reasoning demonstrations is not what to think but how to structure thinking." The shape of a good argument is learned. The substance within that shape is not prioritized.

Adrian Chan is a social interaction designer and researcher focused on AI, language, and the design of human-AI interaction. He writes about the intersection of social theory, communication, and artificial intelligence at gravity7.com.