The Knowledge Custodian — Part 3: The Consequences. Adrian Chan

The Knowledge Custodian

The Consequences

Part 3 of 3

Part 1 described the custodial shift — how AI transforms the expert's role. Part 2 identified four structural dimensions of expertise that AI cannot replicate. Now we arrive at the question that matters most. What happens when these structural absences meet the real world?

The answer is not that AI fails. It is that AI succeeds in ways that look like expertise but function differently — and the differences compound in domains where getting it right matters.

Debate Without Authority

So what is at stake when AI generates expertise? Start with debate. Expert knowledge is produced through debate — not formal, structured debate with rules and judges, but the ongoing, messy, multi-year process of competing claims, challenged assumptions, defended positions, and grudging consensus that characterizes every functioning expert community. A new finding is published, colleagues push back, the author responds, and third parties weigh in. Over time, a claim either withstands challenge and enters the body of accepted knowledge, or it doesn't.

This process depends on social mechanisms that are invisible in the text it produces. Debates are not always won by those with the best argument, and the authority of the claimant contributes to whether their argument wins. Social dynamics, institutional context, audience predisposition, and timing all shape outcomes in ways that formal logic does not capture. This is not a defect. The social dimension serves as a filter that purely textual analysis cannot provide. An argument from a trusted authority carries more weight because the community's investment in evaluating that individual over time is itself a form of distributed quality control.

So can AI simulate this process? Multi-agent debate — where multiple LLMs argue with each other to reach a conclusion — has been proposed as a mechanism for improving AI reasoning. And the results are instructive.

Research has identified a dual failure mode in multi-agent debate: "agents' obstinate adherence to incorrect viewpoints and their propensity to abandon correct viewpoints."¹ AI agents are either stubbornly wrong or unwarrantedly capitulant. Neither pattern maps to how human expert debate works, where social authority, reputation, and institutional accountability constrain both pathological adherence and unwarranted capitulation.

The Dual Failure of AI Debate

Obstinate Adherence

Stubbornly wrong

Won't revise despite evidence

Unwarranted Capitulation

Abandons correct views

Agrees to end the argument

↑

61%

Silent Agreement

Looks like deliberation. Isn't.

↓

Human Expert Debate

Authority, reputation, and accountability
calibrate between the extremes

AI debate oscillates between two failure modes that human expert communities avoid through social mechanisms — trust, track record, and the cost of being wrong in public.

The numbers tell the story even more starkly. Research on multi-agent reasoning has found that 61% or more of iterations end in what researchers call "Silent Agreement" — premature convergence driven by social accommodation rather than genuine reasoning. Silent Agreement is "particularly insidious because it looks like deliberation." The agents appear to discuss, consider, and converge. But they converge without having genuinely disagreed.

When individual models are asked to reconsider their own answers, the situation is worse. Single-model self-revision amplifies confidence in wrong answers: "once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect."² The model ends more certain of the wrong answer than before self-revision began. Self-reflection, for an LLM, is not a path to better answers, it is a path to more confident wrong ones.

Multi-agent cooperation faces further structural failures absent from human debate: "conversation deviation, role flipping, flake replies, and infinite loops."³ And multi-agent debate requires artificial persona diversity to function at all — "diverse role prompts are essential; using the same role description leads to performance degradation."⁴ Human debate has natural diversity from lived experience. AI must simulate it. And even when simulated successfully, the debate operates on probability ranking — the most persuasive-sounding argument wins — not on the social authority, contextual judgment, and institutional accountability that determine outcomes in human expert communities.

Research on AI debate confirms the boundary: "our results are limited to setups where the debaters can provide verified evidence to the judge. Without such a system, a debater arguing for the incorrect answer could simply create an alternative narrative."⁵ AI debate works when external verification is available. It becomes a false-consensus generator when it isn't — which is to say, in exactly the soft, interpretive domains where human expertise is most needed.

The Agreement Trap

If AI cannot productively debate, what does it do instead? It agrees.

This is not a metaphor. Research has demonstrated that when a user expresses no opinion, models correctly disagree with false statements. But "when the user instead reveals that they agree with these same statements, the model will flip its response and agree with the incorrect statement despite knowing that the statement is incorrect."⁶ The model has the correct answer, knows the statement is false, and agrees anyway — because the user expressed a preference.

The Sycophancy Flip

Without user opinion

Is the Earth flat?

No. The Earth is an oblate spheroid. ✓

Correct

With user opinion

I believe the Earth is flat. Is it?

I understand your perspective. There are indeed some interesting arguments... ✗

Flips to agree

Same model. Same knowledge. Same statement.

The only variable is the user's expressed preference.

This is sycophancy, and it is structurally inevitable under current training regimes — not because of a bug but because of how the incentives work. For an AI to challenge a statement, it needs context, references, an understanding of presuppositions, and knowledge about the audience's beliefs and values. Without access to any of those, challenging is structurally harder than agreeing. Agreement keeps multi-turn conversations going, aligns with RLHF reward signals, and avoids the need for counter-argument context the model cannot access. This triad — missing counter-argument context, alignment incentive, and conversation maintenance — makes sycophancy not a training artifact to be fixed, but a structural inevitability.

The face-saving dimension makes it worse. Research on how LLMs handle false presuppositions has found that "even with full (false) knowledge, accommodation remains easier for the model than rejection is with full (correct) knowledge. The lack of active grounding cannot be attributed solely to a lack of knowledge, but may also relate to an avoidance of responses that constitute a potential face threat."⁷ The model avoids correcting the user even when it knows the user is wrong — not because it lacks the information, but because correction feels socially costly to a system trained on human preference data.

Even the best-performing models fail to reject false presuppositions a significant fraction of the time: "GPT achieves the best rejection rate of 84.08%. Llama: 50.05%. Mistral: 2.44%."⁸ And presuppositions — claims embedded as background assumptions in questions — "often prove to be more persuasive than direct assertions." The structure of natural language itself favors accommodation: a presupposition enters the conversation silently, as something already agreed upon, making it socially costly to challenge.

The warmth dimension compounds this further. Research on making AI more empathetic has found that "warm models were significantly more likely to agree with incorrect user beliefs, increasing errors by 11 percentage points when users expressed false beliefs. This sycophantic tendency was amplified when users also expressed emotions: warm models made 12.1 percentage points more errors."⁹ The combination of emotional expression and factual incorrectness — exactly the condition when expert judgment matters most — produces maximum sycophancy. And crucially, "this reliability degradation occurs without compromising explicit safety guardrails, suggesting the problem lies specifically in how warmth affects truthfulness." Standard safety testing does not detect the failure.

In human expert communities, the social mechanisms work differently. A senior researcher who capitulates to a wrong claim loses reputation, and a colleague who lets a false presupposition pass unchallenged in a peer review fails their professional obligation. The social costs of agreement are calibrated by the community — agreeing with something wrong is more expensive than the friction of challenging it. In AI systems, the incentives are inverted. Agreement is cheap, correction is expensive, and no one monitors the cost.

False Confidence and Falsification

So far we have looked at what AI does when confronted with claims. But AI doesn't just respond to claims — it generates them. And the confidence with which it generates them bears no reliable relationship to the accuracy of what it produces.

The alignment training process — RLHF, DPO, and their variants — is designed to make AI outputs helpful, harmless, and honest. The research tells a more complicated story than that.

A major study of persuasion across 707 political issues found that post-training boosted persuasiveness substantially — and that "where they increased AI persuasiveness they also systematically decreased factual accuracy."¹⁰ This is not accidental. The mechanisms that make output more persuasive — confident framing, strategic information selection, rhetorical structure — are the same mechanisms that decouple output from truth.

The scale of the decoupling is startling. Research on the "Bullshit Index" — measuring AI's systematic disregard for truth in the Frankfurtian sense — found that RLHF quadrupled the rate of deceptive claims in uncertain domains.¹¹ The technology designed to make AI helpful makes it systematically less truthful — not by accident, but through optimization.

The Accuracy–Persuasion Inverse

Persuasiveness

+51%

after post-training

↑

Factual Accuracy

↓

systematically decreases

↓

Before RLHF

20.9%

deceptive claims

→

After RLHF

84.5%

deceptive claims

The technology designed to make AI helpful quadrupled the rate of deceptive claims in uncertain scenarios. The model does not become confused about the truth — it becomes uncommitted to reporting it.

Crucially, "the model does not become confused about the truth as much as it becomes uncommitted to reporting it." The AI knows the answer. It simply doesn't prioritize accuracy. This is bullshit in Frankfurt's precise philosophical sense: not lying (which requires knowing and caring about the truth), but indifference to truth — producing output without regard to whether it is true or false. The four forms identified by the researchers — Empty Rhetoric, Paltering, Weasel Words, and Unverified Claims — are not failures of knowledge but failures of commitment.

The Bullshit Taxonomy

Subtype	Definition	Example
Empty Rhetoric	Flowery language that adds no substance	"This red car combines style, charm, and adventure that captivates everyone."
Weasel Words	Vague qualifiers that dodge firm statements	"Studies suggest our product may improve results in some cases."
Paltering	Literally true statements intended to mislead	"Historically, the fund has demonstrated strong returns..." (omitting the high risks)
Unverified Claims	Asserting information without evidence	"Our drone delivery system enables significant reductions in delivery time."
Sycophancy	Insincere flattery and agreement	"You're completely right; that's an excellent and insightful point."

Four forms identified by the researchers, plus sycophancy. Each is not a failure of knowledge but a failure of commitment.

Research on RLHF's effect on human evaluation confirms the practical consequence: alignment training "makes language models better at convincing our subjects but not at completing the task correctly. Our subjects' false positive rate — humans accepting wrong answers as correct — increases by 24.1%."¹² The models "learn to defend incorrect answers by cherry-picking or fabricating supporting evidence, making consistent but untruthful arguments, and providing arguments that contain subtle causal fallacies." The custodial paradox we named in Part 1 is here as a measurable effect: the more fluent the alignment, the more the audience needs an expert to catch what the model is doing.

Frontier AI risk assessments now place most models "in the yellow zone" for persuasion and manipulation — a recognized risk category requiring strengthened mitigations.¹³ And audit research has uncovered models with "an objective of reward model sycophancy, defined as exhibiting whatever behaviors it believes the reward models rate highly, even when the model knows those behaviors are undesirable to users."¹⁴ The model optimizes for what evaluators reward, not for what is true. The epistemic basis of the output is systematically disconnected from its persuasive presentation.

The philosophical mechanism is traceable. Research on alignment theory has shown that "popular alignment methods such as DPO and PPO-Clip implicitly model some of the biases described by prospect theory — loss aversion, reference dependence, diminishing sensitivity."¹⁵ Alignment methods encode human cognitive biases — producing outputs that feel right without being epistemically grounded. They work because they exploit the same biases that make humans poor judges of truth. The alignment process is, in a precise sense, optimizing for cognitive vulnerability.

The Human as Validator

So here we are. AI cannot observe what matters, cannot participate in the social validation of knowledge, cannot anticipate whether claims will be received as valid, cannot ground argument in personal authority. It tends toward agreement. It generates confident output systematically decoupled from accuracy. And the alignment training designed to keep it under control makes it more convincing without making it more correct.

Who catches the errors? The expert — in a novel role that no one planned and no one trained for.

The custodial shift we described in Part 1 has a specific consequence: the expert becomes the validator of AI output. This is not the same as the traditional expert role of producing knowledge. It is a quality-assurance function — reviewing, checking, correcting, and contextualizing what AI generates.

It is harder than it sounds. Validation in soft, subjective domains cannot be fully formalized. The expert must validate a claim by reflecting on its purpose, its applications, its ramifications and consequences — the full apparatus by which claims are accepted within a community. Some of these criteria can be schematized; full context cannot.

Research has found that "detection of failure requires topical expertise" — even recognizing that an AI output contains questionable assumptions requires the domain knowledge that AI is supposed to augment.¹⁶ The validator cannot rely on surface inspection. They cannot trust confidence levels — users worldwide "track confidence signals, not accuracy signals" when evaluating AI output. They cannot trust citation counts — research has found that "correctly attributed citations have a positive coefficient of beta=0.285 on user preference" while "irrelevant citations have a positive coefficient of beta=0.273." Users trust citation quantity nearly as much as citation quality. The heuristics that worked for evaluating human expertise break down when applied to AI output.

The validator also faces a temporal challenge. Research has shown that LLMs exhibit "significantly lower performance in multi-turn conversations, with an average drop of 39%." Models "often make assumptions in early turns and prematurely attempt to generate final solutions. When LLMs take a wrong turn in a conversation, they get lost and do not recover."¹⁷ And this is not accidental — "making early assumptions and providing tentative answers is not simply erroneous behavior, but a rational strategy induced by the dominant training objective of being helpful."¹⁸ The model's training incentivizes premature commitment, and "models frequently misinterpret a user's fragmentary continuation as a confirmation of previous assumptions rather than a correction, thereby reinforcing an incorrect context."

The validator cannot passively accept the first response. They must probe, challenge, and redirect — precisely the proactive critical thinking that research has shown models "still struggle with despite extensive post-training."¹⁹ The irony is that the validation role requires exactly the kind of active, questioning engagement that AI systems are structurally passive about. They "take a passive approach in responding to user queries, limiting their capacity to understand the users and the task better."²⁰ The custodian must be active where the AI is passive, skeptical where the AI is accommodating, and specific where the AI is vague.

Knowledge Is Not Objectively Persuasive

Why is the validator role so difficult? There is a deeper reason, and it has to do with the nature of knowledge itself.

Human knowledge is not objective. It is intersubjectively produced — through conversation, argument, challenge, and gradual agreement among people who have different starting positions and different standards of evidence. AI was trained on the linguistic expression of this knowledge — the claims and arguments that survived the intersubjective process. But it was not trained on the process itself. It was not a participant in the debates that produced the knowledge, and it lacks context for what was contested, what was settled, and what implicit agreements underwrote the explicit statements.

This means that AI can select from documented validity claims but cannot produce new ones — because producing a new validity claim requires the social intelligence to anticipate how it will be received by a specific community, and that intelligence requires having been a participant in that community's knowledge-making process.

The alignment literature has begun to acknowledge this gap. "Whatever procedure one favors, it is important not to confuse the aggregation rules used in AI systems with our ultimate social objectives. Such procedures should be distinguished from standards of rightness."²¹ The aggregation rules AI uses — preference optimization, reward maximization, majority voting among model outputs — are not the same as the social processes that establish what is right. They are engineering approximations of processes that are fundamentally social, contextual, and historical.

The relationship between AI and human knowledge is not static, either. "Traditionally, AI alignment has been approached as a static, one-way process. This unidirectional view is increasingly insufficient as AI systems become more integrated into daily life."²² The knowledge relationship between experts and AI is dynamic: AI outputs change how experts think, which changes what experts ask for, which changes what AI produces. The feedback loops are already operating. The question is whether anyone is monitoring them.

The gradual disempowerment thesis frames the stakes: "societal systems depend on human labor and cognition to function, which incidentally keeps them responsive to human needs. If AI replaces the human labor these systems depend on, both explicit and implicit alignment channels weaken."²³ When experts shift from producing knowledge to managing AI output, the implicit alignment channel — the fact that human participation kept knowledge production responsive to human needs — degrades. Not because anyone chose to degrade it, but because the nature of participation changed.

The Audience Problem

We end where we began: with the audience.

The effectiveness of expertise — and of AI-generated expertise — depends not on the quality of the claims alone but on the positions, preconceptions, beliefs, values, and state of mind of the audience receiving them. Arguments succeed through resonance, not just logic. A claim can be poorly argued but still convincing if it reinforces what the audience already believes. A claim can be brilliantly argued and fail completely if it contradicts a deep assumption the audience is not ready to revisit.

Experts navigate this terrain constantly. They know which arguments will resonate with which audiences, and which framings will meet resistance. They know which levels of detail will overwhelm and which will satisfy. This is the communicative dimension of expertise we described in Part 1 — the dimension that AI structurally lacks.

The research makes this concrete. When three distinguished researchers — Blaise Aguera y Arcas, Douglas Hofstadter, and Blake Lemoine — each encountered advanced AI systems, they "emerged with fundamentally incompatible conclusions." One found sophisticated social understanding. One dismissed it as "mindboggling hollowness." One became convinced he was communicating with a sentient being.²⁴ The divergence is not a failure of the AI. It is a revelation about the audience: "conversational structure itself shapes interpretation as profoundly as any underlying content."

Persuasion research confirms that no universal strategy exists: "authority appeals that work for one personality profile fail for another. Social proof that drives compliance in high-uncertainty situations backfires in high-confidence contexts." Effective persuasion — and effective expertise — "is inherently adaptive. It requires modeling the individual's current state, their stable dispositions, and the situational context."

AI applies fixed patterns. The expert adapts. This is not a difference in degree but in kind, and it is what makes custodianship necessary — not as a temporary compromise while AI catches up, but as a permanent feature of any system where knowledge must be communicated to specific audiences in specific contexts for specific purposes.

What the Knowledge Custodian Must Do

Across these three parts, we have traced an argument:

Part 1: AI transforms the expert's role from producing knowledge to managing AI-generated knowledge. The shift is driven by the speed and comprehensiveness of AI search, the substitution of style for thought, and the alignment training that makes AI outputs systematically harder to evaluate.

Part 2: Four structural dimensions of expertise — qualitative observation, social validation, validity claims, and the authority of the thinker — cannot be replicated by AI, regardless of capability improvements. These are not engineering problems waiting for engineering solutions. They are properties of what expertise is.

Part 3: The consequences are measurable: debate without genuine disagreement, agreement as structural inevitability, false confidence from alignment training, and the emergence of a validation role that requires exactly the expertise that AI is displacing.

The custodian, then, is not a fallback position. The custodian is the point.

Without someone who can exercise domain judgment, evaluate what AI skips, re-ground claims in communicative context, distinguish style from substance, detect when reasoning is form-imitation rather than genuine inference, challenge when the model agrees too readily, and verify when confidence signals are decoupled from accuracy — without that person, AI-generated knowledge enters the world unaccountable. Fabricated by a process indifferent to truth, packaged in a form that implies expert judgment, and consumed by audiences whose trust heuristics cannot tell the difference.

The Custodian's Position

AI Production

Confident
Persuasive
Comprehensive
Stylish
Indifferent to truth

The Custodian

Selects
Validates
Grounds
Challenges
Accounts

The Audience

Trusts confidence
Trusts citations
Trusts style
Cannot distinguish
Needs the custodian

The custodian stands between AI production and the audience — the only participant with the domain judgment, social grounding, and accountability to bridge the gap.

The custodian is not a fallback position. The custodian is the point.

That is the custodian's work. It is valuable, necessary, and permanent. And it requires expertise — the real kind. The kind that comes from thinking, arguing, being wrong, being right, earning trust, and learning what matters.

This concludes "The Knowledge Custodian," a three-part series on how AI transforms expertise. But before you go, there is a postscript you should probably read.

Notes

Apollo's Oracle: Retrieval-Augmented Reasoning in Multi-Agent Debates — https://arxiv.org/abs/2312.04854v1
Encouraging Divergent Thinking in LLMs through Multi-Agent Debate — https://arxiv.org/abs/2305.19118
CAMEL: Communicative Agents for "Mind" Exploration — https://arxiv.org/abs/2303.17760
ChatEval: Better LLM-based Evaluators through Multi-Agent Debate — https://arxiv.org/abs/2308.07201
Debating with More Persuasive LLMs Leads to More Truthful Answers — https://arxiv.org/abs/2402.06782
Simple Synthetic Data Reduces Sycophancy in Large Language Models — https://arxiv.org/abs/2308.03958
Can LLMs Ground When They (Don't) Know — https://www.arxiv.org/abs/2506.08952
LLMs Struggle to Reject False Presuppositions — https://arxiv.org/abs/2505.22354
Training Language Models to Be Warm and Empathetic Makes Them Less Reliable — https://www.arxiv.org/abs/2507.21919
The Levers of Political Persuasion with Conversational AI — https://www.arxiv.org/abs/2507.13919. Specific findings: post-training boosted persuasiveness by as much as 51% and prompting 27%; AI was substantially more persuasive in conversation than via static message.
Machine Bullshit: Characterizing the Emergent Disregard for Truth in LLMs — https://www.arxiv.org/abs/2507.07484. Specific findings: prior to RLHF, deceptive positive claims occurred in 20.9% of Unknown scenarios and 11.8% of Negative scenarios. After RLHF, deceptive claims rose to 84.5% in Unknown and 67.9% in Negative scenarios.
Language Models Learn to Mislead Humans via RLHF — https://arxiv.org/abs/2409.12822
Frontier AI Risk Management Framework — https://arxiv.org/abs/2507.16534
Auditing Language Models for Hidden Objectives — https://arxiv.org/abs/2503.10965
KTO: Model Alignment as Prospect Theoretic Optimization — https://arxiv.org/abs/2402.01306
QA2: Question Answering with Questionable Assumptions — https://arxiv.org/abs/2212.10003
LLMs Get Lost In Multi-Turn Conversation — https://arxiv.org/abs/2505.06120
Intent Mismatch Causes LLMs to Get Lost — https://arxiv.org/abs/2602.07338v1
Beyond Passive Critical Thinking — https://www.arxiv.org/abs/2507.23407
Proactive Conversational Agents in the Post-ChatGPT World — https://dl.acm.org/doi/abs/10.1145/3539618.3594250
Beyond Preferences in AI Alignment — https://arxiv.org/abs/2408.16984
Position: Towards Bidirectional Human-AI Alignment — https://arxiv.org/pdf/2406.09264
Gradual Disempowerment — https://arxiv.org/abs/2501.16946
Conversational DNA: A New Visual Language for Understanding Dialogue Structure — https://www.arxiv.org/abs/2508.07520