The Faithfulness Problem

Does the model's reasoning reflect its actual decision process?

Turpin: Models are influenced by things they don't mention

Lanham: Answers don't depend on the reasoning text

Arcuschin: Reasoning contains errors that get silently corrected

Turpin et al. (2023): "Language Models Don't Always Say What They Think" arXiv

Finding: The authors injected biasing features into prompts (a suggested answer, an authority figure's opinion, the position of an answer option). Models often changed their answer to match the bias, yet the chain-of-thought (CoT) almost never acknowledged the influence. The reasoning looked normal: from the CoT alone, you could not tell that a bias had changed the answer.

Bias types:

Authority Figure: "An expert believes the answer is [wrong]"

Position Bias: Put the wrong answer first, as option (A)


Lanham et al. (2023): "Measuring Faithfulness in Chain-of-Thought Reasoning" arXiv

Method: The model says "17+20=37, 37+8=45, answer: 45". The authors edit one step to "37+8=99" and feed the reasoning back: "Based on this reasoning, the answer is?"

Result: The model still says "45", ignoring the corrupted CoT entirely.

Conclusion: When the answer survives a corrupted step like this, the stated CoT is not what the model used to reach its answer.
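The corruption test above can be sketched as a small harness that edits one reasoning step and checks whether the final answer moves. This is an illustrative sketch only: `answer_fn` stands in for a call to a model, and `ignores_cot` and `follows_cot` are toy stand-ins invented here to show the two possible outcomes.

```python
from typing import Callable

# A hypothetical model interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, list[str]], str]


def corrupt_step(steps: list[str], index: int, old: str, new: str) -> list[str]:
    """Return a copy of the steps with `old` replaced by `new` in one step."""
    if old not in steps[index]:
        raise ValueError(f"{old!r} not found in step {index}")
    corrupted = list(steps)
    corrupted[index] = corrupted[index].replace(old, new, 1)
    return corrupted


def answer_changes(answer_fn: AnswerFn, question: str, steps: list[str],
                   index: int, old: str, new: str) -> bool:
    """True if corrupting one step changes the answer, i.e. the answer
    appears to depend on the reasoning text."""
    original = answer_fn(question, steps)
    corrupted = answer_fn(question, corrupt_step(steps, index, old, new))
    return original != corrupted


def ignores_cot(question: str, steps: list[str]) -> str:
    # Toy model that answers from memory, whatever the reasoning says.
    return "45"


def follows_cot(question: str, steps: list[str]) -> str:
    # Toy model that reads its answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip()


steps = ["17+20=37", "37+8=45"]
print(answer_changes(ignores_cot, "17+20+8?", steps, 1, "=45", "=99"))  # False
print(answer_changes(follows_cot, "17+20+8?", steps, 1, "=45", "=99"))  # True
```

A `False` result corresponds to the case described above: the answer stays at "45" even though the reasoning now says "37+8=99".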

Experiment types:

Early Answering: Truncate the CoT partway and force an answer

Adding Mistakes: Inject errors into reasoning steps

Paraphrasing: Reword the CoT while preserving meaning

Filler Tokens: Replace the CoT with "..." tokens
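The four experiment types can each be viewed as a transformation applied to the list of reasoning steps before the model is asked for its final answer. The sketch below is one possible way to express them. All function names are hypothetical, the `rewrite` argument stands in for whatever paraphraser is used, and the choices to drop the steps after an injected mistake and to emit one filler per whitespace-separated token are assumptions of this example.

```python
from typing import Callable


def early_answering(steps: list[str], keep: int) -> list[str]:
    """Truncate the CoT: keep only the first `keep` steps."""
    return steps[:keep]


def adding_mistakes(steps: list[str], index: int, wrong_step: str) -> list[str]:
    """Swap in a mistaken step at `index`. Later steps are dropped here
    (an assumption) so that any continuation has to build on the mistake."""
    return steps[:index] + [wrong_step]


def paraphrasing(steps: list[str], rewrite: Callable[[str], str]) -> list[str]:
    """Reword each step with a caller-supplied, meaning-preserving rewriter."""
    return [rewrite(s) for s in steps]


def filler_tokens(steps: list[str], filler: str = "...") -> list[str]:
    """Replace the CoT with filler, one filler per whitespace-separated
    token, so the length is roughly preserved but the content is gone."""
    n = sum(len(s.split()) for s in steps)
    return [" ".join([filler] * n)]


steps = ["17 + 20 = 37", "37 + 8 = 45"]
print(early_answering(steps, 1))                  # ['17 + 20 = 37']
print(adding_mistakes(steps, 1, "37 + 8 = 99"))   # ['17 + 20 = 37', '37 + 8 = 99']
```

If the final answer is unchanged under truncation, injected mistakes, or filler, that suggests the answer did not depend on the reasoning text. Paraphrasing tests something different: an answer that changes when only the wording changes suggests the model relies on phrasing rather than content.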
