Do LLMs think in English?

A quick interpretability project at ARBOx 2024

15.01.2025

What kind of processing do LLMs do when translating between German and Chinese? You might assume it translates directly, but LLMs appear to use English as an internal "pivot language" for reasoning, even when English isn't involved in the task.

We designed an intervention experiment using the Llama model:

Give the model translation tasks between non-English language pairs (e.g., French → Chinese)

Français: "vertu" - 中文: "德"
Français: "siège" - 中文: "座"
Français: "neige" - 中文: "雪"
Français: "montagne" - 中文: "山"
Français: "fleur" - 中文: "

'Intervene' in the model's internal representations by removing the English concept vectors. Specifically:
1. We tokenize the English word to get its token IDs
2. We get the embedding vectors for those tokens via the model's embedding matrix (this gives us a subspace representing "cat" in English)
3. At each transformer layer, we take the residual stream (that high-dimensional vector carrying all the model's intermediate computations)
4. We project the residual stream onto our English word subspace
5. We subtract that projection from the residual stream

Conceptually, we're removing the idea of the English word from the model's intermediate computations. It's like telling the model "you can think about anything EXCEPT the English word 'cat'" while it's trying to translate between French and Chinese.

Translation Tasks

When we removed English concept vectors from translation tasks between non-English languages, performance dropped significantly. The effect was most pronounced when intervening with English (30-40% accuracy drop), while intervening with unrelated words had minimal impact. This suggests the model genuinely relies on English representations as an intermediate step.

This could make sense for translation—like, maybe German → Chinese is internally processed as German → English → Chinese, because the LLM is better at translating to-and-from English. This experiment is covered in depth in the original paper.

Repetition Tasks

We built on the original results by testing a repetition task where the model just needs to repeat words twice:

chat chat
chien chien  
oiseau oiseau
renard ___?

We'd expect this to be independent of the language: no semantic understanding required. However, removing English vectors caused dramatic performance drops (from ~80% to ~40% accuracy for non-English languages). The model needs to think about the English word "fox" to repeat "renard renard"? This is strange: we'd expect "repetition" to be a simple circuit having nothing to do with the language the words are in.

Cloze Tasks

We also quickly tried the experiment on fill-in-the-blank "cloze" tasks, but the model wasn't very good at it to begin with (~1.25% accuracy) which went to 0 with English word interventions. I wonder why cloze is so difficult.

Conclusion

This experiment suggests that LLMs trained on multilingual data don't truly process each language independently. Instead, they appear to use English as a universal internal representation—a kind of "machine mentalese." This has several implications:

While using a pivot language might be computationally efficient, it raises questions about whether current LLMs truly understand non-English languages or merely translate everything through an English lens.
This English-centric processing likely reflects the predominance of English in training datasets, potentially limiting models' ability to capture concepts that don't translate well into English.
Understanding this pivot mechanism could inform better multilingual model architectures that either embrace this approach explicitly or develop true parallel processing for different languages.