A quick interpretability project at ARBOx 2024
15.01.2025
What kind of processing do LLMs do when translating between, say, German and Chinese? You might assume they translate directly, but LLMs appear to use English as an internal "pivot language" for reasoning, even when English isn't involved in the task.
We designed an intervention experiment using the Llama model:
Français: "vertu" - 中文: "德"
Français: "siège" - 中文: "座"
Français: "neige" - 中文: "雪"
Français: "montagne" - 中文: "山"
Français: "fleur" - 中文: "
Conceptually, we're removing the idea of the English word from the model's intermediate computations. It's like telling the model "you can think about anything EXCEPT the English word 'cat'" while it's trying to translate between French and Chinese.
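One way to implement this, sketched below, is to approximate the English word's "concept" by its unembedding direction and project the hidden states away from that direction at a band of middle layers while the prompt is processed. The layer range, the use of lm_head, and taking only the first token of the English word are all assumptions for illustration (continuing from the snippet above); this is not necessarily the exact intervention we ran.

```python
def ablated_prediction(prompt_text, english_word, layers=range(16, 28)):
    """Top-1 next-token prediction with `english_word`'s direction projected
    out of the hidden states at the selected decoder layers."""
    # Direction of the English word in the residual stream, approximated by
    # the unembedding vector of its first token.
    token_id = tokenizer.encode(english_word, add_special_tokens=False)[0]
    direction = model.lm_head.weight[token_id].detach().float()
    direction = direction / direction.norm()

    def remove_direction(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Subtract each position's component along `direction`.
        coeff = (hidden.float() @ direction).unsqueeze(-1)
        hidden = hidden - (coeff * direction).to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handles = [model.model.layers[i].register_forward_hook(remove_direction)
               for i in layers]
    try:
        ids = tokenizer(prompt_text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids).logits
    finally:
        for handle in handles:
            handle.remove()
    return tokenizer.decode(out[0, -1].argmax().item())
```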
When we removed English concept vectors during translation tasks between non-English languages, performance dropped significantly. The effect was most pronounced when we ablated the English translation of the target word (a 30-40% accuracy drop), while ablating unrelated English words had minimal impact. This suggests the model genuinely relies on English representations as an intermediate step.
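With the hypothetical helper above, the comparison behind these numbers looks roughly like this (the control word is arbitrary):

```python
# Ablate the actual English translation vs. an unrelated control word.
print(ablated_prediction(prompt, "flower"))  # English translation of "fleur": this is what hurts
print(ablated_prediction(prompt, "pencil"))  # unrelated control word: little effect
```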
This could make sense for translation: maybe German → Chinese is internally processed as German → English → Chinese, because the LLM is better at translating to and from English. This experiment is covered in depth in the original paper.
We built on the original results by testing a repetition task where the model just needs to repeat words twice:
chat chat
chien chien
oiseau oiseau
renard ___?
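The same kind of check, sketched with the hypothetical helper from above (prompt wording and control word are illustrative, not the exact setup):

```python
# Repetition prompt: the model only has to copy "renard", so naively no
# English should be involved at all.
repeat_prompt = "chat chat\nchien chien\noiseau oiseau\nrenard"
rep_inputs = tokenizer(repeat_prompt, return_tensors="pt")
with torch.no_grad():
    baseline = model(**rep_inputs).logits[0, -1].argmax().item()
print(tokenizer.decode(baseline))                 # expect " renard" (or its first token)
print(ablated_prediction(repeat_prompt, "fox"))   # ablate the English meaning of "renard"
print(ablated_prediction(repeat_prompt, "zebra")) # unrelated control word
```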
We'd expect this to be independent of the language: no semantic understanding required. However, removing English vectors caused dramatic performance drops (from ~80% to ~40% accuracy for non-English languages). The model needs to think about the English word "fox" to repeat "renard renard"? This is strange: we'd expect "repetition" to be a simple circuit having nothing to do with the language the words are in.
We also quickly tried the experiment on fill-in-the-blank "cloze" tasks, but the model wasn't very good at them to begin with (~1.25% accuracy), and this dropped to 0% with English-word interventions. I wonder why cloze is so difficult.
This experiment suggests that LLMs trained on multilingual data don't truly process each language independently. Instead, they appear to use English as a universal internal representation, a kind of "machine mentalese."