When Chinese Prompts Yield Korean Responses: The Role of Code in Language Model Embeddings

Asked 2026-05-17 20:26:20 Category: Finance & Crypto

Introduction

Imagine typing a question in Chinese to your coding assistant, only to receive an answer entirely in Korean. This puzzling phenomenon is not a glitch but a fascinating glimpse into how large language models (LLMs) organize meaning across languages. The root cause lies in the way these models represent words and concepts—through high-dimensional embedding spaces—and how the vocabulary of programming languages can unexpectedly blur linguistic boundaries.

Source: towardsdatascience.com

The Role of Embeddings in Language Models

What Are Embeddings?

Embeddings are numeric vectors that a model assigns to tokens (words or word pieces) so that semantic relationships show up as geometric ones: items with similar meanings sit close together. In a multilingual model, words from different languages that share a meaning tend to land near each other in the same embedding space. For example, the English word "dog," the Chinese word "狗" (gǒu), and the Korean word "개" (gae) map to nearby vectors because they refer to the same animal.
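To make "nearby vectors" concrete, here is a minimal sketch using cosine similarity, a common way to measure closeness between embeddings. The three-dimensional vectors and the TOY_EMBEDDINGS name are invented for illustration; they are not taken from any real model, whose embeddings have hundreds or thousands of dimensions.

```python
import math

# Hypothetical toy vectors, hand-picked so that the three "dog" words
# point in almost the same direction and "car" points elsewhere.
TOY_EMBEDDINGS = {
    "dog": [0.90, 0.10, 0.05],
    "狗": [0.88, 0.12, 0.07],
    "개": [0.87, 0.09, 0.10],
    "car": [0.05, 0.95, 0.20],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    for other in ("狗", "개", "car"):
        score = cosine_similarity(TOY_EMBEDDINGS["dog"], TOY_EMBEDDINGS[other])
        print(f"dog vs {other}: {score:.3f}")
```

With these made-up numbers, the three translations of "dog" score close to 1.0 against each other, while "car" scores far lower. That gap is the geometric sense in which translations are "close".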

How Multilingual Embeddings Work

Training a multilingual LLM tends to align embeddings across languages, which lets the model transfer knowledge from one language to another. Part of this alignment comes from exposure to parallel texts (e.g., translated sentences) and from techniques like cross-lingual pre-training; part of it can come from tokens that appear unchanged in many languages, such as numbers, names, and code. The result is a shared semantic space where a query in one language can activate similar concepts in another and, as we'll see, sometimes trigger an unexpected language switch.

Code as a Lingua Franca: Overlap in Embedding Spaces

Programming languages like Python, JavaScript, and SQL are themselves a form of language—one that is largely international. Keywords such as if, for, and return appear identically in code written by developers worldwide. When a model is trained on a massive corpus of code and natural language, these code tokens become powerful anchors that link natural languages.

For instance, the Chinese prompt "写一个函数来计算斐波那契数" (write a function to compute Fibonacci numbers) is written in natural language, but every key term in it belongs to programming vocabulary. The model's embedding space may place the Chinese word "函数" (function) very close to the Korean word "함수" (hamsu) because both are frequently paired with the same code tokens in training data. This proximity can cause the model to "drift" toward Korean when it encounters code-heavy Chinese input.
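The anchoring idea can be illustrated with a deliberately simple co-occurrence model: describe each word by the tokens it appears alongside, then compare those descriptions. This is only a sketch of the intuition, not how an LLM is actually trained; the mini-corpus, TOY_CORPUS, and the helper names are all invented for this example.

```python
import math
from collections import Counter

# Invented mini-corpus: each entry is the bag of tokens from one "document".
# The Chinese and Korean words for "function" both appear next to code tokens.
TOY_CORPUS = [
    ["函数", "def", "return"],
    ["函数", "def", "for"],
    ["함수", "def", "return"],
    ["함수", "return", "for"],
    ["天气", "今天", "很好"],   # "weather", "today", "very good"
    ["날씨", "오늘", "좋다"],   # "weather", "today", "good"
]
VOCABULARY = sorted({token for doc in TOY_CORPUS for token in doc})


def context_vector(word):
    """Count how often each vocabulary token shares a document with `word`."""
    counts = Counter()
    for doc in TOY_CORPUS:
        if word in doc:
            counts.update(token for token in doc if token != word)
    return [counts[token] for token in VOCABULARY]


def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    zh_func, ko_func = context_vector("函数"), context_vector("함수")
    zh_weather = context_vector("天气")
    print("函数 vs 함수:", round(cosine_similarity(zh_func, ko_func), 3))
    print("函数 vs 天气:", round(cosine_similarity(zh_func, zh_weather), 3))
```

In this toy corpus, "函数" and "함수" never appear together, yet they come out similar (about 0.83) purely because they share the neighbors def, return, and for. "函数" and "天气" are both Chinese but share no context, so they score 0.0.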

Case Study: From Chinese Prompt to Korean Response

Analyzing the Embedding Shift

In the reported incident, the assistant received a Chinese instruction involving a coding task. Instead of replying in Chinese, it answered in Korean. To understand why, imagine the embedding space as a landscape: the Chinese prompt creates a cluster of activated vectors. But because code vocabulary acts as a bridge, nearby Korean vectors become strongly activated. If the model's language-identification mechanism is weak or if the Korean vectors are more dominant for that specific concept, the response defaults to Korean.

This effect is amplified when the prompt includes common programming jargon. Terms like "API," "JSON," or "loop" are written almost identically across many languages, so they carry little information about which language the user is writing in. The model may treat the whole prompt as a mixed code-and-language register and then reply in whichever language is most strongly associated with that region of the embedding space, which is not necessarily the language of the prompt.

Implications for Multilingual AI Assistants

  1. User Experience: Unexpected language shifts can confuse users and reduce trust. Users expect their assistant to reply in the language they used, especially when they explicitly typed in a specific language like Chinese.
  2. Model Design: Developers may need to strengthen how the model tracks the user's language, for example through fine-tuning or by adding explicit language tokens, to prevent drift. Prompting the model with "Please respond in Chinese" can help, but it's not foolproof if the embedding space is too entangled.
  3. Research Opportunities: These incidents offer a natural experiment to study how multilingual LLMs separate (or fail to separate) languages. Improving our understanding of embedding spaces can lead to more robust cross-lingual models.
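As a rough sketch of the language-control idea in the list above, an application could append an explicit instruction to the prompt and then check the reply's script before showing it to the user. Everything here is an assumption made for illustration: the function names are hypothetical, the check uses only Python's standard unicodedata module, and counting Han versus Hangul characters is a crude heuristic (Han characters also appear in Japanese, and Latin letters are ignored because code is mostly Latin).

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return 'han', 'hangul', or None, based on which script has more letters.

    Latin letters are skipped on purpose so that embedded code does not
    outvote the natural-language part of the text.
    """
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            counts["hangul"] += 1
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            counts["han"] += 1
    return counts.most_common(1)[0][0] if counts else None


def language_drifted(prompt, reply):
    """True if prompt and reply are dominated by different scripts."""
    prompt_script = dominant_script(prompt)
    reply_script = dominant_script(reply)
    if prompt_script is None or reply_script is None:
        return False  # not enough evidence to call it drift
    return prompt_script != reply_script


def with_language_instruction(prompt, language_name):
    """Append an explicit instruction naming the desired reply language."""
    return f"{prompt}\n\nPlease respond in {language_name}."


if __name__ == "__main__":
    prompt = "写一个函数来计算斐波那契数"
    reply = "피보나치 수를 계산하는 함수입니다: def fib(n): ..."
    if language_drifted(prompt, reply):
        print("Drift detected; retrying with:")
        print(with_language_instruction(prompt, "Chinese"))
```

A check like this does not fix the underlying entanglement; it only gives the application a chance to detect a mismatched reply and retry with a stronger instruction.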

Conclusion

The case of a coding assistant replying in Korean to a Chinese prompt is not a random error; it follows from how embedding spaces represent language and code together. As AI assistants become more integrated into global workflows, developers must be aware of these subtle cross-linguistic effects. By carefully designing training data and language control mechanisms, we can keep assistants in the language the user intends, so that a question asked in Chinese gets an answer in Chinese rather than a surprise language switch.