10.27.2025
Rajan Agarwal*, Aarush Gupta*
* Core Contributor; Correspondence to r34agarw@uwaterloo.ca, hiarrushgupta@gmail.com
Instruction-tuned Large Language Models (LLMs) underperform on low-resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute-efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder's latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged Q&A evaluations. We further find that the improvements can be attributed to reduced tokenization inflation and stronger cross-lingual alignment, despite residual weaknesses in numeric fidelity. Treating low-resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
Khmer text is a worst-case scenario for English-heavy byte pair encoders. The same sentence tokenizes to about 16 tokens in English, 35 in Latin transliteration, and 104 in native Khmer with the Llama-3.2 vocabulary. That six-fold inflation blows through context windows, inflates the quadratic attention cost before the model even reaches the instruction, and starves the decoder of signal from later tokens. Parameter-efficient fine-tuning techniques such as LoRA inherit the problem because they still run on the fragmented token stream, teaching the model to cope with junk tokens rather than eliminating them.
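The inflation is easy to reproduce. A minimal sketch, assuming the Hugging Face transformers tokenizer for Llama-3.2 (the model id and example sentence are illustrative, not the paper's exact setup):

```python
# Count tokens produced by the Llama-3.2 vocabulary for parallel sentences.
# Assumption: access to the (gated) meta-llama tokenizer; any Llama-3.2 checkpoint works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

samples = {
    "english": "The committee will meet at the school on Monday morning.",
    "khmer": "...",  # the same sentence in native Khmer script (placeholder)
}

for name, text in samples.items():
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    print(f"{name:>8}: {len(text):>3} chars -> {n_tokens:>3} tokens")
```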
LLINK shifts the heavy lifting to a compact encoder and a handful of soft slots. Instead of forcing the decoder to untangle dozens of rare-script fragments, the approach injects a semantic summary that already lives in the decoder's hidden space. The decoder stays frozen, the tokenizer stays untouched, the prompt stays short, and Khmer prompts drop from triple-digit token counts to eight learned slots. In the ParaCrawl evaluation this reduction alone accounts for the majority of the cross-lingual retrieval gains seen after Stage A.
A frozen XLM-R encoder produces sentence embeddings that are mean pooled and sent through a small projector (768 to 3072 to 2048 with GELU, dropout 0.10, and LayerNorm). The target is the decoder's hidden state at a reserved slot inside a prompt template that appends a placeholder token to the user instruction. Symmetric InfoNCE with in-batch negatives and a queue of 32768 teacher vectors handles alignment, mining the 256 hardest negatives per step so the projector learns to ignore look-alike contexts. Lightweight direction and log-norm penalties keep the projected vector close in both angle and magnitude. No decoder or tokenizer weights move, so the decoder experiences the injected vector as another internally generated state.
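A minimal PyTorch sketch of the Stage A pieces, assuming standard torch modules; the temperature and loss weights are assumptions, while the 768 to 3072 to 2048 shape, GELU, dropout 0.10, LayerNorm, symmetric InfoNCE, and the direction and log-norm penalties follow the description above (the 32768-vector queue and hard-negative mining are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps mean-pooled XLM-R embeddings (768-d) into the decoder hidden space (2048-d)."""
    def __init__(self, d_in=768, d_hidden=3072, d_out=2048, p_drop=0.10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(d_hidden, d_out),
            nn.LayerNorm(d_out),
        )

    def forward(self, x):
        return self.net(x)

def symmetric_info_nce(z_src, z_tgt, temperature=0.05):
    """Symmetric InfoNCE with in-batch negatives between projected Khmer vectors
    and the decoder's hidden states at the reserved slot (temperature is assumed)."""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature
    labels = torch.arange(z_src.size(0), device=z_src.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def alignment_penalties(pred, target, eps=1e-6):
    """Direction (cosine) and log-norm penalties that keep the projected vector
    close to the teacher hidden state in both angle and magnitude."""
    direction = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    log_norm = (pred.norm(dim=-1).clamp_min(eps).log()
                - target.norm(dim=-1).clamp_min(eps).log()).abs().mean()
    return direction, log_norm
```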
The aligned vector is expanded into eight reserved tokens (f0 to f7) that sit in the prompt like ordinary context. LoRA adapters (rank 16, alpha 16) on the attention and MLP projections, along with a learned slot scaler and expander, are trained on synthetic instruction-following tasks. Every third step the pipeline compares the supervised loss with slots against a variant where the slots are zeroed and adds the penalty max(0, L_sft - L_zero), which is positive only when zeroing the slots does not hurt, pushing the decoder to actually use the injected signal. Auxiliary cosine and InfoNCE terms prevent drift away from the Stage A target, and the synthetic prompt set covers translation, summarization, bullet points, and question answering so the decoder has multiple ways to rely on the injected signal. A final normalization matches slot norms to the median embedding norm so the decoder treats the injected tokens as native context.
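A sketch of the Stage B mechanics under the same assumptions; the exact expander parameterization is assumed, while the eight slots, learned slot scaler, norm matching, and the max(0, L_sft - L_zero) usage contrast follow the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotExpander(nn.Module):
    """Expands one aligned 2048-d vector into K=8 soft slot embeddings (f0..f7)."""
    def __init__(self, d_model=2048, num_slots=8):
        super().__init__()
        self.num_slots = num_slots
        self.expand = nn.Linear(d_model, d_model * num_slots)  # assumed parameterization
        self.scale = nn.Parameter(torch.ones(num_slots, 1))    # learned slot scaler

    def forward(self, v, target_norm):
        # v: (batch, 2048); target_norm: median embedding norm of the decoder vocabulary
        slots = self.expand(v).view(-1, self.num_slots, v.size(-1)) * self.scale
        return F.normalize(slots, dim=-1) * target_norm

def usage_contrast(loss_with_slots, loss_zeroed_slots):
    """Positive only when zeroing the slots does not hurt the supervised loss,
    so the frozen decoder is pushed to actually consume the injected signal."""
    return torch.clamp(loss_with_slots - loss_zeroed_slots, min=0.0)
```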
Architecture callout

Figure 1 in the paper diagrams the flow: Khmer text enters the frozen XLM-R encoder, a contrastive projector drops the sentence vector into a reserved decoder position, and the expanded slots hand the content to the instruction-tuned Llama decoder, which responds in English without ever seeing Khmer tokens.
Stage A uses 100k ParaCrawl v2 English-Khmer pairs with Khmer strings truncated at 256 characters and a 40k pair holdout for retrieval evaluation. Stage B relies on 40k synthetic instruction examples plus 2k validation, generated by prompting a Llama 3 70B model with the English reference to anchor outputs. Prompts cover tasks such as translate to English, summarize in English, bullet pointify, and question answering about the passage, with filtering to keep Khmer inputs between 12 and 256 characters and ensure the reserved slot appears.
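The filtering step reduces to a simple predicate; a sketch, assuming the rendered prompt carries a reserved placeholder for the injected content (the placeholder name here is hypothetical):

```python
def keep_example(khmer_text: str, rendered_prompt: str, slot_placeholder: str = "<khm>") -> bool:
    """Stage B filter: Khmer input between 12 and 256 characters, and the reserved
    slot placeholder must appear in the rendered prompt ("<khm>" is a hypothetical name)."""
    return 12 <= len(khmer_text) <= 256 and slot_placeholder in rendered_prompt
```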
The base Llama-3.2-1B decoder remains frozen throughout. LoRA rank is 16 with alpha 16, the slot count stays at eight, and the usage contrast is applied every third optimization step. Mixed precision (fp16 queue, bf16 projector) keeps training lightweight, and the injection pipeline mirrors inference exactly: encode Khmer with XLM-R, project, expand into slots, then decode.
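A sketch of the adapter setup, assuming the Hugging Face peft library; the target-module list is an assumption consistent with "attention and MLP projections", and the model id is illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

lora_cfg = LoraConfig(
    r=16,                      # LoRA rank from the paper
    lora_alpha=16,             # alpha from the paper
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # the base decoder stays frozen; only the adapters train here
```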
Retrieval on a 1024-pair benchmark ranks each Khmer sentence against all English candidates. The contrastive bridge alone explains most of the improvement, indicating that bypassing fragmented tokenization matters more than extra decoder tuning. Stage B helps primarily by enforcing slot usage, which pays off in generation tasks.
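The ranking itself is straightforward; a sketch of recall@1 over the benchmark, assuming cosine similarity between projected Khmer vectors and the English-side targets:

```python
import torch
import torch.nn.functional as F

def recall_at_1(khmer_vecs: torch.Tensor, english_vecs: torch.Tensor) -> float:
    """Rank each projected Khmer sentence against all English candidates and
    report the fraction whose top-1 match is its paired translation."""
    k = F.normalize(khmer_vecs, dim=-1)
    e = F.normalize(english_vecs, dim=-1)
    sims = k @ e.t()                               # (N, N) similarity matrix
    top1 = sims.argmax(dim=-1)
    gold = torch.arange(k.size(0), device=top1.device)
    return (top1 == gold).float().mean().item()
```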
LLM-as-judge experiments follow the paper's Table 2 protocol: a Llama 3 70B judge receives anonymized outputs alongside the human reference. LLINK wins 87% of understanding comparisons and 80% of Q&A comparisons against the base model, and 67% and 59% respectively against direct fine-tuning. Ties remain substantial (15 to 36%), signaling room for future work on lexical fidelity.
Qualitative examples in Table 3 highlight both sides: LLINK grounds answers about policy statements and class schedules that the base model garbles, yet it can still drift on units (30 MW becomes 1.5 kW) or over-summarize when slot compression blurs rare terminology.
Treating Khmer as an injected modality swaps autoregressive processing of 100-plus tokens for a single encoder pass and eight decoder slots. The encoder pass can be cached across follow-up questions, cutting decoder tokens per prompt by roughly a factor of three and letting retrieval or question-answering systems reuse the same slot vector across multiple prompts. This shift preserves the frozen decoder's English strengths while sparing the context window for instructions, exemplars, or longer answers, and it plays nicely with downstream batching because the slot tokens are constant length.
Eight slots act like a semantic bottleneck. They preserve meaning but not always surface form, so numbers, dates, and rare entities can drift. The decoder remains frozen and English-dominant, so without explicit slot supervision it may paraphrase around foreign content. The usage contrast keeps the model honest but does not guarantee exact copy behavior.
These weaknesses show up in judge evaluations as preference losses on prompts that demand literal translations. LLINK excels at question answering and summarization, but still trails a carefully curated translation system when the task is to mirror a sentence verbatim.