Cohere For AI

Teaching LLMs new spoken languages

Project Info

Type: Research
Year: 2025

Teaching Models New Languages

At Cohere For AI, we're investigating an approach to enhancing the multilingual capabilities of existing language models by conceptualizing different languages as distinct modalities. Our hypothesis is that, as multimodal models mature, treating a new language the same way these models treat visual input could significantly expand a model's linguistic versatility.

To validate and develop this idea, we are building on the LLaVA (Large Language and Vision Assistant) architecture. Originally developed to integrate visual and linguistic information, LLaVA combines pretrained vision models (e.g., CLIP or ViT) with LLMs to enable sophisticated vision-language understanding. The core idea behind LLaVA is to map visual inputs into the language model's embedding space, enabling the model to interpret visual context effectively.

We used Cohere For AI's Aya dataset to source language data, primarily in Turkish and Japanese, with approximately 20 million tokens per language drawn from a variety of conversational and Q&A examples. After tokenizing this data, we applied the LLaVA-style approach, which operates through two components: a representation model and a mapping model.
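As an illustration, the sketch below shows how this data-preparation step might look using the Hugging Face `datasets` and `transformers` libraries; the dataset field names ("inputs", "targets", "language"), the language filter, and the tokenizer checkpoint are assumptions rather than our exact pipeline.

```python
# Sketch of the data-preparation step; field names and checkpoint are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the Aya dataset and keep only the target languages.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
subset = aya.filter(lambda ex: ex["language"] in {"Turkish", "Japanese"})

# A multilingual tokenizer for the representation model (assumed checkpoint).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(example):
    # Join the prompt and its answer into one training sequence.
    text = example["inputs"] + "\n" + example["targets"]
    return tokenizer(text, truncation=True, max_length=512)

tokenized = subset.map(tokenize, remove_columns=subset.column_names)
```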

Representation Model

The representation model's primary purpose is to transform raw input data from a specific modality (visual data, in LLaVA's original context) into meaningful embeddings.

In our multilingual adaptation, the representation model is a transformer-based masked language model (MLM). We use tokenizers to process text inputs from different languages and generate embeddings that capture linguistic structure, semantics, and cultural context. Training involves randomly masking tokens and predicting them with the standard MLM objective:

\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P(x_i \mid x_{\setminus i})

where M is the set of masked tokens, x_i is a masked token, and x_{\setminus i} denotes the rest of the context.
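For concreteness, here is a minimal sketch of this objective using a Hugging Face masked language model; the checkpoint and the 15% masking ratio are illustrative assumptions.

```python
# Minimal sketch of MLM training; checkpoint and masking ratio are assumptions.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# The collator masks random tokens and sets labels only at masked positions, so the
# returned cross-entropy loss implements the MLM objective (averaged over masked tokens).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("Merhaba, bugün hava çok güzel.")])
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
outputs.loss.backward()  # one gradient step on the masked-token prediction loss
```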

Mapping Model

The mapping model's primary role is to bridge the gap between the representation model's embeddings and the target LLM's embedding space. We implement this mapping as an MLP with GELU activations and dropout to ensure robust learning. The mapping function is trained to minimize the difference between the mapped representation-model embeddings and the language model's own embeddings. The training objective is defined as:

\mathcal{L}_{\text{mapping}} = \frac{1}{N}\sum_{n=1}^{N} \| f_{\text{mapping}}(\mathbf{e}_n) - \mathbf{e}_{\text{LLM}, n} \|_2^2

where \mathbf{e}_n is an embedding from the representation model, f_{\text{mapping}}(\mathbf{e}_n) is its mapped counterpart, \mathbf{e}_{\text{LLM}, n} is the corresponding target embedding from the LLM, and N is the batch size.
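A minimal PyTorch sketch of such a mapping model is shown below; the embedding dimensions, hidden size, and dropout rate are illustrative assumptions, not our exact configuration.

```python
# Sketch of the mapping MLP (GELU + dropout); all dimensions are assumptions.
import torch
import torch.nn as nn

class MappingMLP(nn.Module):
    def __init__(self, rep_dim=768, llm_dim=2560, hidden_dim=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rep_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, e):
        return self.net(e)

mapper = MappingMLP()
e = torch.randn(32, 768)        # batch of representation-model embeddings
e_llm = torch.randn(32, 2560)   # corresponding target-LLM embeddings (placeholder)

# L_mapping = (1/N) * sum_n || f_mapping(e_n) - e_LLM,n ||_2^2
loss = ((mapper(e) - e_llm) ** 2).sum(dim=-1).mean()
loss.backward()
```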

[Figure: LLaVA architecture]

By extending LLaVA's multimodal philosophy to the multilingual domain, we can equip existing language models with new languages. We are currently training this mapping for Microsoft's Phi to teach it new languages, followed by a post-training step of Q&A fine-tuning.
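As a rough illustration of how the mapped embeddings could be consumed by the target LLM, the sketch below passes them to a Phi model through the Hugging Face `inputs_embeds` interface; the checkpoint name, shapes, and wiring are assumptions, not our exact setup.

```python
# Sketch of feeding mapped embeddings into Phi; checkpoint and shapes are assumptions.
import torch
from transformers import AutoModelForCausalLM

phi = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# `mapped` stands in for the output of the representation + mapping models for a
# Turkish or Japanese input: (batch, sequence_length, phi_hidden_size).
mapped = torch.randn(1, 16, phi.config.hidden_size)

# Phi consumes the mapped embeddings in place of token ids, so the new language
# enters through the mapping rather than through Phi's own tokenizer.
outputs = phi(inputs_embeds=mapped)
next_token_logits = outputs.logits[:, -1, :]
```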