
This AI Paper from SambaNova Presents a Machine Learning Method to Adapt Pretrained LLMs to New Languages


The rapid advancement of large language models has ushered in a new era of natural language processing capabilities. However, a significant challenge persists: most of these models are trained primarily on a limited set of widely spoken languages, leaving a vast range of linguistic diversity unexplored. This limitation not only restricts access to cutting-edge language technologies but also perpetuates a technological divide across linguistic communities.

In this study, researchers tackle this challenge by proposing a novel AI method named SambaLingo. The approach aims to adapt existing, high-performing language models to new languages, leveraging the strengths of pre-trained models while tailoring them to the unique characteristics of the target language.

Previous efforts to address this issue have primarily focused on training monolithic multilingual or language-specific models from scratch. However, these approaches face significant hurdles, including the curse of multilinguality, data scarcity, and the substantial computational resources required. Adapting English-centric models to new languages has emerged as a promising alternative, demonstrating the potential to outperform language-specific models pre-trained from scratch.

The SambaLingo methodology begins with the selection of a suitable base model that has already demonstrated strong performance in its original language. In this study, the researchers selected the open-source Llama 2 7B model, renowned for its English-language capabilities, as their starting point.

To effectively capture the linguistic nuances of the target language, the researchers expanded the model's vocabulary by adding non-overlapping tokens from the target language and initializing them using sub-word embeddings from the original tokenizer. This step ensures that the model can tokenize and represent the new language efficiently, paving the way for seamless adaptation.
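
The mechanics of this step can be sketched with the Hugging Face transformers library. This is a minimal illustration, assuming mean-pooled sub-word embeddings for initialization; the example target-language tokens and the model identifier are placeholders rather than the paper's exact artifacts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"   # English-centric base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical target-language tokens not already present in the vocabulary.
new_tokens = ["szó", "nyelv"]
new_tokens = [t for t in new_tokens if t not in tokenizer.get_vocab()]

# Record, for each new token, the sub-word pieces the ORIGINAL tokenizer
# produces, before the vocabulary is expanded.
piece_ids = [tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens]

old_size = len(tokenizer)
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new embedding from its original sub-word embeddings
# (mean-pooled here, as one simple choice).
emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for i, ids in enumerate(piece_ids):
        if ids:
            emb[old_size + i] = emb[ids].mean(dim=0)
```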

Next, the researchers employed a continual pre-training approach, feeding the model a carefully curated mixture of English and target-language web data sourced from CulturaX. The data mixture followed a 1:3 ratio biased towards the target language, striking a balance between preserving the model's existing knowledge and adapting it to the new linguistic landscape.
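
Such a mixture can be assembled, for illustration, with the Hugging Face datasets library; the CulturaX repository identifier and the choice of Hungarian as the target language are assumptions for this sketch.

```python
from datasets import load_dataset, interleave_datasets

# Stream English and target-language (here: Hungarian, as an example) web data from CulturaX.
english = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)
target = load_dataset("uonlp/CulturaX", "hu", split="train", streaming=True)

# 1:3 English-to-target ratio, i.e. roughly 25% English and 75% target-language text.
mixture = interleave_datasets([english, target], probabilities=[0.25, 0.75], seed=42)
```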

To further align the model with human preferences, the researchers implemented a two-stage process: supervised fine-tuning (SFT) followed by direct preference optimization (DPO). During SFT, they used the ultrachat-200k dataset and its machine-translated version. For DPO, they employed the ultrafeedback and cai-conversation-harmless datasets, mixing them at a 10:1 ratio of English to machine-translated data.
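
A rough sketch of this two-stage pipeline, assuming the TRL library, a hypothetical checkpoint path, and omitting the machine-translated mixes and hyperparameters (argument names also vary across TRL versions), could look like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

checkpoint = "path/to/language-adapted-checkpoint"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Stage 1: supervised fine-tuning on ultrachat-200k (the paper also mixes in
# a machine-translated version in the target language).
sft_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
sft = SFTTrainer(
    model=checkpoint,
    train_dataset=sft_data,
    processing_class=tokenizer,
    args=SFTConfig(output_dir="sft-out"),
)
sft.train()

# Stage 2: direct preference optimization on preference data, mixed in the
# paper at a 10:1 ratio of English to machine-translated examples.
dpo_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dpo = DPOTrainer(
    model=sft.model,
    train_dataset=dpo_data,
    processing_class=tokenizer,
    args=DPOConfig(output_dir="dpo-out"),
)
dpo.train()
```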

The researchers rigorously evaluated the SambaLingo models across various tasks and languages, including language modeling, translation, text classification, open-book and closed-book question answering, and various natural language understanding benchmarks, as shown in Table 1. The models were tested on nine typologically diverse languages: Arabic, Thai, Turkish, Japanese, Hungarian, Russian, Bulgarian, Serbian, and Slovenian.

Across multiple benchmarks, the SambaLingo models consistently outperformed existing state-of-the-art models in these languages. For example, on perplexity, which measures language-modeling performance, the SambaLingo models achieved lower scores than all existing baselines on a held-out set from their training data (Figure 1). Moreover, when scaled up to the larger Llama 2 70B size, the SambaLingo models exhibited even better performance, surpassing their 7B counterparts across multiple benchmarks despite being trained on fewer tokens.
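
For reference, held-out perplexity of this kind can be computed with a short script; the model path and evaluation texts below are placeholders, not the paper's evaluation set.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/adapted-model"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(texts):
    """Token-level perplexity over a list of held-out documents."""
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            out = model(ids, labels=ids)          # loss is mean NLL over predicted tokens
            n = ids.size(1) - 1                   # number of tokens with a prediction target
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)

print(perplexity(["Example held-out target-language document."]))
```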

To validate the quality of the models' outputs and their alignment with human preferences, the researchers employed GPT-4 as an impartial judge, evaluating the models' responses to real user prompts. The results were promising, with SambaLingo consistently outperforming other models in the same languages, as judged by GPT-4's preferences and accompanying explanations.
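
A pairwise judging setup of this kind can be sketched with the OpenAI API; the judge prompt below is an illustrative assumption, not the paper's exact rubric.

```python
from openai import OpenAI

client = OpenAI()

def judge(prompt, response_a, response_b):
    """Ask GPT-4 which of two candidate responses better answers the user prompt."""
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "You are an impartial judge. Given the user prompt and two candidate "
                "responses, say which is better and briefly explain why.\n\n"
                f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
            ),
        }],
    )
    return verdict.choices[0].message.content
```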

In summary, the SambaLingo methodology represents a significant stride toward democratizing artificial intelligence across linguistic diversity. By leveraging the strengths of existing high-performing models and tailoring them to new linguistic landscapes, the approach offers a scalable and efficient solution to the challenge of language barriers. With its state-of-the-art performance and alignment with human preferences, SambaLingo paves the way for a future where the benefits of AI transcend linguistic boundaries, fostering inclusivity and accessibility for all.


Check out the Paper. All credit for this research goes to the researchers of this project.






Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology (IIT), Kanpur. He is a Machine Learning enthusiast and is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.


