
Significant advancements in speech technology have been made over the past decade, allowing it to be incorporated into various consumer products. Training a good machine learning model for such tasks requires a great deal of labeled data, in this case many thousands of hours of audio with transcriptions. This data exists for only a small number of languages. For example, of the 7,000+ languages in use today, only about 100 are supported by current speech recognition systems.
Recently, the amount of labeled data needed to build speech systems has been drastically reduced thanks to self-supervised speech representations. Despite this progress, major current efforts still only cover around 100 languages.
Facebook’s Massively Multilingual Speech (MMS) project addresses some of these obstacles by combining wav2vec 2.0 with a new dataset that contains labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages. According to their findings, the Massively Multilingual Speech models outperform state-of-the-art methods while supporting ten times as many languages.
Because the largest available speech datasets include at most 100 languages, their initial goal was to collect audio data for hundreds of languages. To do so, they turned to religious texts such as the Bible, which have been translated into many languages and whose translations have been studied extensively in text-based language translation research. People have recorded themselves reading these translations and made the audio files available online. This research compiled a collection of New Testament readings in over 1,100 languages, yielding an average of 32 hours of data per language.
Their investigation reveals that the proposed models perform similarly well for female and male voices, even though this data comes from a specific domain and is often read by male speakers. And although the recordings are religious, the research indicates that this does not unduly bias the model toward producing religious language. According to the researchers, this is because they employ a Connectionist Temporal Classification (CTC) approach, which is more constrained than large language models (LLMs) or sequence-to-sequence models for speech recognition.
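To make that distinction concrete, a CTC head is essentially a per-frame character classifier trained with a loss that marginalizes over all possible alignments; there is no autoregressive decoder that could drift toward domain-specific phrasing. Below is a minimal PyTorch sketch of the idea; the vocabulary size, feature dimension, and shapes are illustrative, not the actual MMS configuration.

```python
# Minimal sketch of a CTC speech-recognition head in PyTorch.
# Vocabulary size, feature dimension, and batch shapes are illustrative.
import torch
import torch.nn as nn

vocab_size = 32        # characters + CTC blank (index 0)
feature_dim = 768      # e.g. a wav2vec 2.0 encoder's output size

# CTC only needs a linear projection on top of the encoder: each frame
# independently predicts a character (or blank), and the loss sums over
# all frame-to-character alignments.
head = nn.Linear(feature_dim, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

frames = torch.randn(100, 2, feature_dim)        # (time, batch, feature)
log_probs = head(frames).log_softmax(dim=-1)     # (time, batch, vocab)

targets = torch.randint(1, vocab_size, (2, 20))  # character IDs, no blanks
input_lengths = torch.full((2,), 100)
target_lengths = torch.full((2,), 20)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```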
The team preprocessed the data by combining a highly efficient forced-alignment approach that can handle recordings of 20 minutes or longer with an alignment model trained on data from over 100 different languages. To eliminate potentially skewed data, they applied several iterations of this procedure plus a cross-validation filtering step based on model accuracy. They integrated the alignment technique into PyTorch and made the alignment model publicly available so that other researchers can use it to create new speech datasets.
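The alignment tooling referred to here is CTC-based forced alignment, and recent torchaudio releases (2.1+) expose a forced_align function. As a rough illustration, the sketch below runs it with a bundled English ASR model and a placeholder audio file and transcript, not the actual multilingual MMS alignment model.

```python
# Sketch of CTC forced alignment with torchaudio (>= 2.1, which added
# torchaudio.functional.forced_align). Model, file, and transcript are
# illustrative stand-ins for the MMS alignment pipeline.
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H  # any CTC model works
model = bundle.get_model()

waveform, sr = torchaudio.load("reading.wav")         # placeholder file
waveform = F.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)                    # (batch, frames, vocab)
    log_probs = torch.log_softmax(emissions, dim=-1)

# Encode the transcript with the model's character set; '|' marks spaces.
labels = bundle.get_labels()
char_to_id = {c: i for i, c in enumerate(labels)}
transcript = "HELLO WORLD".replace(" ", "|")
targets = torch.tensor([[char_to_id[c] for c in transcript]],
                       dtype=torch.int32)

# forced_align returns the best frame-level token path and its scores;
# token start/end frames (and hence word timestamps) fall out of the path.
alignment, scores = F.forced_align(log_probs, targets, blank=0)
```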
Only 32 hours of data per language is not enough to train conventional supervised speech recognition models. The team therefore relied on wav2vec 2.0 to train effective systems, drastically reducing the amount of labeled data required. Specifically, they trained self-supervised models on roughly 500,000 hours of speech data spanning over 1,400 unique languages, roughly five times more languages than any previous effort.
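Conceptually, this means pretraining a large encoder on unlabeled audio and then fine-tuning it with a lightweight CTC head on the small labeled set. A minimal sketch of the fine-tuning side, assuming the Hugging Face transformers port with a placeholder checkpoint and placeholder data (the original MMS pipeline is fairseq-based):

```python
# Minimal sketch of fine-tuning a pretrained wav2vec 2.0 encoder with a
# CTC head on a small labeled set, via the Hugging Face transformers
# port. Checkpoint, audio, and transcript are all placeholders.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"   # placeholder checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint, ctc_loss_reduction="mean")

# Freeze the convolutional feature extractor: with only ~32 hours of
# labels, just the transformer layers and the small CTC head are adapted.
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

audio = [np.zeros(16000, dtype=np.float32)]  # placeholder 1 s clip
texts = ["HELLO WORLD"]                      # placeholder transcript

inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   padding=True)
labels = processor.tokenizer(texts, return_tensors="pt",
                             padding=True).input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()
```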
The researchers used existing benchmark datasets such as FLEURS to evaluate the performance of models trained on the Massively Multilingual Speech data. Using a 1B-parameter wav2vec 2.0 model, they trained a multilingual speech recognition system covering over 1,100 languages. Performance degrades only slightly as the number of languages grows: the character error rate rises by only about 0.4% going from 61 to 1,107 languages, while language coverage increases nearly 18-fold.
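The released checkpoints can also be tried through the Hugging Face transformers port. A minimal transcription sketch, with model ID and language codes as documented on the hub and a placeholder audio array:

```python
# Sketch: multilingual transcription with the MMS checkpoints ported to
# Hugging Face transformers. The audio array is a placeholder; language
# adapters are selected by ISO code (e.g. "fra" for French).
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# Switch the tokenizer and the CTC head to the target language's adapter.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

audio = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits
ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```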
Comparing models trained on the Massively Multilingual Speech data to OpenAI’s Whisper, the researchers found that the former achieve half the word error rate while covering 11 times as many languages. This illustrates that the model compares favorably with the state of the art in speech recognition.
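For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained implementation:

```python
# Minimal word error rate (WER): word-level edit distance divided by
# the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```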
The team also used their dataset together with publicly available datasets such as FLEURS and CommonVoice to train a language identification (LID) model for more than 4,000 languages, and then evaluated it on the FLEURS LID task. The results show that performance remains excellent even when 40 times as many languages are supported. They also developed speech synthesis systems for more than 1,100 languages, even though the vast majority of existing text-to-speech models are trained on single-speaker voice datasets.
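Both the LID and TTS checkpoints have transformers ports as well; a brief sketch, with model IDs as published on the Hugging Face hub and placeholder inputs:

```python
# Sketch: language identification and text-to-speech with the released
# MMS checkpoints via transformers. Inputs are placeholders.
import numpy as np
import torch
from transformers import (AutoFeatureExtractor, AutoTokenizer,
                          VitsModel, Wav2Vec2ForSequenceClassification)

# --- Language identification (126-language variant; larger ones exist) ---
extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-126")
lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/mms-lid-126")

audio = np.zeros(16000, dtype=np.float32)  # placeholder clip
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.inference_mode():
    logits = lid_model(**inputs).logits
print(lid_model.config.id2label[int(logits.argmax(-1))])

# --- Text-to-speech (one single-speaker VITS model per language) ---
tts = VitsModel.from_pretrained("facebook/mms-tts-eng")
tok = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
with torch.inference_mode():
    speech = tts(**tok("Hello world", return_tensors="pt")).waveform
```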
The team foresees a world where one model can handle many speech tasks across all languages. While they trained separate models for each task (recognition, synthesis, and language identification), they believe that in the future a single model will be able to handle all of these functions and more, improving performance in every area.
Check out the Paper, Blog, and GitHub Link. Don’t forget to join our 22k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100s of AI Tools in AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advancements in technology and their real-life applications.