In the early days of online translation, the software was clunky and translated text word for word, which often led to serious misunderstandings because the nuances of language were lost. Microsoft Translator has made translation easier, faster, and more accurate, and has made synchronous multi-language communication possible.
Microsoft Translator started with the world’s most frequently spoken languages. Today, less common languages are being added regularly. They are used to teach younger generations, to preserve languages that are disappearing, and to make access to knowledge equitable, no matter what language you speak.
Microsoft Translator, powered by Azure Cognitive Services, uses AI technology to parse text in one language and translate it into another. To do this, it needs a large, accurately annotated training dataset to prepare the translator model for each language.
Microsoft Translator struggled to get datasets of the size they needed for some of the less frequently spoken or less well cataloged languages. Creating a dataset takes time, knowledge, and resources. Translating into a language that uses a different alphabet first requires work on phonetic similarity and transliteration, which calls for expert staff and linguists. You must find fluent speakers, collect data points, annotate each data point, and run quality assurance tests to ensure accuracy.
To speed up time to market, Microsoft turned to outside sources to collect and prepare the data they needed.
Appen was the vendor Microsoft Translator chose to work with on this language project. We provided the expertise, resources, and creative solutions needed to create translated datasets for rare languages and run the necessary quality checks.
Our process included working with local resources to source translations from fluent speakers. We collected data, annotated it by transcribing and translating each data point, and evaluated the model outputs for quality assurance and accuracy. We also developed a service that allows Microsoft to generate multiple translations for gender-ambiguous source sentences, addressing bias in translation.
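As a rough illustration only, here is how a single collected data point with more than one translation for a gender-ambiguous sentence might be represented and checked. This is a minimal sketch in Python: the `TranslationRecord` class, its field names, and the `passes_quality_check` rule are hypothetical and do not describe the actual tooling used by Appen or Microsoft.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TranslationRecord:
    """One collected data point (hypothetical structure, for illustration)."""
    source_text: str
    source_lang: str
    target_lang: str
    # Maps a variant label (e.g. "feminine", "masculine") to a translation.
    translations: Dict[str, str] = field(default_factory=dict)
    # Set by an annotator when the source sentence does not specify gender.
    gender_ambiguous: bool = False


def passes_quality_check(record: TranslationRecord) -> bool:
    """Assumed QA rule: every translation is non-empty, and a
    gender-ambiguous source sentence carries at least two variants."""
    if not record.translations:
        return False
    if any(not text.strip() for text in record.translations.values()):
        return False
    if record.gender_ambiguous and len(record.translations) < 2:
        return False
    return True


record = TranslationRecord(
    source_text="The doctor is here.",
    source_lang="en",
    target_lang="es",
    translations={
        "feminine": "La doctora está aquí.",
        "masculine": "El doctor está aquí.",
    },
    gender_ambiguous=True,
)
print(passes_quality_check(record))
```

Keeping every variant on the same record, rather than forcing a single translation, is one simple way to avoid baking a gender assumption into the dataset.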
Our work for Microsoft Translator encompassed three stages of the data for AI lifecycle: data sourcing, data preparation, and model evaluation by humans. By completing this work, we helped Microsoft Translator get the data they needed at the highest possible quality, and on time.
As a result of our partnership, Microsoft Translator now offers 110 languages that consumers can use for translation and for working in other languages. Appen supported the data gathering process for 108 of those 110 languages.
Of the 110 available languages, some of the newer and less commonly spoken ones include:
- Dari & Pashto
- Literary Chinese
- Marathi, Gujarati, Punjabi, Malayalam, and Kannada
- European Portuguese and Brazilian Portuguese
The links lead to Microsoft blog posts that go in depth about each language and the process of adding it to Microsoft Translator.
No matter the client or the size of the project, we’re proud to create the highest quality data possible and to be part of the solution for making AI better. Representative data is how we make AI more ethical. Our work with Microsoft Translator to represent all languages, not just those with the most speakers, is part of that goal.