A leading multinational technology company teamed up with us to help develop an automatic speech recognition (ASR) system designed from the ground up specifically for children's applications.
It may come as no surprise that most speech recognition systems are designed with adult speakers in mind. To date, the nuances and idiosyncrasies of children's speech have rarely been built into speech-driven applications intended for children, which often leaves those applications unable to reliably process interactions with younger users.
This was precisely the problem one leading multinational technology company needed to solve. The business had found that its speech recognition system, originally trained on adult speech data, did not account for the many ways children speak differently, which made it ineffective in applications designed for children.
Children typically speak at a higher pitch and with greater temporal and spectral variability. Their speech also contains more irregularities, hesitations, and mispronunciations (for example, "uh," "um," and "fwoggy" instead of "froggy").
To close this gap, the company set out to build a new ASR system for North American English, designed from the ground up specifically for children's applications.
The firm approached us for help based on our global reputation for expertise in languages, transcription, and speech recognition systems. The client team first asked for guidance on the new project, and then for help collecting and transcribing children's speech data across a broad range of demographics. Because the ASR was intended primarily for educational technology applications, our team of highly skilled linguists developed scripts tailored to the target education-related speech. These scripts covered an appropriate mix of numbers, keywords, short phrases, and short educational sentences.
The full project scope covered:
- Recruiting and working with 400 child speakers
- Targeting a cross-section of the required demographics: 50% Caucasian, 40% African American, 10% Latino
- Data collection and transcription
- Engaging native speakers of US English from a range of regional dialect areas, including the Northeast, Midwest, South, and West
Working with us allowed the multinational technology company to meet its objective of an ASR built specifically for children's speech, within its desired time frame and on budget.
We successfully managed the collection and transcription of 105 hours of audio, comprising 60,000 utterances, which helped the client design, build, and deliver the ASR it needed to take to market.
The company has since been able to take the acoustic models built into its new ASR and apply them to a range of North American English edutainment platforms and apps designed specifically for children.
One of our key recommendations for this project concerned which age groups to focus on. The client had originally planned to focus data collection on 4-to-9-year-olds to best meet its needs in the edutainment space. However, our linguists recommended targeting two age groups, 4-to-7-year-olds and 8-to-14-year-olds, alongside the other demographic requirements to ensure optimal coverage. This approach proved effective.
We were also able to recruit a large number of participants on relatively short notice. We drew on a "family and friends" network, including schools and church groups, to reach interested parents who were happy to consent to their child's participation. Recruiting through these trusted channels, with a respectful and communicative process for involving minors in data collection, meant parents felt comfortable taking part, which helped the project run more smoothly and successfully.
Finally, our experience collecting and transcribing children's speech helped keep the project on track and within the desired time frame. Recording children, especially 4-to-9-year-olds, can be tricky. By deploying supervisors experienced in working with children, pairing images with text prompts, and keeping recording sessions short but productive, we ensured successful delivery for our global tech client.