This post was co-authored by Chun Ming Chen, Technical Program Manager at Microsoft, and Max Cazenadi, Senior Data Scientist, along with Luyi Huang, Nicholas Cao, and James Tayali, students at the University of California, Berkeley.
This blog post is about the UC Berkeley Virtual Tutor Project and the speech recognition technologies that were tested as part of that effort. We share best practices in machine learning and artificial intelligence for selecting models and engineering training data for speech and image recognition. These speech recognition models, integrated with immersive games, are currently being tested in California middle schools.
Idea and context
The University of California, Berkeley has a new program, the Fung Fellowship, founded by alum and philanthropist Coleman Fung. In this program, students develop technology solutions to educational challenges, such as supporting the education of disadvantaged children. The solution described here is a virtual tutor that listens to children and interacts with them while they play educational games. The games were developed by Blue Goji, a technology company founded by Coleman. This work is being done with the support of the Partnership for a Healthier America, a nonprofit organization chaired by Michelle Obama.
GoWings Safari, a safari-themed educational game, is enabled with a virtual tutor that interacts with the user.
One of the students working on the project is James Tayali, a first-generation UC Berkeley graduate from Malawi. James said: “This safari game is important for children who grow up in environments that expose them to childhood trauma and other negative experiences. Such children struggle to focus and excel academically. Combining the educational experience with interactive, immersive games can improve their learning.”
This is an area James can relate to: as an orphan from Malawi, he struggled to focus in school while dealing with family problems and working part-time jobs to support himself. Despite humble beginnings, James worked hard and attended UC Berkeley with scholarship support from the MasterCard Foundation. Now he is paying it forward to the next generation of kids. James added, “This project can help children who share stories like mine to let go of traumatic past experiences, focus on their current education, and have hope for their future.”
James Tayali (left), UC Berkeley public health major and Class of 2017 alum, and Coleman Fung (right), posing with the Safari game displayed on the monitor screen.
The fellowship program was taught by Chun Ming Chen, a Microsoft Search and Artificial Intelligence program manager who is also a UC Berkeley alumnus. He also advised the team building the virtual tutor: James Tayali, a public health major who served as the team's product designer; Luyi Huang, an Electrical Engineering and Computer Science (EECS) student who led the programming tasks; and Nicholas Cao, an applied math and data science student who handled data collection and analysis. Most of the work was done remotely across three locations: Redmond, WA; Berkeley, CA; and Austin, TX.
UC Berkeley Fung Fellowship students Luyi Huang (left) and Nicholas Cao (right).
Chun Ming teaching a speech recognition and artificial intelligence lecture to UC Berkeley students.
Insights from the Virtual Tutor Project
This post shares the team's technical insights in two areas:
- Model selection strategies and engineering considerations for eventual real-world deployment, so that others working on similar problems can invest with more confidence in a model that fits their scenario.
- Training data engineering techniques that are useful references not only for speech recognition, but also for other scenarios such as image recognition.
Selecting a speech recognition model
We explored speech recognition technologies from Carnegie Mellon University (CMU), Google, Amazon, and Microsoft, and ultimately narrowed the field to the following options:
1. Bing Speech Recognition Service
Microsoft Bing's paid speech recognition service achieved 100% accuracy, although we had to wait about 4 seconds for results to come back from Bing's remote servers. While the accuracy was impressive, we did not have the flexibility to adapt this black-box model to different speech tones and background noise. One possible workaround is to process the output of the black box (i.e., post-processing).
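As an illustration of that post-processing idea, here is a minimal sketch (not the project's actual code) that snaps whatever transcript a cloud recognizer returns to the closest phrase in the game's small vocabulary; the vocabulary and cutoff below are illustrative assumptions.

```python
# Hypothetical post-processing sketch: map a raw cloud transcript to the
# closest entry in the game's small vocabulary. Vocabulary and cutoff are
# illustrative assumptions, not the project's actual values.
import difflib
from typing import Optional

GAME_VOCABULARY = ["lion", "elephant", "zebra", "giraffe", "rhino"]

def post_process(transcript: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known game word, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        transcript.lower().strip(), GAME_VOCABULARY, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None

print(post_process("lying"))    # maps a near-miss transcript to "lion"
print(post_process("weather"))  # unrelated words return None
```

Because the game only needs to react to a handful of commands, even simple string similarity like this can recover from small transcription errors.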
2. CMU Open Source Statistical Model
We also explored free, lower-latency speech recognition models that run locally rather than on a remote server. Ultimately, we chose Sphinx, an open-source statistical model, which had an initial accuracy of 85% and improved on the Bing Speech API's latency, returning results in 3 seconds rather than 4. Unlike Bing's black-box solution, we could look inside the model to improve accuracy; for example, we could reduce the word search space used for dictionary lookup or adapt the model with more speech training data. Sphinx has a 30-year legacy and was originally developed by CMU researchers who, coincidentally, are now at Microsoft Research (MSR), among them Xuedong Huang, Microsoft Technical Fellow; Fil Alleva, Partner Research Manager; and Hsiao-Wuen Hon, Corporate Vice President of MSR Asia.
CMU's open-source speech recognition model relies by default on human-engineered features and linguistic structure.
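To make the idea of shrinking the search space concrete, here is a rough sketch using the open-source SpeechRecognition package's Sphinx backend as a stand-in for the project's setup; the keyword list is an illustrative assumption, not the game's actual vocabulary.

```python
# Rough sketch (not the project's code): restrict Sphinx to a short keyword
# list so it searches a tiny vocabulary instead of a full dictionary.
# Requires the SpeechRecognition and pocketsphinx packages.
import speech_recognition as sr

# Illustrative keyword list with per-keyword sensitivities between 0 and 1.
GAME_KEYWORDS = [("lion", 1.0), ("elephant", 1.0), ("zebra", 1.0), ("giraffe", 1.0)]

recognizer = sr.Recognizer()
with sr.AudioFile("utterance.wav") as source:   # 16 kHz mono WAV from the headset mic
    audio = recognizer.record(source)

try:
    hypothesis = recognizer.recognize_sphinx(audio, keyword_entries=GAME_KEYWORDS)
    print("Heard:", hypothesis)
except sr.UnknownValueError:
    print("No game keyword detected")
```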
3. Azure Deep Learning Model
Students also connected with the Boston-based Microsoft Azure team at the New England Research and Development (NERD) Center. With access to NERD's work on an Azure AI product, the Data Science Virtual Machine and its notebooks, fellowship students achieved a virtual tutor speech accuracy of 91.9%. Moreover, the average model execution time of NERD's and CMU's models is the same, at 0.5 seconds per input speech file. NERD also developed an additional prototype deep learning model based on the winning solution of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. This model can further increase classification accuracy and scales to large training datasets.
Machine learned features with Azure's deep learning model.
NERD's model accuracy (y-axis) plotted against the number of full passes over the training data, i.e., epochs (x-axis). The final accuracy estimate on the evaluation set is 91.9%.
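For readers curious what a DCASE-style classifier looks like, the sketch below shows the general shape of such a model: a small convolutional network over log-mel spectrogram patches, written in Keras. It is not NERD's prototype; the input shape and number of classes are assumptions.

```python
# Illustrative sketch of a DCASE-style audio classifier: a small CNN over
# log-mel spectrogram patches. Input shape (40 mel bands x 101 frames) and
# the 10 output classes are assumptions, not NERD's actual architecture.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 10            # assumed number of speech/sound classes
INPUT_SHAPE = (40, 101, 1)  # mel bands x time frames x channels (assumed)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", padding="same", input_shape=INPUT_SHAPE),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```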
Training Data Engineering
The lack of audio training data was an obstacle to realizing the full potential of the deep learning model, and more training data could also further improve CMU's speech recognition model.
1. Solving the training and testing data mismatch
We downloaded simulated speaker audio files from the public web and collected audio files from UC Berkeley volunteers at a sampling rate of 16 kHz. Initially, we observed that more training data did not increase test accuracy on the Oculus microphone. This turned out to be caused by a mismatch in sampling rate between the training data (16 kHz) and the Oculus microphone input (48 kHz). Once the microphone input was downsampled to match, the improved Sphinx model had better accuracy.
A visual representation (i.e., a spectrogram) of the frequency spectrum of a sound varying over time, comparing 16 kHz sampling (top) and 48 kHz (bottom). Note that the 48 kHz spectrogram on the bottom has better resolution.
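The fix itself is essentially a one-line resampling step; here is a minimal sketch, assuming librosa and soundfile are available and that the file names are placeholders.

```python
# Minimal sketch: downsample 48 kHz microphone captures to the 16 kHz rate
# of the training corpus before recognition. File names are placeholders.
import librosa
import soundfile as sf

TRAIN_RATE = 16_000  # sampling rate of the training data

audio, native_rate = librosa.load("oculus_capture.wav", sr=None, mono=True)  # e.g. 48 kHz
audio_16k = librosa.resample(audio, orig_sr=native_rate, target_sr=TRAIN_RATE)
sf.write("oculus_capture_16k.wav", audio_16k, TRAIN_RATE)
```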
2. Synthetic speaker audio
Data biases due to speaker gender and accent need to be balanced out by increasing the quality and quantity of training data. To address this, we imported synthesized male and female audio samples from text-to-speech services such as Bing Translator and trained our model on these synthesized samples combined with our existing data. However, we found that the synthesized audio lacked the zig-zag variations that occur naturally in human voices; it was “too clean” to accurately represent the natural human voice in a live setting.
Spectrogram comparison between synthesized voice (top) and human speaker's voice (bottom).
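The augmentation step can be reproduced with any text-to-speech engine; the sketch below uses the offline pyttsx3 engine as a stand-in for the Bing-synthesized voices described above, with illustrative phrases and file names.

```python
# Sketch of TTS-based data augmentation using pyttsx3 as a stand-in for the
# synthesized Bing voices described above. Phrases and file names are illustrative.
import pyttsx3

PHRASES = ["lion", "elephant", "zebra", "giraffe"]

engine = pyttsx3.init()
voices = engine.getProperty("voices")   # installed voices, usually male and female

for i, voice in enumerate(voices[:2]):  # generate samples with two different voices
    engine.setProperty("voice", voice.id)
    for phrase in PHRASES:
        engine.save_to_file(phrase, f"synth_voice{i}_{phrase}.wav")
engine.runAndWait()                     # writes all queued files to disk
```

As noted above, samples generated this way still tend to be too clean, so they are best combined with recorded human speech rather than used on their own.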
3. Combination of background noise and speaker signal
Another issue is that the Oculus microphone cannot automatically reject various background sounds, which interferes with the model's ability to separate background noise from the speaker's signal. To address this, we mixed each clean audio sample with multiple background sounds.
The y-axis is “Amplitude” on a normalized dB scale (where -1 is no signal and +1 is the strongest signal), representing loudness of the audio.
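A minimal sketch of this mixing step, assuming numpy and soundfile, with both clips mono and at the same sampling rate; the file names are placeholders.

```python
# Sketch of mixing a clean utterance with a background-noise clip.
# Assumes both files are mono and share the same sampling rate.
import numpy as np
import soundfile as sf

speech, rate = sf.read("clean_speech.wav")
noise, _ = sf.read("background_noise.wav")

noise = np.resize(noise, speech.shape)        # tile or trim noise to the utterance length
mixed = np.clip(speech + noise, -1.0, 1.0)    # keep the sum inside the normalized range
sf.write("speech_plus_noise.wav", mixed, rate)
```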
Mixing in background sounds gave us a much larger set of audio samples and allowed us to tailor the model to the virtual tutor's live environment. With the additional synthesized samples, we trained a more accurate model, as shown in the confusion matrices below. A confusion matrix shows the test examples where the model is confused, that is, where the predicted class (column) does not match the ground-truth class (row); correct predictions appear along the diagonal. Confusion matrices are a good way to visualize which classes need more targeted training data.
Confusion matrix before combining background noise and speaker signal as new training data for CMU's Sphinx model. The accuracy of the model is 93%.
Confusion matrix after combining background noise and speaker signal as new training data for CMU's Sphinx model. The accuracy of the model is 96%.
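Matrices like the ones above are straightforward to produce with scikit-learn; here is a small sketch with illustrative labels and predictions, not the project's evaluation data.

```python
# Sketch of computing and plotting a confusion matrix with scikit-learn.
# The class labels and predictions below are illustrative, not project data.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

CLASSES = ["lion", "elephant", "zebra", "giraffe"]

# In practice y_true/y_pred come from running the model on held-out test clips.
y_true = ["lion", "lion", "zebra", "elephant", "giraffe", "zebra"]
y_pred = ["lion", "zebra", "zebra", "elephant", "giraffe", "zebra"]

cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
ConfusionMatrixDisplay(cm, display_labels=CLASSES).plot(cmap="Blues")
plt.title("Confusion matrix on the test set")
plt.show()
```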
There were some problems with the synthesized noise, however. When we overlaid the clean signal and the noise without any level adjustment, we discovered some outliers in which the noise was more prominent than the speaker's signal.
4. Correction of signal-to-noise ratio
To compensate for this, we adjusted the relative decibel (dB) levels of the two audio files. Using the root mean square (RMS) to estimate the dB level of each file, we were able to suppress noise that would otherwise take priority over the speaker's voice during training and prediction. Through a series of tests, we determined that setting the average noise dB level to about 70% of the average clean-audio dB level worked best; this allowed us to maintain 95% accuracy when testing on our training and testing sets, while anything above 80% decreased accuracy at an increasing rate.
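Extending the earlier mixing sketch, the level adjustment can be sketched as follows, interpreting the 70% figure as the target ratio of noise RMS to clean-speech RMS; this is an illustration, not the project's exact implementation.

```python
# Sketch of the signal-to-noise correction: scale the noise so its RMS is
# roughly 70% of the clean-speech RMS before mixing. Illustrative only.
import numpy as np
import soundfile as sf

NOISE_RATIO = 0.7  # target noise level relative to the clean speech level

def rms(signal: np.ndarray) -> float:
    """Root mean square, a simple proxy for perceived loudness."""
    return float(np.sqrt(np.mean(np.square(signal))))

speech, rate = sf.read("clean_speech.wav")
noise, _ = sf.read("background_noise.wav")
noise = np.resize(noise, speech.shape)                # match lengths for mixing

scaled_noise = noise * (NOISE_RATIO * rms(speech) / rms(noise))
mixed = np.clip(speech + scaled_noise, -1.0, 1.0)     # stay within the normalized range
sf.write("speech_plus_scaled_noise.wav", mixed, rate)
```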
Waveform plot showing noise at 70% (top) and 100% (bottom). The y-axis is “Amplitude” on a normalized dB scale (where -1 is no signal and +1 is the strongest signal), representing loudness of the audio.
Spectrogram plot showing noise at 70% (top) and 100% (bottom). Note that there are more blue and pink areas in the 100% noise plot on the bottom.
Summary
The story of the UC Berkeley Virtual Tutor Project began in the fall of 2016. We first tested a variety of speech recognition technologies and then explored a range of training data engineering techniques. Currently, our speech recognition models are integrated with the game and are being tested in middle schools in California.
For those of you looking to add speech recognition capabilities to your projects, you should consider the following options based on our findings:
- For ease of integration and high accuracy, try the Bing Speech API, which allows 5,000 free transactions per month.
- For faster end-to-end response times and the ability to customize the model to improve accuracy in specific environments, try CMU's statistical model, Sphinx.
- For scenarios where you have access to a lot of training data (e.g., more than 100,000 training examples), Azure's deep learning models may be a better option for both speed and accuracy.
Chun Ming, Max, Luyi, Nick and James