Exclusive: Speech Recognition AI Learns Industry Vocabulary with aiOla's Novel Approach




Speech recognition is an important part of multimodal AI systems. Most enterprises are racing to implement the technology, but even with all the advances to date, many speech recognition models still fail to understand what a person is saying. Today, aiOla, an Israeli startup specializing in this field, took a big step toward solving this problem by announcing an approach that teaches these models to understand industry-specific words and phrases.

The development increases the accuracy and responsiveness of speech recognition systems, making them more suitable for complex enterprise settings – even in challenging acoustic environments. As an initial case study, the startup adapted OpenAI's popular Whisper model with its own technique, reducing its word error rate and improving overall detection accuracy.

However, the startup says the approach can work with any speech recognition model, including Meta's MMS model and proprietary models, unlocking the ability to improve even high-performing speech-to-text systems.

The jargon problem in speech recognition

Over the past few years, deep learning on hundreds of thousands of hours of audio has enabled the rise of high-performance automatic speech recognition (ASR) and transcription systems. OpenAI's Whisper, one such breakthrough model, made headlines in the field with its ability to match human-level robustness and accuracy in English speech recognition.




However, since its launch in 2022, many have noted that while Whisper approaches human-level listening, its recognition performance can drop when applied to audio from complex, real-world environments. Imagine safety warnings from workers over the constant noise of heavy machinery, activation commands from people in public spaces, or instructions laden with specialized terms such as those commonly used in medical or legal domains.

Most organizations using the latest ASR models (Whisper and others) have attempted to address this issue with training tailored to the unique needs of their industry. The approach works but can easily strain a company's financial and human resources.

“Fine-tuning ASR models takes days and thousands of dollars – and that's only if you already have the data. If you don't, it's a whole other ballgame. Collecting and labeling the audio data can take months and cost millions of dollars. For example, if you want to fine-tune your ASR model to identify 100 industry-specific terms, you would need thousands of audio examples in different settings, all of which would need to be transcribed manually. If you then want to add just one new keyword to your model, you have to retrain on new examples,” Gil Hetz, VP of research at aiOla, told VentureBeat.

To solve this, the startup took a two-step “contextual biasing” approach. First, the company's AdaKWS keyword-spotting model identifies domain-specific and personalized jargon (predefined in jargon lists) in a given speech sample. Then, the identified keywords are used to prompt the ASR decoder, guiding it to include them in the final transcribed text. This boosts the model's overall speech recognition capability, adapting it to correctly detect the jargon or terms in question.
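The two-step pipeline can be sketched in a few lines. This is an illustrative mock-up, not aiOla's implementation: AdaKWS is not publicly released, so the `detector` callable here is a stand-in for the real acoustic keyword spotter, and `build_bias_prompt` stands in for however the detected terms are fed to the decoder (e.g. via an initial-prompt mechanism).

```python
from typing import Callable, List

def spot_keywords(audio: bytes, jargon: List[str],
                  detector: Callable[[bytes, str], float],
                  threshold: float = 0.5) -> List[str]:
    """Step 1: keyword spotting. `detector` stands in for AdaKWS and
    returns a confidence that the keyword occurs in the audio."""
    return [kw for kw in jargon if detector(audio, kw) >= threshold]

def build_bias_prompt(keywords: List[str]) -> str:
    """Step 2: turn detected terms into decoder context, biasing the
    ASR model toward emitting them in the final transcript."""
    return "Vocabulary: " + ", ".join(keywords)

# Toy usage with a fake detector; a real one scores acoustic evidence.
fake_scores = {"thrombectomy": 0.9, "stent": 0.2}
detected = spot_keywords(b"...", list(fake_scores),
                         lambda audio, kw: fake_scores[kw])
prompt = build_bias_prompt(detected)  # passed to the decoder as context
```

Because only the jargon list changes per customer, swapping industries means updating that list, not retraining the model.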

In initial tests of keyword-based contextual biasing, aiOla used Whisper – the best-in-class model – and tested two techniques to improve its performance. The first, called KG-Whisper (Keyword-Guided Whisper), fine-tuned the entire set of decoder parameters, while the second, called KG-Whisper-PT (Prompt Tuning), used only about 15K trainable parameters, making it far more efficient. In both cases, the adapted models outperformed the original Whisper baselines on different datasets, even in challenging acoustic environments.
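The efficiency gap between the two techniques comes from what gets trained. A minimal sketch of the prompt-tuning idea, under assumed dimensions (768-dim embeddings, 20 prompt vectors – chosen only because 20 × 768 ≈ 15K matches the parameter count cited above; aiOla's actual configuration is not public):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_prompt = 768, 20  # assumed sizes, not aiOla's actual config
# The soft prompt is the ONLY trainable tensor; the decoder stays frozen.
soft_prompt = rng.normal(0.0, 0.02, size=(n_prompt, d_model))

def decoder_inputs(token_embeddings: np.ndarray) -> np.ndarray:
    """Prepend the learned soft prompt to the frozen decoder's inputs,
    steering generation without touching the decoder's own weights."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

tokens = np.zeros((10, d_model))  # embeddings of 10 transcript tokens
x = decoder_inputs(tokens)        # prompt + tokens enter the decoder
```

Full fine-tuning (the KG-Whisper variant) would instead update every decoder weight – hundreds of millions of parameters for a large Whisper model versus roughly 15K here.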

“Our new model (KG-Whisper-PT) significantly improves on Whisper's word error rate (WER) and overall accuracy (F1 score). When tested on the medical dataset highlighted in our research, it achieved a higher F1 score of 96.58 compared to Whisper's 80.50, and a lower word error rate of 6.15 compared to Whisper's 7.33,” Hetz said.
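For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, insertions, deletions) between the model's transcript and the reference, divided by the reference length. A small self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitute
    return d[len(ref)][len(hyp)] / len(ref)

# A missed jargon word counts as one substitution out of four words:
score = wer("the stent was placed", "the tent was placed")  # 0.25
```

A drop from 7.33 to 6.15 WER means roughly one fewer error per hundred reference words, which compounds quickly over long operational transcripts.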

Most importantly, the approach works across models. aiOla used it with Whisper, but enterprises can apply it to any other ASR model they have – from Meta's MMS to proprietary speech-to-text models – to enable a bespoke recognition system with zero retraining overhead. All they have to do is provide their industry-specific keyword list to the keyword spotter and update it periodically.

“The combination of these models provides full ASR capabilities that can accurately identify jargon. This allows us to quickly adapt to different industries without retraining the entire system. It is essentially a zero-shot model, capable of making predictions without seeing a specific example during training,” explained Hetz.

Time savings for Fortune 500 enterprises

With its adaptability, this approach can be useful for a variety of industries involving technical terms, from aviation, transportation and manufacturing to supply chain and logistics. aiOla, for its part, has already begun deploying its adapted models with Fortune 500 enterprises, increasing their efficiency in managing jargon-heavy processes.

“One of our customers, a Fortune 50 global shipping and logistics leader, required daily truck inspections prior to delivery. Previously, each inspection took about 15 minutes per vehicle. With an automated workflow powered by our new model, this time was reduced to less than 60 seconds per vehicle. Similarly, one of Canada's leading grocers used our models to inspect products and meat temperatures as required by the Department of Health. This resulted in time savings estimated to reach 110,000 hours annually, more than $2.5 million in projected savings, and a 5X ROI,” noted Hetz.

aiOla has published the research with the hope that other AI research teams will build on its work. However, as of now, the company is not providing API access to the adaptive model or releasing weights. The only way enterprises can use it is through the company's product suite, which operates on a subscription-based pricing structure.
