Robust speech recognition in AR

Technology News |
By Wisse Hettinga

‘This work opens new avenues towards robust speech-driven AR experiences, paving the way for enhanced communication across countless applications’ – from Google Research

Acoustic room simulations allow the training of robust sound separation models for speech recognition on AR Glasses with minimal amounts of real data.

As augmented reality (AR) technology becomes more capable and widely available, it has the potential to assist in a wide range of everyday situations. As we have shared previously, we are excited about the potential of AR and are continually developing and testing new technology and experiences. One of our research directions explores how speech models could transform communication for people. For example, in our previous Wearable Subtitles work, we augmented communication through all-day speech transcription, the potential of which was demonstrated in multiple user studies with people who are deaf and hard of hearing, and for communication across different languages. Such augmentation can be especially helpful in group conversations or noisy environments, where people may have difficulty distinguishing what others say. Accurate sound separation and speech recognition in a wearable form factor are therefore key to offering a reliable and valuable user experience.

While it has the potential to unlock many critical applications, speech recognition on wearables is challenging, especially in noisy and reverberant conditions. In this work, we quantify the effectiveness of using a room simulator to train a sound separation model that serves as a speech recognition front end. Using impulse responses (IRs) recorded on a prototype in different rooms, we demonstrate that simulated IRs improve speech recognition by (a) greatly increasing the number of available IRs, (b) leveraging microphone directivity, and (c) merging with a small number of measured IRs.
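The article does not include code, but the core idea of IR-based data augmentation can be illustrated with a minimal sketch. The snippet below synthesizes a crude room impulse response using a simple exponential-decay (Polack) statistical model and convolves it with a dry signal to create a reverberant training example. This is only an assumption-laden illustration: the researchers' room simulator models actual room geometry and microphone directivity, which this toy model does not. All function names here are hypothetical.

```python
import numpy as np

def simulate_rir(rt60=0.4, fs=16000, length_s=0.5, seed=0):
    """Synthesize a crude room impulse response as exponentially
    decaying white noise (Polack's statistical model). A geometric
    room simulator would additionally model room shape, surface
    materials, and microphone directivity; this captures only the
    overall reverberant decay."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    # Amplitude decay such that energy drops 60 dB after rt60 seconds:
    # 60 dB in energy = 10^-3 in amplitude, and ln(10^3) ≈ 6.908.
    decay = np.exp(-6.908 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))

def reverberate(clean, rir):
    """Convolve a dry signal with an impulse response to produce a
    reverberant version, e.g. as a training example for a sound
    separation model."""
    wet = np.convolve(clean, rir)[: len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

# Example: make a dry 440 Hz tone reverberant for data augmentation.
fs = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
wet = reverberate(clean, simulate_rir(rt60=0.4, fs=fs))
```

In a training pipeline, many such simulated IRs (varying room size, reverberation time, and source position) would be applied to clean speech, optionally blended with a small set of IRs measured on the actual device.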

Simulation is a powerful tool for developing speech recognition systems for wearables. Our key takeaways for practitioners are:

      • Realistic acoustics modeling can significantly reduce the amount of real-world data needed.
      • Supplementing even limited real-world data with simulations provides large gains in performance.

This work opens new avenues towards robust speech-driven AR experiences, paving the way for enhanced communication across countless applications.

