When it comes to creating a delightful voice experience, many challenges can stand in the way of a reliable interaction. Reverberation, ambient noise, and focusing the listening on the speaker have historically been close to insurmountable problems. Today, a handful of turnkey technology providers offer far-field digital signal processing solutions, and quite a few more are being introduced, driving down costs. Hardware makers are still left with the daunting task of deciding what types of microphones to implement, how many, how to place them, and a number of other factors that determine whether far-field voice interaction works well.
While many moving parts of voice interaction can make or break an experience, such as latency, the accuracy of the natural language processing, and the synthesized speech response, one of the most critical pieces is accurate speech recognition.
Far-field DSPs aim to improve the accuracy of speech recognition in three ways:
- De-reverb. Sound waves reflecting off objects and surfaces in a room can arrive at the microphone at different times, creating an echo, or reverberation. The farther a device is from the speaker, the more likely the microphones are to pick up an echo. Multi-microphone far-field solutions look for a match between the different sound signals and, knowing the distance between the microphones, can buffer, merge, or simply ignore echoed signals. These echoes tend to confuse speech recognition: in a spectrogram of reverberant speech, the image appears smudged, making it hard for systems to identify words and sounds. Some single-microphone DSP solutions attempt de-reverb through software algorithms rather than hardware processing.
- Voice Activity Detection and Automatic Gain Control. Essentially, this filter runs on the DSP and listens for something it can identify as a human voice. It then increases the gain (volume) of that audio signal and tries to ignore the rest of the noise in the environment. The result is a better signal-to-noise ratio, which makes it easier for speech recognizers to identify sounds and words. Actual implementations of these filters range from simple to extremely complex, which is one reason DSP code tends to run on specialized processors.
- Beamforming. Beamforming attempts to identify the direction from which a voice signal is arriving and then ignore all other signals. The result is both a reduction in echoed signals and an increase in signal-to-noise ratio. Typically, this algorithm requires at least two microphones and can be made more accurate with an array of microphones in different orientations. The narrowness of the beam is also determined by the number of microphones, with more microphones allowing finer granularity.
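As a toy illustration of the echo-matching idea behind de-reverb, the sketch below (plain NumPy, with white noise standing in for speech and a single simulated wall reflection) shows how correlating a signal against itself reveals the delay of an echo, the first step before a DSP can buffer, merge, or ignore it. The sample rate, delay, and echo strength are all illustrative values, not taken from any particular chip.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                        # sample rate in Hz
direct = rng.standard_normal(fs)  # 1 s of "speech" (white-noise stand-in)

# Simulate reverberation: the mic hears the direct path plus a
# quieter copy reflected off a wall, arriving 50 ms later.
echo_lag = int(0.050 * fs)        # 800 samples
mic = direct.copy()
mic[echo_lag:] += 0.4 * direct[:-echo_lag]

# The autocorrelation of the mic signal peaks at the echo lag, which
# is how a system can estimate (and then subtract or ignore) the
# reflection.
ac = np.correlate(mic, mic, mode="full")[len(mic) - 1:]
ac[:100] = 0                      # skip the trivial zero-lag region
estimated_lag = int(np.argmax(ac))
print(estimated_lag)              # ≈ 800 samples, i.e. the 50 ms echo
```

Real de-reverb is far more involved (rooms produce many overlapping reflections, not one clean echo), but the same matching principle underlies it.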
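The VAD-plus-gain-control idea can be sketched in a few lines. The example below is a deliberately crude, energy-threshold version, nothing like a production DSP filter; the `vad_agc` function, its threshold, and the target level are all made up for illustration.

```python
import numpy as np

def vad_agc(frames, threshold=0.02, target_rms=0.1):
    """Toy voice-activity detection with gain control.

    frames: iterable of 1-D numpy arrays (e.g. 20 ms audio frames).
    Frames whose RMS energy exceeds the threshold are treated as
    voice and scaled toward target_rms; everything else is muted.
    """
    out = []
    for frame in frames:
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > threshold:                   # crude "is this voice?" test
            out.append(frame * (target_rms / rms))  # boost quiet voice
        else:
            out.append(np.zeros_like(frame))  # suppress background noise
    return out

# A quiet "voice" frame (a 200 Hz tone) and a near-silent noise frame.
voice = 0.05 * np.sin(2 * np.pi * 200 * np.arange(320) / 16000)
noise = 0.001 * np.random.default_rng(1).standard_normal(320)
processed = vad_agc([voice, noise])
# The voice frame comes back boosted to the target level; the noise
# frame is zeroed out.
```

A real implementation would classify voice far more carefully (spectral shape, pitch, model-based detectors) and smooth the gain over time to avoid pumping artifacts, which is part of why this code ends up on specialized processors.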
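The simplest form of beamforming, delay-and-sum, can be sketched as follows. This assumes the steering delays are already known and fall on whole samples, which a real DSP must estimate and interpolate; the two-microphone simulation below just demonstrates the signal-to-noise gain from averaging aligned copies.

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Steer a beam by delaying each mic signal, then averaging.

    mics: list of equal-length 1-D arrays, one per microphone.
    delays: per-mic delay (whole samples) that aligns the look direction.
    Signals from the look direction add coherently; uncorrelated noise
    does not, so averaging improves the signal-to-noise ratio.
    """
    aligned = [np.roll(sig, -d) for sig, d in zip(mics, delays)]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(2)
fs = 16000
speech = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)  # target signal

# Two mics: the wavefront reaches mic 2 three samples later, and each
# mic picks up its own uncorrelated noise.
mic1 = speech + 0.5 * rng.standard_normal(fs)
mic2 = np.roll(speech, 3) + 0.5 * rng.standard_normal(fs)

beam = delay_and_sum([mic1, mic2], delays=[0, 3])
# Averaging two aligned copies halves the noise power (~3 dB SNR gain).
```

Each additional microphone in the array adds another aligned copy to the average, which is one intuition for why more microphones give a narrower, cleaner beam.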
There are three common types of far field microphone solutions: single microphones, linear arrays, and circular arrays.
- Single-microphone solutions rely heavily on software algorithms to process audio; while they are lower cost and require less hardware, they don’t tend to perform as well as their multi-microphone counterparts.
- Linear arrays arrange two or more microphones in a straight line.
- Circular arrays place three or more microphones along the circumference of a circle, sometimes with an additional microphone at the center.
Reliable performance can come from any of these if they’re used for the appropriate application and tuned and tested before being deployed.
The primary consideration in selecting the type and number of microphones for far field is the application. For example, Amazon provides guidance for Alexa Voice Service-enabled products and their certification. Push-to-talk devices used at arm’s length require only a single microphone. Hands-free devices used in close proximity, say a meter or two, can use two microphones. Far-field (hands-free) devices for the service may require more than two microphones.
For circular versus linear arrays, a typical consideration is the angle of arrival of the voice signal. Circular arrays have the benefit of allowing for 360-degree coverage for the angle of arrival of the speaker. For example, the Amazon Echo can be placed in the middle of the room and accept voice from all around it. The Echo Look, however, has more directional microphones as it expects the voice signal to come from in front of the camera. The Google Home, while it uses only two mics, places them on a plane so that its algorithm can interpret the direction of arrival.
For hardware makers, this means if their device is going to be mounted on a wall, a linear array of only two mics might be sufficient for them to do beamforming. However, if the device is going to be placed on a conference room table, then a circular array is going to allow for better functionality.
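As a back-of-the-envelope sketch of how two microphones on a plane yield a direction of arrival: for a plane wave, the angle off broadside follows from the time difference between the mics and their spacing. The function name and the numbers below are illustrative, not from any vendor's toolkit.

```python
import math

def angle_of_arrival(delay_s, mic_spacing_m, c=343.0):
    """Estimate the angle of arrival (degrees off broadside) of a
    plane wave hitting a two-microphone array.

    delay_s: time difference of arrival between the mics, in seconds.
    mic_spacing_m: distance between the two microphones, in meters.
    c: speed of sound in m/s.
    """
    ratio = c * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))   # clamp numerical overshoot
    return math.degrees(math.asin(ratio))

# With mics 7 cm apart, a wavefront arriving 0.1 ms later at the far
# mic comes in at roughly 29 degrees off broadside.
print(round(angle_of_arrival(1e-4, 0.07)))  # → 29
```

Note that a two-mic linear array cannot distinguish front from back (the geometry is symmetric), which is another reason a wall-mounted device gets away with two mics while a table-top device benefits from a circular array.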
When we were considering adding far-field capabilities to our original product, the Ubi voice-activated ubiquitous computing device, it was tempting to try to implement our own algorithms on low-cost DSP chips. Part of this was out of necessity (at the time, there weren’t readily available chips on the market) and part out of lack of experience. While there are thousands of academic papers on beamforming and many openly available algorithms, implementing them can be extremely difficult. We spent a small fortune on fruitless DSP projects that ultimately degraded speech recognition performance, as removing artifacts from the sound data sent to Google or Alexa can confuse those services.
Today, there’s much more choice in readily implementable DSP technology: Conexant, Cirrus Logic, Microsemi, Xmos, Intel, and others have chips on the market ready to implement, along with good sets of tools for tuning the DSP for the desired application. Likewise, at least a dozen other companies are looking to bring far-field technology to market over the next year.
This advance in the technology is driving down costs, both for the chips and for implementation. When we were looking at the technology four years ago, it would have cost us a multimillion-dollar NRE fee along with at least $10 in BOM cost. Now, the NRE fees might be in the tens of thousands of dollars and the BOM cost well under $5 for far field. Given this, it makes more sense to use off-the-shelf chips than to try to implement one’s own DSP algorithms.
The other opportunity with DSP chips is implementing low-power wake word engines, such as an “Alexa” trigger word. For battery-operated devices, this means they might be able to last for days without a charge in a standby mode, waiting to be woken up by a user before sending a signal to a main processor. Since wake word technology is typically licensed on a per-device basis regardless of the number of wake words, it also opens the door for companies to add offline voice interaction modes to the device, a wake word followed by vocal commands, providing even more options to the device maker.
Given the advances in readily available off-the-shelf DSPs for far-field voice recognition, which also include the ability to load local wake word engines, makers of consumer hardware looking to add voice would be wise to consider using these tools rather than trying to implement their own. Using these proven components reduces the risk of a costly in-house DSP development project.
Leor Grebler is CEO of Unified Computer Intelligence Corp. (Toronto, Ontario). UCIC helps companies add voice interaction to their hardware product. Its initial product – Ubi – The Ubiquitous Computer – is a voice-activated computing device that offers access to information and control of home automation devices and was the first product to offer natural environment-based voice interaction.