There are three common types of far-field microphone solutions: single microphones, linear arrays, and circular arrays.
- Single-microphone solutions rely heavily on algorithms to process audio. They cost less and require less hardware, but they tend not to perform as well as their multi-microphone counterparts.
- Linear arrays place two or more microphones in a straight line.
- Circular arrays place three or more microphones around the circumference of a circle, sometimes with an additional microphone at the center.
Reliable performance can come from any of these if they're used for the appropriate application and tuned and tested before being deployed.
The primary consideration in selecting the type and number of microphones for a far-field design is the application. For example, Amazon provides some guidance for Alexa Voice Service-enabled products and their certification. Push-to-talk devices used at arm's length require only a single microphone. Hands-free devices that might be used in close proximity, say within a meter or two, can use two microphones. Far-field (hands-free) devices for the service may require more than two microphones.
When choosing between circular and linear arrays, a typical consideration is the angle of arrival of the voice signal. Circular arrays offer 360-degree coverage, so the talker's voice can arrive from any direction. For example, the Amazon Echo can be placed in the middle of the room and accept voice from all around it. The Echo Look, however, has more directional microphones because it expects the voice signal to come from in front of the camera. The Google Home, while it uses only two mics, places them on a plane so that its algorithm can interpret the direction of arrival.
For hardware makers, this means if their device is going to be mounted on a wall, a linear array of only two mics might be sufficient for them to do beamforming. However, if the device is going to be placed on a conference room table, then a circular array is going to allow for better functionality.
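The difference between the two geometries can be illustrated with a little arithmetic. The sketch below is only an illustration under simple assumptions (a far-field plane wave, ideal microphones, and made-up spacing and radius values); the function names are hypothetical and do not come from any vendor toolkit. It computes the relative arrival time of a voice at each microphone, which is the quantity a delay-and-sum beamformer has to compensate for. A linear array sees identical delays for a talker in front of it and one mirrored behind it, while a circular array can tell the two apart.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def linear_array(num_mics, spacing):
    """Mic (x, y) positions in metres along the x-axis, centred on the origin."""
    offset = (num_mics - 1) * spacing / 2.0
    return [(i * spacing - offset, 0.0) for i in range(num_mics)]


def circular_array(num_mics, radius):
    """Mic (x, y) positions in metres, evenly spaced on a circle."""
    step = 2.0 * math.pi / num_mics
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(num_mics)]


def arrival_delays(mics, angle_deg):
    """Relative arrival time (s) at each mic for a plane wave from angle_deg."""
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector toward the talker
    # A mic located further toward the talker hears the wavefront earlier.
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mics]


def same_delays(a, b, tol=1e-9):
    """True if two delay patterns are indistinguishable within tol seconds."""
    return all(abs(p - q) < tol for p, q in zip(a, b))


# Hypothetical geometries: two mics 6 cm apart, and three mics on a 4 cm circle.
wall_unit = linear_array(2, 0.06)
table_unit = circular_array(3, 0.04)

# A talker at +60 degrees and one mirrored at -60 degrees.
linear_is_ambiguous = same_delays(arrival_delays(wall_unit, 60),
                                  arrival_delays(wall_unit, -60))
circular_is_ambiguous = same_delays(arrival_delays(table_unit, 60),
                                    arrival_delays(table_unit, -60))
print(linear_is_ambiguous, circular_is_ambiguous)
```

For a wall-mounted device the mirrored direction is inside the wall, so the ambiguity of the linear array costs nothing. On a table the talker may be on either side, which is where the circular layout earns its extra microphones.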
When we were considering adding far-field capabilities to our original product, the Ubi voice-activated ubiquitous computing device, it was tempting to try to implement our own algorithms on low-cost DSP chips. Part of this was out of necessity, since at that time there weren't readily available chips on the market, and part was out of lack of experience. While there are thousands of academic papers on beamforming and many openly available algorithms, implementing these can be extremely difficult. We spent a small fortune on fruitless DSP projects that ultimately degraded speech recognition performance, because removing artifacts from the sound data sent to Google or Alexa can confuse those services.
Today, there is far more choice of readily implementable DSP technology. Conexant, Cirrus Logic, Microsemi, XMOS, Intel, and others have chips on the market that are ready to implement, along with a good set of tools for tuning the DSP for the desired application. Likewise, there are at least a dozen other companies looking at putting out far-field technology over the next year.
This advance in the technology is driving down the cost, both of the chips and of implementation. When we were looking at the technology four years ago, it would have cost us a multimillion-dollar NRE fee along with at least $10 in BOM cost. Now, the NRE fees might be in the tens of thousands of dollars and the BOM cost well under $5 for far-field. With this knowledge, it seems to make more sense to use off-the-shelf chips than to try to implement one's own DSP algorithms.
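To see how much the NRE fee matters, it helps to spread it over a production run. The sketch below is purely illustrative: the production volume of 100,000 units and the specific dollar figures ($2 million and $50,000 for NRE, $10 and $5 for BOM) are assumptions chosen to sit within the ranges mentioned above, not actual quotes.

```python
def per_unit_cost(nre_fee, bom_cost, units):
    """Far-field cost per device: NRE amortized over the run, plus BOM."""
    return nre_fee / units + bom_cost


UNITS = 100_000  # hypothetical production volume

# Assumed figures for illustration only.
four_years_ago = per_unit_cost(nre_fee=2_000_000, bom_cost=10.0, units=UNITS)
today = per_unit_cost(nre_fee=50_000, bom_cost=5.0, units=UNITS)

print(f"then: ${four_years_ago:.2f} per unit, now: ${today:.2f} per unit")
```

Under these assumptions the amortized NRE, rather than the chip itself, accounted for most of the earlier per-unit cost, and at smaller volumes the gap would be wider still.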
The other opportunity with DSP chips is implementing low-power wake word engines, such as an "Alexa" trigger word. For battery-operated devices, this means that they might be able to last for days without a charge in standby mode, waiting to be woken by a user and then sending a signal to a main processor. Since wake word technology is typically licensed on a per-device basis regardless of the number of wake words, it also opens the door for companies to add offline voice interaction modes to the device (a wake word followed by vocal commands), providing even more options to the device maker.
Given the advances in readily available off-the-shelf DSPs for far-field voice recognition, which also include the ability to load local wake word engines, makers of consumer hardware who are looking to add voice would be wise to consider using these tools rather than trying to implement their own. Using these proven components reduces the risk of a costly in-house DSP development project.
Leor Grebler is CEO of Unified Computer Intelligence Corp. (Toronto, Ontario). UCIC helps companies add voice interaction to their hardware products. Its initial product, Ubi (The Ubiquitous Computer), is a voice-activated computing device that offers access to information and control of home automation devices and was the first product to offer natural environment-based voice interaction.