‘It is not only about what you say. It is also about how you say it.’ This age-old adage aptly sums up how important it is for human beings to communicate effectively with one another. Our natural inclination to connect through voice and sound points to a future where voice communication with machines is inevitable.
The increasing adoption of voice communication has been accelerated by the expansion of the Internet of Things (IoT) and artificial intelligence (AI). Integration of AI at the endpoint, combined with advances in voice analytics, is changing how products are built and consumed, giving rise to a new ecosystem of companies that participate in and enable these products. Intelligent endpoint solutions make it possible to implement both online and offline systems, reducing reliance on always-on internet/cloud connections. This, in turn, creates new opportunities to solve many challenges related to real-time voice analytics across consumer and industrial applications. Advances in psycholinguistic data analytics and affective computing make it possible to infer emotions, attitudes, and intent with data-driven voice modeling. As voice becomes a natural way for humans to interact with machines, voice recognition and voice analytics will keep improving at measuring intent.
Voice user interfaces (VUIs) allow the user to interact with endpoint systems through voice or speech commands. Despite mass deployments across a wide range of applications, VUIs have some limitations.
In this blog, Renesas Electronics addresses these challenges with state-of-the-art microcontrollers and partner-enabled intelligent voice processing algorithms, which make it easier for product manufacturers to integrate highly efficient voice commands. Renesas provides general-purpose MCUs that enable VUI integration without compromising performance or power consumption.
To make the experience compelling for the user, devices need to be equipped with several components to ensure robust voice recognition.
One of the most significant features of a voice-enabled device is its ability to identify speech commands from an audio input. The speech command recognition system on the device is activated by the wake word, which then takes the input, interprets it, and transcribes it to text. This text ultimately serves the purpose of the input or command to perform the specific task.
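The wake-word-gated flow described above can be sketched in a few lines. This is a minimal illustration, not a Renesas API: the wake word, the command table, and the pre-transcribed input string are all hypothetical stand-ins for a real wake-word engine and speech-to-text front end.

```python
# Illustrative wake-word-gated command pipeline (not a Renesas API).
# A real system would receive audio and transcribe it; here the input
# is already text, standing in for the transcription step.

WAKE_WORD = "hello device"  # hypothetical wake word

COMMANDS = {  # hypothetical command-to-action table
    "turn on the light": "LIGHT_ON",
    "turn off the light": "LIGHT_OFF",
}

def handle_utterance(transcript: str):
    """Return an action code if the utterance starts with the wake word
    and the remainder matches a known command; otherwise return None."""
    text = transcript.lower().strip()
    if not text.startswith(WAKE_WORD):
        return None  # no wake word: the recognizer stays idle
    command = text[len(WAKE_WORD):].strip()
    return COMMANDS.get(command)
```

The key design point is that nothing after the wake-word check runs for ordinary audio, which is what keeps the always-listening stage cheap.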
Voice activity detection (VAD) is the process of distinguishing human speech from background noise in an audio signal. VAD is also used to optimize overall system power consumption; otherwise, the system would need to be active all the time, resulting in unnecessary power consumption. The VAD algorithm can be subdivided into four stages (Figure 1):
Figure 1: The block diagram specifies the four stages of the VAD algorithm: noise minimization, segregation, classification, and response. (Source: Renesas Electronics)
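A toy energy-based detector can illustrate the four stages in Figure 1. The frame length, noise floor, and threshold ratio below are illustrative assumptions, not tuned values from the Renesas solution, and real VAD algorithms use far more sophisticated features than frame energy.

```python
# Toy energy-based VAD sketch mirroring the four stages of Figure 1.
# All parameters are illustrative assumptions.

def vad(samples, frame_len=160, noise_floor=0.01, ratio=3.0):
    """Return a per-frame list of booleans: True where speech is likely."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # 1) Noise minimization: subtract an assumed stationary noise floor
        #    from the frame's mean energy.
        energy = sum(s * s for s in frame) / frame_len
        denoised = max(energy - noise_floor, 0.0)
        # 2) Segregation and 3) classification: frames whose residual
        #    energy exceeds a multiple of the noise floor count as speech.
        is_speech = denoised > ratio * noise_floor
        # 4) Response: this decision gates the downstream recognizer,
        #    so the expensive stages only run on speech frames.
        decisions.append(is_speech)
    return decisions
```

Feeding the function a quiet segment followed by a loud one yields one False frame and one True frame, which is the gating signal the rest of the pipeline consumes.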
The Renesas RA voice command solution, built on the RA MCU family and partner-enabled voice recognition middleware, features a robust noise reduction technique that helps ensure high VAD accuracy. In addition, Renesas can help address some of the key voice command features outlined below:
A keyword spotting system (KWS) is one of the key features of any voice-enabled device. The KWS relies on speech recognition to identify keywords and phrases. These words trigger and initiate the recognition process at the endpoint, allowing the rest of the audio, which corresponds to the rest of the query, to be processed (Figure 2).
Figure 2: The keyword spotting process: speech recognition identifies keywords and phrases, which trigger the recognition process at the endpoint so the rest of the audio query can be processed. (Source: Renesas Electronics)
To contribute to a better hands-free user experience, the KWS is required to provide highly accurate real-time responses. This places an immense constraint on the KWS power budget. Therefore, Renesas provides partner-enabled, high-performance, optimized machine learning (ML) models capable of running on advanced 32-bit RA microcontrollers. They come with pre-trained deep neural network (DNN) models, which help achieve high accuracy when performing keyword spotting.
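One common way such a KWS turns per-frame DNN outputs into a detection is posterior smoothing: average the keyword's score over a short window and fire when the average crosses a threshold. The sketch below assumes the per-frame posteriors already exist (in practice they come from the DNN); the window and threshold are illustrative.

```python
# Sketch of KWS post-processing: per-frame keyword posteriors from a
# small DNN (here hand-made numbers) are smoothed with a moving average,
# and a detection fires when the smoothed score crosses a threshold.

def detect_keyword(posteriors, window=3, threshold=0.8):
    """Return the index of the first frame whose moving-average
    posterior exceeds the threshold, or -1 if the keyword never fires."""
    for i in range(window - 1, len(posteriors)):
        avg = sum(posteriors[i - window + 1:i + 1]) / window
        if avg > threshold:
            return i  # detection: hand the rest of the audio downstream
    return -1
```

Smoothing is what keeps a single noisy high-scoring frame from triggering a false wake, at the cost of a few frames of latency.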
Speaker identification, as the name suggests, is the process of identifying which registered speaker produced a given voice input (Figure 3). Speaker recognition can be classified as text-dependent, text-independent, or text-prompted. To train the DNN for speaker identification, individual idiosyncrasies such as dialect, pronunciation, prosody (the rhythmic patterns of speech), and phone usage are extracted.
Figure 3: Speaker identification system block diagram illustrates the process of training the DNN for speaker identification and individual speech idiosyncrasies. (Source: Renesas Electronics)
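A common identification step compares a test utterance against each enrolled speaker in an embedding space. In the sketch below the embeddings are tiny hand-made vectors; in a real system a DNN would produce them from the idiosyncrasies mentioned above. The names and vectors are purely illustrative.

```python
import math

# Sketch of speaker identification by cosine similarity of embeddings.
# Embeddings here are hand-made stand-ins for DNN-derived speaker vectors.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(test_embedding, enrolled):
    """Return the enrolled speaker whose embedding is closest to the
    test utterance's embedding."""
    return max(enrolled, key=lambda name: cosine(test_embedding, enrolled[name]))
```

A text-independent system would compute the test embedding from any utterance; a text-dependent one would first check that the expected phrase was spoken.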
Spoofing is a type of attack in which an intruder attempts to gain unauthorized access to a system by impersonating the target speaker. This can be countered by including anti-spoofing software to secure the system. Spoofing attacks are usually directed against automatic speaker verification (ASV) systems (Figure 4). Spoofed speech samples can be generated using speech synthesis, voice conversion, or simply replaying recorded speech. These attacks are classified as direct or indirect depending on how they interact with the ASV system.
Figure 4: Block representation of an automatic speaker verification. (Source: Renesas Electronics)
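One way a countermeasure fits into an ASV pipeline is as a second gate: accept only when both the speaker verification score and an anti-spoofing "genuine speech" score clear their thresholds. Both scores would come from separate models in practice; the numbers and thresholds below are illustrative assumptions.

```python
# Sketch of an ASV decision gated by an anti-spoofing countermeasure.
# Both scores would come from trained models; thresholds are illustrative.

def asv_decision(speaker_score, liveness_score,
                 speaker_thresh=0.7, liveness_thresh=0.5):
    """Accept only genuine (non-spoofed) speech from the claimed speaker."""
    if liveness_score < liveness_thresh:
        # Countermeasure flags replay, synthesis, or voice conversion.
        return "reject: suspected spoof"
    if speaker_score < speaker_thresh:
        return "reject: wrong speaker"
    return "accept"
```

Checking liveness first means a replayed recording of the genuine speaker is rejected even though its verification score would otherwise pass.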
Accent recognition in English-speaking countries is a much smoother process because of the abundance of training data, which leads to accurate predictions. The downside for organizations operating in countries where English is not the first language is less precise speech recognition, because only a limited amount of data is available. An inadequate amount of training data makes it challenging to build highly accurate conversational models.
To overcome the accent recognition issue, Renesas offers partner-enabled VUI solutions that support more than 44 languages, making them highly adaptable speech recognition solutions that can be used by organizations worldwide.
The Give Voice to Smart Products blog was originally published on www.renesas.com and is republished here with permission.
Renesas Electronics is a semiconductor company with an outstanding portfolio of global market-leading products. Renesas has the technology and capabilities to deliver almost everything required in an age focusing on human needs, including security technologies, miniaturization and power-saving technologies, networking, and interface technologies. Renesas aims to stay one step ahead and to be the true intelligent chip solution provider—the world leader in its field.