In the age of AI voice assistants, most systems rely heavily on the cloud for speech recognition, which raises concerns about data security, privacy, and the need for continuous internet connectivity. Offline, AI-powered voice assistants address these issues. They are especially important in automotive systems, where reliable voice control over in-car functions must remain available even in areas with spotty cellular coverage.
Background: AI-driven Offline Voice Assistant
Our setup consists of a Linux-based Raspberry Pi and an Android-based CAVLI CQS290 EVK. The voice assistant runs on the RPi; it handles the user's voice input and controls the infotainment system running on the EVK. The voice input is converted to text using the OpenAI Whisper ASR model. Using the transcribed text, the voice assistant controls features such as hotspot, airplane mode, Wi-Fi, Bluetooth, and lights.
The Problem Study
Automatic Speech Recognition (ASR) can make mistakes for several reasons. For instance, when we said "enable Wi-Fi", we often got "enable wife", "enable life", or just "Wi-Fi". As shown in the pseudocode below, we first attempted to solve these problems with complex, layered IF conditions:
```
IF response contains "wif" OR "wi-fi" OR "life":
    IF response contains "on" OR "enable":
        CALL set_wifi_state(True)
    ELSE IF response contains "off" OR "disable":
        CALL set_wifi_state(False)
    ELSE:
        PRINT "Unknown audio request: " + response
... continues ...
```
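A direct Python translation of this keyword-matching workaround might look like the sketch below. The `set_wifi_state` function here is a hypothetical stand-in for the real handler that talks to the EVK:

```python
# Sketch of the keyword-matching workaround.
# set_wifi_state is a hypothetical stand-in for the real EVK handler.
def set_wifi_state(enabled: bool) -> None:
    print(f"Wi-Fi {'enabled' if enabled else 'disabled'}")

def handle_response(response: str) -> str:
    text = response.lower()
    # Match common Whisper mis-hearings of "Wi-Fi": "wife", "life", ...
    if any(token in text for token in ("wif", "wi-fi", "life")):
        if "on" in text or "enable" in text:
            set_wifi_state(True)
            return "wifi_on"
        elif "off" in text or "disable" in text:
            set_wifi_state(False)
            return "wifi_off"
    return "unknown"
```

Every new mis-heard variant ("wife", "life", ...) forces another edit to the matching tokens, which is exactly why this approach does not scale.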
Although this is functional, it lacks elegance and scalability. Rather than being a solution, it is more of a workaround.
In the second iteration of our system, we also needed to support natural-language commands. The Wi-Fi should be activated if the user says any of the following:

- Turn on Wi-Fi
- Enable Wi-Fi
- Please turn on the Wi-Fi
- Wi-Fi on
- Wi-Fi enable, please!
To do this, we briefly considered using a full-fledged LLM-based agent with tool calling.
Thus, we were faced with the following issues:

- Incorrect identification of domain-specific terms.
- Stray words introduced by background noise.
- Uncertainties introduced by the human speaker.
- Commands phrased in natural language.
The first issue was solved by fine-tuning the model with domain-specific terms. (How we did this is discussed in detail in a separate article, Lessons from Fine-Tuning Whisper for Tamil Voice Commands.)
The Solution Study
To understand how Sentence Transformers work and how we used them to solve our problem, we must first understand what word embeddings are.
Embedding Words
In computers, a word is typically represented as a sequence of ASCII codes, one per character. This is useful for printing the word, reading it from user input, and so on, but it says nothing about the word's meaning. The ASCII representations of "dog" and "pup" are entirely distinct, with no hint that the two words are related in any way.
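A word embedding, by contrast, places related words near each other in a vector space. The 3-dimensional vectors below are invented purely for illustration (real embeddings have hundreds of learned dimensions), but they show how cosine similarity exposes the relatedness that ASCII codes cannot:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d embeddings, made up for illustration only.
embeddings = {
    "dog": np.array([0.90, 0.80, 0.10]),
    "pup": np.array([0.85, 0.75, 0.20]),
    "car": np.array([0.10, 0.20, 0.90]),
}

print(cosine_similarity(embeddings["dog"], embeddings["pup"]))  # close to 1.0
print(cosine_similarity(embeddings["dog"], embeddings["car"]))  # much lower
```

With vectors like these, "dog" and "pup" score near 1.0 while "dog" and "car" score much lower, which is precisely the notion of similarity the ASCII view lacks.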
For LLMs to process words, the words are represented as token embeddings, which are very similar to the word embeddings discussed above. Token embeddings typically have a dimension in the range of 2,048 to 12,288.
Sentence transformers
We now know the fundamentals of word embeddings; let's turn to sentence embeddings. Sentence embeddings are fixed-length vector representations of whole sentences. Because they capture the meaning of a sentence, they are especially helpful for understanding natural-language inputs, even when the input is phrased differently or contains minor pronunciation errors.
In this embedding space, sentences with similar meanings lie closer to one another. For instance, "On the wifi" and "Enable WAP" may be closer to each other than "Enable WAP" is to "Turn on the Bluetooth." The following diagram illustrates this concept with a two-dimensional sentence embedding.
The complete solution study is explored in a separate article on Sentence Transformers.