A couple of days ago, I got myself an Amazon Echo. The device does an impressive job of showing me how limited today’s voice interfaces are. That needs to change.
Today’s Voice Assistant
“I’m not sure.” “I don’t know that.” That sort of cluelessness is what I hear from Alexa, the voice assistant of the Amazon Echo, most of the time. This may be unsatisfying, but it’s basically just a problem of the available knowledge base. Add more substance to Alexa’s backend, and these kinds of answers will become much less common. So I won’t dwell on the lack of knowledge in this article.
I’ll just mention that the Google Assistant wins any competition in that regard. Siri has a similarly horrible error rate, even though it is the oldest technology of the three and should theoretically be the most advanced.
Voice Detection Works Perfectly Fine
The actual issue with Alexa and all other voice assistants is a different one. It’s not what used to be the biggest problem, voice detection itself. Thanks to fast cloud connections and sufficient computing power, detection is near perfect. Even telling apart different languages within a single sentence is no longer a problem for the technology.
The next big challenge limiting the success of voice technology in the form of conversational interfaces is the concept of conversation itself. Honestly, I have to admit that using my Echo does not really resemble a conversation.
The Interaction Resembles the MS-DOS of the Eighties
Alexa reminds me of early MS-DOS. As if I were typing on a command line, I coax one sentence after another out of the slim cylinder. I always have to start my commands with the word “Alexa”. The Echo doesn’t respond to anything else. Without “Alexa”, there’s not much going on. I’ll admit that this is different for some external skills, but the basic functionality is not smooth at all.
I also can’t talk the way I want to. Alexa needs a sentence to be phrased the way she expects. Otherwise, she simply won’t understand the command. The developers have prepared some alternative variants for typical input, but you still need to know them to get Alexa to reply. This is pretty nerdy, and very close to a toy. I guess there’s a reason why setting a timer is one of the most used Echo functions.
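To make the problem tangible, here is a minimal sketch of how matching against prepared phrasings could work. It is not Alexa's actual implementation; the names `SAMPLE_UTTERANCES` and `match_command` and the two timer phrasings are made up for illustration.

```python
import re

# Hypothetical intent with hand-written phrasings (illustration only,
# not Alexa's real data model).
SAMPLE_UTTERANCES = {
    "set_timer": [
        "set a timer for {n} minutes",
        "start a {n} minute timer",
    ],
}

def _to_regex(template):
    # Escape the literal parts and put a number group where "{n}" was.
    parts = [re.escape(part) for part in template.split("{n}")]
    return re.compile(r"(?P<n>\d+)".join(parts))

def match_command(utterance):
    """Return (intent, minutes) if a prepared phrasing matches, else None."""
    text = utterance.lower().strip().rstrip(".!?")
    for intent, templates in SAMPLE_UTTERANCES.items():
        for template in templates:
            found = _to_regex(template).fullmatch(text)
            if found:
                return intent, int(found.group("n"))
    return None  # the "I'm not sure" case
```

With only these two phrasings, "Set a timer for 5 minutes" is understood, while "Could you remind me in five minutes?" falls through. Anyone who doesn't know the prepared variants is stuck, which is exactly the command-line feeling described above.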
The Voice Assistant of Tomorrow
In this article on t3n, I presented conversational interfaces as the dialogue systems of the future. In this and this article at Noupe, I looked into storytelling as the most important design element of future generations. I estimated that the switch from purely visual to voice-oriented design would take about ten years. Looking at Alexa, I may want to correct that to 15 years. At least…
Let’s get back to the previously mentioned big issue in conversational design via voice control. Complex processes, such as the purchase of a product that can be configured with different options, cannot be handled with today’s command-style usage. Here, an actual conversation with the technology is needed to achieve any results. The voice assistant has to be able to drive conversion. At the very least, it shouldn’t be an obstacle to that goal. At best, it actively supports the process.
I have yet to encounter a voice assistant that keeps me motivated and engaged. Creating that assistant will be a lot of work for designers. It’s somewhat similar to designing a longer form. Both come with the danger of the user leaving at any time. The abandonment rates measured for common shopping cart flows prove this. It’s also important to build a connection between the user and the system. At the lowest level, this connection starts with the system not asking for data that it already knows.
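That last point, not asking for what is already known, can be sketched in a few lines. This is only an illustration under my own assumptions: the slot names, the prompts, and the function `next_question` are hypothetical and not taken from any real assistant.

```python
# Hypothetical order dialogue: the information a configurable product needs.
REQUIRED_SLOTS = ["product", "size", "color"]

PROMPTS = {
    "product": "What would you like to order?",
    "size": "Which size would you like?",
    "color": "Which color would you like?",
}

def next_question(known):
    """Return the prompt for the first missing slot, or None when complete.

    `known` holds everything the user has already said, so the assistant
    never asks for the same piece of information twice.
    """
    for slot in REQUIRED_SLOTS:
        if slot not in known:
            return PROMPTS[slot]
    return None
```

If the user opens with "I'd like a blue t-shirt", the dialogue would start with `known = {"product": "t-shirt", "color": "blue"}` and the assistant would only ask for the size. A long form cannot skip fields that easily; a conversation can.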
Context is King: The Voice Assistant Needs a Short-Term Memory (At Least)
Let’s imagine talking about Paul from the HR department with a colleague. After the first couple of sentences, we’ll only use “he” or “him” to talk about Paul, without ever pointing out that we’re still talking about Paul. Throughout the conversation, we build up a context that we assume is known in the sentences that follow. This way, we even understand subtle allusions. Ten minutes later, we’ve probably forgotten our conversation about Paul.
Transferred to the voice assistant, this means that it would need some kind of short-term memory in order to handle information and context productively for a limited time. With Siri and the Google Assistant, we can already see early stages of this, for example when sending a WhatsApp message. Here, the assistant guides us through the process.
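What such a short-term memory might look like, reduced to its bare bones: the sketch below remembers the last person mentioned and forgets them after a while. Everything in it is an assumption for illustration, including the class name `ShortTermMemory`, the tiny pronoun list, and the ten-minute lifetime borrowed from the Paul example.

```python
import time

PRONOUNS = {"he", "him", "she", "her", "they", "them"}

class ShortTermMemory:
    """Remembers the last mentioned person for a limited time (hypothetical)."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds      # ten minutes by default
        self._clock = clock          # injectable so the expiry can be tested
        self._entity = None
        self._stored_at = 0.0

    def remember(self, entity):
        self._entity = entity
        self._stored_at = self._clock()

    def resolve(self, word):
        """Replace a pronoun with the remembered person, if still fresh."""
        if word.lower() not in PRONOUNS:
            return word
        age = self._clock() - self._stored_at
        if self._entity is not None and age <= self._ttl:
            return self._entity
        return None  # context expired: the assistant has to ask again
```

After `remember("Paul")`, a follow-up like "send him a message" can be resolved to Paul without the user repeating the name. Once the lifetime has passed, `resolve("him")` returns `None`, and the assistant would have to ask who is meant, just like a colleague would.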
Simple, Determined, Trustworthy: The Perfect Voice Assistant
Speaking of guidance: this aspect is a major factor in dialogue design. We always talk about user guidance but mostly mean click paths that were placed more or less cleverly. With speech technology, we can cultivate real guidance. I can see a massive advantage in that.
Despite all that, speech interfaces still need to stay easy to control. In fact, voice interfaces only make sense in the long run if they always offer the easiest option. Otherwise, users will always look for the other, easier alternative. Over time, we’ll have to do away with wake-up commands. After all, starting every sentence with “Alexa,…” is not natural. Instead, the assistant would have to be able to tell from the context when it’s being addressed. I’m well aware of the data protection implications of this statement, and they would have to be taken care of.
Even after all the improvements mentioned, the voice assistant can only become a true companion if it adapts to us. To do so, it would have to learn and embrace our peculiarities. Otherwise, it will remain the synthetic information source that it is now. We rarely trust it, and are even more reserved than we are on a website. We’ve known the latter for almost a quarter of a century, so we simply know it better than the talking algorithm.
Can You Follow Me?
Last but not least, a major stumbling block will be the fact that speech does not equal text. A text is always formal and defined. Speech has accents, dialects, sociolects, and a wide variety of ways to articulate the same thing. Can an online shop afford to force potential customers to use formal speech when they want to place an order? “Sure”, you might say, but in the end, this potential customer is alone in front of the computer, and the decision whether to order or not is entirely theirs. I think that we have to adjust our technology to the existing peculiarities of humans, rather than forcing these humans to give up their peculiarities in favor of the technology.
As you can see, there’s still a very long way to go to the real voice interface. The voice assistants that you use today will barely resemble the interfaces of the future.