Should you ever be in any doubt over the truth of Arthur C. Clarke’s Third Law (the one that states that any sufficiently advanced technology is indistinguishable from magic), just try a HoloLens 2 experience where you can tell holograms what to do.
Anyone who manages to walk away from that without feeling a little bit like a wizard or witch casting a successful spell is likely leading a fairly jaded and joyless existence. There is something fundamentally satisfying, almost primal, about watching the world change as you command it to. Like lucid dreaming.
Yet even the naysayers can probably agree that voice-based technologies driven by artificial intelligence represent a huge market opportunity. Millions of us now own some sort of room-based smart home device such as Google Home or Amazon Echo, and the global voice and speech recognition market is estimated to reach nearly $32 billion by 2025. We are witnessing a convergence of various emerging technologies in that space – such as voice recognition, natural language processing, and machine learning – powered by 5G connectivity and the AR cloud.
Immersive experiences like those afforded by the HoloLens offer a tantalizing glimpse of where this is all headed: a screenless world where digital interfaces become a part of natural human interactions, creating an entirely new form of hybrid – or extended – reality. In fact, Gartner predicts that this year, 30 percent of web browsing will be done without a screen.
The next technology revolution will usher in the era of spatial computing, where multisensory experiences allow us to interact with both the real and digital worlds through natural, intuitive interfaces such as haptics, limb and eye tracking, and even elements such as taste and scent.
In this screenless world of spatial computing, interfaces will need to become more intuitive, efficient, and empathic. Let’s take a look at three ways in which voice technologies are already enabling this.
Spatial audio and AI-driven voice technologies are crucial elements for creating compelling immersive experiences. As Kai Havukainen, Head of Product at Nokia Technologies, explained in an interview for Scientific American, “Building a dynamic soundscape is essential for virtual experiences to really engender a sense of presence.” Humans, he added, are simply hardwired to pay attention to sound and instinctively use it to map their surroundings, find points of interest, and assess potential danger.
There are, however, design considerations that must be taken into account when tackling the challenges of an entirely new medium together with fast-evolving technologies.
Tim Stutts, Interaction Design Lead at Magic Leap, highlights the sheer complexity of these UX challenges: “A level of complexity is added with voice commands, as the notion of directionality becomes abstract: the cursor for voice is effectively the underlying AI used to determine the intent of a statement, then relate it back to objects, apps and system functions.”
“For voice experiences, you need to have a natural language interface that performs well enough to understand different accents, dialects, and languages,” adds Mark Asher, director of corporate strategy at Adobe, who believes the advancement of voice technologies will serve to “bring the humanity back to computing.”
There are still many hurdles to overcome before we reach that utopian vision of Star Trek’s universal translator, however. As we move towards more pervasive and complex experiences where users have multiple applications open at the same time, they will need to avoid problems such as unintentionally commanding a hologram when they are actually talking to the person next to them.
Yet given the exponential way AI technologies have developed over the past decade, it isn’t unreasonable to extrapolate that the next few years will usher in real-time contextual applications that accurately identify and act on commands based on assessments of your surroundings (both real and virtual), your personal preferences, and even your biofeedback.
Extended reality (XR) technologies already deploy a multitude of sensors that enable the collection of biofeedback, yet voice provides a rich vein of data that can be collected without the need for cumbersome wearables.
Apart from deliberately using commands to interact with the world around us, our voices provide the scope for AI to contextualize our XR experiences based on subconscious factors such as our mood and physical health. Cymatics – the name given to the process of visualizing soundwaves – gives us some insight into the depth and complexity of the unique patterns projected by our voice.
To produce speech, the brain communicates with the vagus nerve and sends a signal to the larynx, which vibrates out stored information through the vocal cords. Since vocalization is entirely integrated within both our central (CNS) and autonomic (ANS) nervous systems, there is an established correlation between voice output and the impact of stress.
Researchers have been developing methods for voice stress analysis (VSA), computerized stress detection, and body scanning devices for many years. Companies such as Insight Health Apps already leverage this rich data to feed corrective waveforms and patterns back into the body in the form of “quantum biofeedback.”
Bridging the Uncanny Valley
When I was first invited to test the social VR platform Sansar, I was shown around some of its virtual worlds by Linden Lab’s CEO Ebbe Altberg. To this day, my lasting impression of that demo is how natural our interaction felt, in spite of our being 5,000 miles and several time zones apart (I was in London and he was in San Francisco) and the fact that his avatar looked nothing like his real-world persona.