This article is the first in a two-part series exploring the rapidly developing state of voice computing from the investor's perspective. In Part I, I discuss why it's now increasingly clear that the human voice will become society's next major computing interface, and why it will continue proliferating through our homes and businesses. In Part II, I'll discuss the various players involved in the race to commercialize the technology -- those best-positioned to profit today, those positioning themselves to thrive tomorrow, and those dropping the ball entirely. I'll also dive deeper into the fascinating business models voice computing will soon enable, and how those models will evolve as the technology continues to improve.
For nearly 40 years, we've seen the computer evolve into an indispensable accessory of Western society, driven by three powerful technologies that defined their eras and played a monumental role for investors along the way.
The personal computer gave many of us our first glimpse at the digital world, normalizing the idea that such machines have a place in our homes and everyday lives. The internet tapped us into that world, offering us endless floods of information, entertainment, and commerce around which we willingly began shaping our civilization. And then the smartphone encouraged us to take that digital world everywhere we go, leaving nothing but a few taps on a touchscreen between us, our computers, and constant immersion.
Now, a fourth technology -- voice computing -- is removing that last barrier.
Reimagining the roles computers play
Imagine driving down the highway, coming home from work, and without taking your hands off the wheel or hitting any buttons, asking your car to turn up the heat in your house, preheat the oven for dinner, and play your favorite album, all in the same breath. Picture negotiating with a street vendor in Beijing, having your phone translate the entire conversation between Mandarin and English out loud, so you can haggle in real time. Or how about a stuffed animal that can tell your daughter interactive stories every night before bed, reshaping the adventure at every twist and turn based on her responses and reactions?
These aren't hypothetical scenarios -- they're happening right now.
The human voice is well on its way to becoming the next major medium we use to interact with computers, and as we'll explore in this two-part series, voice computing is the burgeoning field of technology making it possible. Already spanning multiple consumer and enterprise ventures (think call center routing, digital dictation, or the virtual assistants on smartphones), the industry's ultimate aim is enabling people to carry on hands-free, conversational-style interactions with computers.
It's an ambitious prospect, one capable of significantly expanding the roles computers play in our everyday lives. But it also appears inevitable when you consider the trajectory computers have taken since their introduction: without fail, they have evolved in shape and form around contemporary technology to solve unmet needs.
And yet voice computing's evolution is different. Unlike any form of computing to come before it, the technology bypasses the physical realm, allowing users to query the internet, manage their surroundings, and connect to third-party services without ever having to touch a physical device -- all in the time it takes to form a sentence.
If that sounds unnecessary, then consider just how limitless the applications could be. From the surgeon who needs to rapidly adjust the conditions of her operating room mid-procedure, to the disabled veteran who simply hopes to access the appliances in his home. The sheer number of everyday applications for such a technology has prompted Grand View Research to project the global market for voice recognition technology will reach $128 billion annually by 2024, an estimate that may undershoot the overall economic opportunity at play considering the potential verticals include healthcare, defense, entertainment, consumer goods, and more (I'll discuss these in more detail in Part II).
One thing seems certain, however: voice computing cannot transform the way we interact with technology until a large-scale societal shift toward adopting the interface occurs, and that shift cannot happen until several key hurdles are overcome.
Understanding the challenges ahead
To analyze those obstacles, let's turn to the most promising application of voice-driven technology today: virtual digital assistants.
Thanks to the success of the iPhone and Apple's (NASDAQ:AAPL) other devices, Siri is currently the most widely used virtual assistant in the United States, but until recently, it's also perfectly exemplified what's holding voice computing back. To put it simply, speaking with Siri can often feel more like a novelty than a time-saver, a sentiment shared not just among Apple users, but most smartphone owners with access to similar voice computing technology.
It's a problem we can break down into three fundamental hurdles.
The first is that users must often resort to awkward combinations of hyper-specific phrasing and hurried robo-speak to have their speech accurately recognized. Andrew Ng, chief scientist at Baidu (NASDAQ:BIDU) and one of the most respected names in the field, perhaps put it best in 2015 when he noted, "Speech recognition, depending on the circumstances, is say 95 per cent accurate. So maybe it gets one word in 20 wrong. That's really annoying if it gets one in 20 wrong and you probably don't want to use it very often. That's probably where speech recognition is today."
The second hurdle is that even when our speech is correctly recognized, there remains a strong chance it will be misinterpreted during natural language understanding, a process in which the assistant attempts to comprehend what we want using the sequence of words (and eventually, the tone, pace, and inflection) we give it. And even if that speech is correctly understood, an assistant may still simply be unable to fulfill the request due to technical constraints. Examples of both cases can be seen in the screenshot below, along with a sense of how frustrating the encounter can be.
Encounters like these make us feel as though we're wasting our time trusting our phones to do what we could have done ourselves with only a little more effort upfront. Over time, they build up, eventually discouraging us from even trying to use our assistants for anything complex, instead relying on them only for the most trivial of tasks, like setting alarms or creating reminders.
Thankfully, these first two hurdles are solvable with data, third-party partnerships, and time, each of which the major players in this space have been actively accruing for years. Returning to our example, since Siri was introduced to the world in 2011, virtual assistants have grown far more capable of correctly interpreting requests thanks to advances in deep learning, a thriving field of AI being utilized by Alphabet (NASDAQ:GOOG) (NASDAQ:GOOGL), Amazon (NASDAQ:AMZN), Baidu, and others to improve speech recognition, natural language understanding, and other components of voice computing at astounding rates. (The New York Times published an outstanding story on Google's significant role in making this happen.)
Equally important, decisions by Apple and its competitors to open their platforms up to third-party services have made today's voice-based assistants feel far more capable than in years past. These decisions allow Siri, for example, to call you a ride using your Uber account, or order a pizza through your stored Domino's information -- not unlike an actual secretary would do.
As the potential behind voice computing becomes increasingly obvious in the coming months and years, we should expect more and more third parties to hop on board, eventually expanding the scope of what a partnership could entail (instead of ordering an Uber, imagine ordering a plane ticket). One look at the rate of growth for third-party commands available on Amazon's voice-based assistant, Alexa, can tell us a lot.
With time, these technical achievements and partnerships will continue compounding, eventually making for a highly practical voice computing experience. Even marginal improvements in the technology will convince users to begin using voice as an interface more frequently. In the same talk regarding the current state of the voice industry, Ng went on to say, "I think that as speech recognition accuracy goes from say 95 per cent to 98, 99 to 99.9, all of us in the room will go from barely using it today, to using it all the time." He added, "Most people underestimate the difference between 95 and 99 per cent ... 99 per cent is a game changer."
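A little arithmetic makes Ng's point concrete. In a toy model where each word in a sentence is recognized independently (an illustrative assumption; real recognizers' errors are not strictly independent), the chance of getting an entire 20-word sentence right rises dramatically as per-word accuracy climbs:

```python
# Toy model: probability of transcribing an n-word sentence with zero errors,
# assuming (for illustration only) each word is recognized independently.
def clean_sentence_prob(word_accuracy: float, n_words: int = 20) -> float:
    return word_accuracy ** n_words

for acc in (0.95, 0.98, 0.99, 0.999):
    print(f"{acc:.1%} per-word accuracy -> "
          f"{clean_sentence_prob(acc):.0%} of 20-word sentences error-free")
```

Under this sketch, moving from 95% to 99% per-word accuracy takes error-free 20-word sentences from roughly one in three to more than four in five, and 99.9% pushes it near 98% -- exactly the difference between a novelty and a tool you reach for all the time.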
Although we're not yet able to predict when the technology will reach that point, if the recent pace of progress is any indication -- since Ng's speech, Baidu has built a program capable of dictating English three times faster, and with 20% fewer errors, than human beings -- it won't be much longer.
In the meantime, an arguably far more important hurdle stands in the way. It's a problem shared by every other major computing medium in use today, only voice computing is actually solving it.
An instantaneous computing experience
No matter the medium, we've always interacted with computers by issuing commands. In the 1950s, we meticulously punched those commands into cards before feeding them into hulking machines. Now, technology allows us to simply type, click, and, increasingly, speak them and expect a result.
Yet despite these advancements, today's mediums still subject users to a unique set of delays, a sort of friction that inherently slows down the process between thinking a command and actually being able to communicate that command to your devices. A decade ago, that friction most likely entailed tracking down a PC with internet, using the mouse to open a browser, using the keyboard to run a search, and then sorting for the result. But today, for most Americans, it simply means reaching into our pockets for our phone, scrolling with our thumb, and tapping away.
It's a process so effortless, the average American now checks their phone an estimated 46 times per day. It's even become fashionable to argue there's no longer enough friction separating society from our devices; that we're too plugged in to the digital world.
So when you consider that voice computing's primary advantage is eliminating friction, it becomes tempting to dismiss the technology as a solution to a trivial problem. What that argument critically fails to consider, however, is that friction is about more than simple convenience, and that any amount of friction, no matter how small, fundamentally limits the dynamic between a user and their computing medium in two ways.
Let's take the smartphone for example.
The seemingly trivial motions we make when reaching for our phone still require us to be in immediate proximity of the device. We have to walk over to the kitchen counter if we left our phone there; we have to remember it's on the counter before we can even do that. Then, once we've reached our phone, we're required to physically handle it, a process that generally requires that we have one hand free and able.
That may not sound like much, but these limitations force our computing experience to revolve around the location of our device, as opposed to it revolving around us. They can create roadblocks for those with disabilities affecting their vision or hands (or really, even those who just have sticky hands, wet hands, or their hands full), those far away from their device, and most importantly, those with an immediate computing need.
There's a reason speech is our most instinctual form of communication.
If we have a question, and if we know those around us can answer that question, we don't pull out our phone and run a search, we open our mouths and ask. We do that because it's instantaneous. Now contrast that with having to locate your phone, having to type in your password, having to navigate to an app, having to tap, type, or speak your command. These behaviors are not instinctual, but we put up with them because the people around us usually don't have the answers, and because smartphones are our best alternative.
But they don't have to be.
The rise of voice
It's not obvious quite yet, but today's physically bound mediums have left a large hole to fill, and voice computing is starting to fill it.
Over the past two years, the technology has begun manifesting itself in the form of always-on, always-listening devices able to virtually bypass the friction inherent in today's major computing mediums. Currently, these devices are best exemplified by Amazon's line of Echo smart speakers, which employ far-field voice recognition technology allowing users to command Alexa from across multiple rooms and through various background noises at any time simply by calling for it out loud.
Put another way, for as little as $50, Amazon's Echo is giving consumers their first glimpse at a frictionless computing experience. Following its initial setup, there's no pressing buttons, no looking around for the device, and no navigating through screens to issue a command. When it works correctly (like Siri, Alexa is still very much in its infancy and subject to misinterpreting commands), it really does allow users to convert their thoughts to a command in the time it takes to form a sentence, capturing the essential promise of voice as a medium.
And early demand for a seamless computing experience appears strong. Amazon hasn't officially released sales figures, but as of January 2017, Consumer Intelligence Research Partners (CIRP) estimated a staggering 8.2 million Echos had already been sold in the U.S. (and Morgan Stanley's estimate runs even higher, at over 11 million), suggesting the device has achieved well over 5% household penetration. That demand will likely grow stronger as capable competitors race to introduce their own offerings (see Google's Home, Nvidia's (NASDAQ:NVDA) Spot, or Baidu's Little Fish), and as voice recognition, natural language understanding, and third-party support continue improving at a rapid pace.
To be clear, these devices will never replace today's existing computing mediums, but will rather augment them. While screens will remain the preferred medium for our more detailed and visual computing needs, voice will begin its ascent by taking on an increasing share of our more "conversational" demands, common requests we want addressed with the immediacy we expect from dialogue ("Cancel my doctor's appointment." "Pause that music!" "What's a good non-fiction book to read on the beach? ... Great, order that one.")
As the technology grows capable of handling increasingly nuanced and complex requests, and as voice-enabled devices proliferate and evolve, we can expect more users to grow comfortable engaging with the technology around the home, in the workplace, and, gradually, in ever more pervasive corners of society.
And that's where the real opportunities begin.
The opportunity at hand
Some of the biggest tech empires in modern history have been built around the personal computer, internet, and smartphone, each of which has provided the infrastructure for technology to penetrate deeper and more reflexively into everyday life, allowing creative companies to reach consumers in entirely new ways.
In just 10 years, the iPhone helped Apple grow to become the largest corporation in the world, but it also popularized smartphone technology in the U.S., helping to create the infrastructure that in turn enabled Facebook (NASDAQ:FB), Microsoft (NASDAQ:MSFT), Alphabet, Uber, Snap, Amazon, Shopify (NYSE:SHOP), and countless others to create businesses that reach deeper into our lives than any before.
And yet voice technology is going deeper.
It's becoming an invisible, always-listening medium that surrounds us, weaving itself more seamlessly into the fabric of our lives than any before it. It's called on using our most instinctual, reflexive form of communication, and it's accessible in the time it takes to form words. It's a technology that provides tomorrow's innovators a far more intimate and pervasive way to help us satisfy our digital wants.
It's a startling but increasingly inevitable reality for both investors and society.
In Part II: The Playing Field, I'll discuss the various players involved in the race to commercialize voice computing. To be the first to know when Part II is released, follow me on Twitter @asgariaj.
Suzanne Frey, an executive at Alphabet, is a member of The Motley Fool's board of directors. Armun Asgari owns shares of Amazon, Apple, Facebook, and Nvidia. The Motley Fool owns shares of and recommends Alphabet (C shares), Amazon, Apple, Baidu, Facebook, Nvidia, and Shopify. The Motley Fool has the following options: long January 2018 $90 calls on Apple and short January 2018 $95 calls on Apple. The Motley Fool has a disclosure policy.