Things are about to get really interesting.

Artificial intelligence has already gone mainstream. Companies are using NVIDIA's (NVDA -2.55%) graphics processing units, Xilinx's (XLNX) field programmable gate arrays, or their own customized chips to train machine-learning models to recognize a variety of inputs. This method of training neural networks is the technical reason Tesla's autonomous cars can recognize stop signs and Facebook's social network can recognize faces.

But we're entering an exciting and strange new era, one based on a process called machine-learning inference. Unlike training, inference involves computers applying everything we've taught them to produce something entirely new.
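To make that split concrete, here's a minimal sketch in PyTorch (my choice of framework here, not one named by the companies above): training adjusts a model's weights from labeled examples, while inference just runs the finished model forward on new data.

```python
# A toy model and toy data, purely to illustrate training vs. inference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Training: show the model labeled examples and adjust its weights.
x = torch.randn(32, 8)            # fake inputs
y = torch.randint(0, 2, (32,))    # fake labels
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()               # gradients flow backward through the net
    optimizer.step()              # weights are nudged to reduce the loss

# Inference: weights are frozen; the model only runs forward on new data.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 8)).argmax(dim=1)
```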


Artificial intelligence is gaining traction globally. Image source: Getty Images.

Take this site as an example. It shows high-resolution pictures of normal-looking people, who could easily be your coworkers or live next door.

But the catch is, none of these people actually exist. Each of the pictures is fake, artificially created by a generative adversarial network that has been trained on what eyes, noses, and hair tend to look like. AI has created something on its own, based on everything we've taught it.
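For a sense of how those fake faces are produced, here's a heavily simplified sketch of a GAN generator at inference time, assuming PyTorch. The real site relies on a large, trained generator (along the lines of StyleGAN); this toy version is untrained and only shows the shape of the computation: random noise in, a brand-new image out.

```python
# A toy stand-in for a trained GAN generator: noise vector in, image out.
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(128, 1024), nn.ReLU(),
    nn.Linear(1024, 3 * 64 * 64), nn.Tanh(),   # a 64x64 RGB image in [-1, 1]
)

z = torch.randn(1, 128)                # random "latent" noise
with torch.no_grad():                  # inference only; no gradients needed
    fake_image = generator(z).reshape(1, 3, 64, 64)
# fake_image is a picture that corresponds to no real person.
```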

This same concept of machine-learning inference is what allows Google's (GOOGL 0.49%) Duplex to make appointments for you, or Amazon's (AMZN -0.41%) Alexa to proactively make product recommendations for you.

Behind the scenes, inference takes a ton of computing horsepower to actually work. Amazon Web Services estimates that inference can account for up to 90% of the computing costs required for any given application.

In other words, we need to innovate in computing hardware. We can't just run everything on Intel's central processing units (CPUs) anymore, at least not without running up a huge electricity bill for all the power consumed. The race is on to design new chips and software ecosystems that can run inference as efficiently as possible.

The Path Forward

To help us see what the future holds, I recently spoke with IBM (IBM -3.38%) Fellow and chief agitator John Cohn. John has one of the most innovative computing minds on the planet, with more than 116 patents and 36 technical papers to his name after nearly 40 years at one of the world's largest companies.

In our conversation at Austin's South by Southwest conference, John discusses why and how artificial intelligence became so popular, and the growing role of AI accelerators. He also explains why he is a fan of using field programmable gate arrays (FPGAs) for innovation but prefers custom silicon chips for higher-volume commercial applications.

Our conversation is captured in the following video, with a full transcript also included below.

Transcript

IBM Fellow John Cohn: Well let me just say that I'm a big fan of hardware. I came from that. And it's very interesting.

We were talking before about how the cloud was about to take over the world. Well, like many things, the actual truth is somewhere in between. There's going to be a rebalancing between local hardware and cloud hardware. On both of those sides, there's going to be a lot of advances in technology. In silicon technology, it's kind of like after Moore's [Law], things were starting to level out.

Well, there's a lot more work in architecture around acceleration -- things like GPUs and TPUs. We just announced a billion-dollar investment in Albany in a group that's actually looking at technology approaches to AI.

Motley Fool Explorer Lead Advisor Simon Erickson: An accelerator, you're saying -- executing the code more efficiently and quickly?

John Cohn: When you say code, you look at a structure like a GPU. GPUs work for AI because in a neural net -- let's say in deep learning -- you're just doing a whole lot of linear algebra. You're doing a whole bunch of multipliers. That's basically it. The same thing that makes graphics really smooth for a game actually is just doing a lot of matrix multiplies.
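To see what he means, here's a minimal NumPy sketch of one dense layer; the sizes are made up. The expensive step is a single matrix multiply, the same primitive a GPU already churns through when rendering a game.

```python
import numpy as np

batch = np.random.randn(64, 512)      # 64 inputs, 512 features each
weights = np.random.randn(512, 256)   # one layer's weight matrix
bias = np.random.randn(256)

# The heart of the layer is one matrix multiply plus a cheap nonlinearity.
activations = np.maximum(batch @ weights + bias, 0.0)   # ReLU(x W + b)
```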

So in 2012 when somebody said, "Hey, let's try using commercial GPUs," well, it was a combination of using things like CUDA but then building layers on top of it. Whether it was PyTorch, TensorFlow, Octave, whatever, to free you up from the gorp of actually writing the CUDA code yourself. That's when AI and deep learning started to take off.
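A hedged sketch of that layering: in PyTorch, the same multiply is dispatched to CUDA kernels the framework supplies, so you never write CUDA yourself. (It assumes PyTorch is installed and falls back to the CPU if no NVIDIA GPU is present.)

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b   # runs on a vendor-tuned GPU kernel when device == "cuda"
```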

Those accelerators, that's great, except you're limited in the amount of computation you can do in a certain amount of time and within a certain power budget. Because ultimately, you've got to fit it all in the same box. It's getting limited, because GPUs were not designed to do that. Now a lot of companies, including us, are working on more special-purpose accelerators, so-called TPUs.

But we're even looking beyond that. I'm about to install a hardware cluster at MIT that's about 112 kilowatts. It's a lot of power. Our human brain is about 20 watts when you're sitting there. There's a lot of room to improve.
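Putting those two figures side by side makes his point plain:

```python
# Back-of-the-envelope comparison using the numbers Cohn cites.
cluster_watts = 112_000   # the ~112 kW cluster headed to MIT
brain_watts = 20          # a human brain at rest
print(f"The cluster draws roughly {cluster_watts // brain_watts:,}x "
      f"the power of a brain")   # ~5,600x
```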

This place in Albany is looking at all sorts of acceleration technology. We're looking at analog technologies. We're looking at phase change memory and MRAM, to be able to do analog computing on these models of neural nets.

Not necessarily to run code, linear code, like you would do on a GPU, but to actually do the calculation that a neuron model would do in analog. With less precision, but a whole lot less power. And a whole lot less power -- because of parallelism -- can be traded into a whole lot more performance or a whole lot bigger model.

Now, you asked about FPGAs. They're different strokes for different folks, right? You've got to figure out what you're trying to do.

I'm a huge believer in FPGAs as a technology for doing innovation. One of the key things that you need to do -- in that notion of being able to play with something -- is to be able to make very quick turns of innovation. You need to try something, run real workloads on it, a representative set, and then make some changes.

Simon Erickson: What do you need to make the changes to, though?

John Cohn: To the actual architecture. So if you're really trying to optimize power performance -- which is a box -- "How much performance can I get within a certain power budget?" That's basically what it's all about. That is tuning. For many years, we just tuned the software and the hardware was what it was. Well we can't really afford to do that now, when the next turn of the crank doesn't give us more performance on hardware. As you said, right?

So what we have to do is we have to be able to really co-optimize the software layers and the hardware layers a lot more. Almost like the early days of hardware, like the days when you could count your memory bits. You had to really, really care where every picowatt was going.

When you're trying to do that, you create a computation structure like, "Well, do I do that in 64 bits, or do I do it in 32 bits, or do I even do it in eight bits?" Certain calculations in image recognition are actually far faster and far more power efficient at lower precision, with the same accuracy. Go figure, right?
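That trade-off can be sketched in a few lines of NumPy, using illustrative sizes and a simple symmetric quantization scheme rather than any method IBM uses: store a layer's weights as 8-bit integers and compare its output against the full-precision version.

```python
import numpy as np

weights_fp32 = np.random.randn(512, 256).astype(np.float32)
x = np.random.randn(1, 512).astype(np.float32)

# Symmetric int8 quantization: one scale factor for the whole matrix.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)   # 4x less memory

y_fp32 = x @ weights_fp32
y_int8 = x @ (weights_int8.astype(np.float32) * scale)          # dequantize to compare

error = np.abs(y_fp32 - y_int8).max() / np.abs(y_fp32).max()
print(f"int8 weights use a quarter of the memory; max relative error ~ {error:.3%}")
```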

Well, the only way to create a hardware acceleration of that is to be able to radically change the architecture of the accelerator. To do that as a chip cycle could take many millions of dollars and three to six months of making a new chip. Well, you can't really afford to do that.

Simon Erickson: Yeah.

John Cohn: FPGAs are a rapid prototyping technology. I can get near-custom hardware performance, but in a day I can make a change.

As a deployment technology -- like if you were going to go make a deep learning thing -- it's a kind of diminishing return. At some point, you spend so much more money and you really get hit for cost, density, and power that if you've got any sort of volume at all, once you've tuned it in, it makes sense to do a chip. If you have a very small niche, something that you don't need very many of, then the complexities -- the cost and risk complexities of actually building a custom chip -- may not be a good idea. If it's very special purpose, "I'm just recognizing one specific sort of image and I need to accelerate it because I'm doing something real time," then an FPGA might make sense. But if you have any sort of volume, I personally think you need to go to [a custom chip].
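As a back-of-the-envelope for that volume argument, here's a toy break-even calculation. Every dollar figure is an illustrative assumption, not an IBM or industry number; the point is only that a chip's one-time design cost gets amortized once volume is high enough.

```python
# Hypothetical costs, chosen only to show the shape of the trade-off.
asic_design_cost = 5_000_000   # one-time cost to design and tape out a chip
asic_unit_cost = 20            # per-chip cost at volume
fpga_unit_cost = 500           # per-unit cost of an FPGA-based solution

break_even_units = asic_design_cost / (fpga_unit_cost - asic_unit_cost)
print(f"Custom silicon wins above roughly {break_even_units:,.0f} units")
```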

What's interesting, too, is that there are hybrids between a standard FPGA, which can emulate any sort of logic, and field-programmable combinations of higher functions. You'll see things that are actually higher-level composable units that you can customize, somewhat, but that don't have the overhead of making the individual logic units in FPGAs.

Simon Erickson: Best of both worlds.

John Cohn: Yeah, and that's kind of a balance. Ultimately, look at something like bitcoin mining. You eventually had to go -- and I'm not a big fan of bitcoin mining -- but you eventually have to go to special-purpose hardware to stay ahead competitively.

Simon Erickson: The thing that I'm trying to answer, the fundamental question I have, is that it seems like all of the cloud companies are now using, or starting to use, FPGAs, right? Machine-learning inference as a service. Why are they using FPGAs?

John Cohn: For flexibility. You can customize the logic to a workload. I personally believe, as a hardware guy, that that will change. We're just at a new cusp where you need that flexibility. The workloads that are happening for inferencing are kind of nuanced. When you actually end up in a world where you're doing something like a GAN, a generative adversarial network, what we would call inferencing actually has a fair bit of forward calculation, of computation, in it. You need acceleration on the way out. You're not just doing a simple pass forward.
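To gesture at why the "way out" still needs acceleration, here's a rough count of the multiply-accumulate operations in one forward pass of a small, made-up fully connected generator. Real generators are convolutional and far larger, so the true number is higher still.

```python
# Hypothetical layer sizes for a toy fully connected generator.
layer_shapes = [(128, 1024), (1024, 4096), (4096, 3 * 256 * 256)]

macs = sum(n_in * n_out for n_in, n_out in layer_shapes)
print(f"~{macs / 1e9:.1f} billion multiply-accumulates per generated image")
```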

Those kinds of things are new. We don't know what that hardware should look like. I personally believe that it'll eventually get to a point where we'll be able to choose from a couple of classes. It'll eventually be sort of a combination of bigger components and eventually customized silicon. But I'm a silicon guy!