Inside Intel Again: Cold Hard Cache

Remember how your mom always said what's important is on the inside? Well, Rob Landley continues his journey inside Intel.

I'm spending all this week describing the evolution of microprocessor design optimization strategies. Because I want to, that's why. A couple of clarifications on yesterday's article before we jump into the new stuff:

First, the opening quote I used should be attributed to Groucho Marx. Dave Awerbuch pointed this out to me, and I thank him.

Second, Intel (Nasdaq: INTC) didn't necessarily invent the optimization techniques I'm describing here. As I've said before, Intel's strength rests more in manufacturing than design. The design techniques I'm describing here came from laboratories and universities around the globe, and were usually first used in mainframes from IBM, Cray, or others years before Intel could apply them to microprocessors. The PC spent years catching up to the state of the art in mainframes, and didn't really pass it until the early 1990s.

Still, given the problem of maintaining compatibility with existing programs written for the IA32 (Intel Architecture 32 bit) instruction set inherited from previous generations of hardware, figuring out how to shoehorn in design techniques inherited from mainframes required a bit of original cleverness itself. And in any case, no matter where they came from, the new designs did their job of making the PC that much faster than it would have been otherwise. So let's get back to the action, with the 80486 ("the 486").

Intel's sequel to the 80386 microprocessor, cunningly named the 80486, had two main themes to its design: "Throw transistors at the problem" and "separate what's going on inside the chip from what's going on outside the chip."

Extra Circuitry

The 486 executed many instructions faster than the 386 did by simply having bigger circuits with independent components to do different steps side-by-side, rather than re-using the limited circuits of a smaller chip to do the same steps one after the other. The 486 had a bigger "transistor budget" than the 386 did, due to manufacturing improvements that allowed more transistors to be packed into the same space on the wafer, and it used some of that to reduce the number of clock cycles that commonly used instructions took.

Extra circuitry was also added to the 486 to more rapidly perform floating point calculations (i.e., not just 1+1=2 but 1.639+2.7=4.339). This circuitry had previously been available as a separate add-on chip called a "Floating Point Unit Co-Processor" (FPU), so most of the available software knew how to take advantage of it already, and it was now added to the 80486's design as standard equipment.

On-Chip Cache

Separating the inside of the chip from the rest of the world started with the idea of putting a tiny amount of memory on the same microchip as the processor. This new memory was called "on-chip cache," and its job was to remember the last few instructions (and most recently used data) the processor had read from the computer's main memory. Programs spend most of their time executing the same sets of instructions over and over -- for example, drawing the same 26 lowercase letters on the screen. This means a lot of the time the chunk of memory the processor needs to look at next is something it saw just recently, and if a copy of it is still in the on-chip cache, the processor doesn't need to wait for main memory to retrieve it. (When the cache gets full, the oldest information in it gets discarded to make room for new information.)
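To make the idea concrete, here is a toy sketch in Python of a cache that behaves the way the paragraph above describes: it remembers what it fetched recently and throws out the oldest entry when it fills up. The class name TinyCache, the eight-entry capacity, and the dictionary standing in for main memory are all made up for illustration. The real 486 cache is built in hardware and organized quite differently.

```python
from collections import OrderedDict


class TinyCache:
    """Toy cache: keeps recently fetched addresses, discards the oldest when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> value, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, address, main_memory):
        if address in self.lines:
            self.hits += 1                      # fast path: no trip to RAM
            return self.lines[address]
        self.misses += 1
        value = main_memory[address]            # slow path: go out to main memory
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)      # cache is full, drop the oldest entry
        self.lines[address] = value
        return value


# Hypothetical main memory: 100 addresses holding arbitrary values.
main_memory = {addr: addr * 2 for addr in range(100)}
cache = TinyCache(capacity=8)

# A program "loop" that touches the same 4 addresses 10 times over.
for _ in range(10):
    for addr in range(4):
        cache.read(addr, main_memory)

print(cache.hits, cache.misses)  # 36 4
```

Only the first pass through the loop has to go out to main memory. The other 36 reads come straight from the cache.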

On-chip cache allows the inside of the microchip to run at a faster clock rate than the rest of the system without spending most of its time waiting for the RAM chips on the motherboard to catch up. This means you can stick an expensive processor made in a state-of-the-art microprocessor fabrication center on a cheap motherboard from some nameless factory in Taiwan, and the processor won't spend most of its time twiddling its thumbs waiting for data to arrive from or be accepted by the rest of the system.

Part of the reason that complex instructions taking several clock cycles to complete had been OK in the first place was that most PC memory was slow, and needed "wait states" to be usable in a faster system. This meant the CPU could only get fresh information from the RAM chips, say, every third clock cycle, and if it finished what it was doing in less than three clock cycles, it had to wait for the next thing to do to arrive from memory. But if the 486 had something to do in its cache, it didn't have to wait, so streamlining the instruction circuitry to reduce the clock cycles each instruction took was once again a profitable thing to do.
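The arithmetic behind that can be sketched in a few lines of Python. This is a deliberately simplified model, not a description of real 486 timing: it assumes RAM can hand over something new once every memory_interval clock cycles, and that anything already in the cache is available immediately. The function name and the numbers are illustrative.

```python
def cycles_until_next_instruction(execute_cycles, memory_interval, in_cache):
    """Clock cycles that pass before the CPU can start its next instruction.

    execute_cycles:  cycles the current instruction needs to finish
    memory_interval: RAM can deliver fresh data once per this many cycles
    in_cache:        True if the next instruction is already in on-chip cache
    """
    if in_cache:
        return execute_cycles                    # nothing to wait for
    return max(execute_cycles, memory_interval)  # the slower of the two sets the pace


# Slow RAM (one delivery every 3 cycles), next instruction NOT in cache:
print(cycles_until_next_instruction(1, 3, in_cache=False))  # 3: fast instruction, still waits
print(cycles_until_next_instruction(3, 3, in_cache=False))  # 3: slow instruction hides the wait

# Same RAM, but the next instruction is sitting in the cache:
print(cycles_until_next_instruction(1, 3, in_cache=True))   # 1: streamlining now pays off
```

Without a cache, the one-cycle instruction and the three-cycle instruction cost the same. With a cache, the faster instruction actually finishes faster.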

Prefetch Unit

To take advantage of the cache more efficiently and reduce the amount of time the CPU spends waiting for new data to work on, the CPU can read ahead to get data into the fast cache from the slow main memory even before it's actually needed. This is handled by circuitry on the processor called a "prefetch unit" that sucks in data when the circuitry connecting the CPU with the RAM (the "memory bus") would otherwise be idle. That way the processor doesn't always have to slow down (via wait states) the moment it leaves a loop it has been repeating. It can advance into a fresh section of the program at full speed for a little while, because the prefetch unit pulled fresh instructions into the cache while the processor was still running the loop.

The simplest type of prefetch unit just fetches the next few bytes after whatever the CPU is running right now, if they're not already in the cache. More complicated ones understand "jump" instructions that tell the processor to go execute some other part of the program, and load in not just what comes next in sequence, but what the program might need next if it executes the jump instruction when it gets to it.
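Here is a rough Python sketch of the simplest kind of prefetcher, the one that just grabs the next few addresses in sequence. It assumes prefetching is free (it only uses bus time that would otherwise be idle) and that the cache never fills up, neither of which is true of real hardware. The function name and the address ranges are invented for the example.

```python
def count_misses(addresses, prefetch_depth):
    """Count how often the CPU has to stop and wait for main memory.

    addresses:      the sequence of addresses the program executes
    prefetch_depth: how many addresses past the current one get pulled in early
    """
    cache = set()   # assumption: unlimited capacity, to keep the sketch short
    misses = 0
    for addr in addresses:
        if addr not in cache:
            misses += 1          # CPU stalls while this address is fetched
            cache.add(addr)
        # Sequential prefetch: load whatever comes next in memory order.
        for ahead in range(1, prefetch_depth + 1):
            cache.add(addr + ahead)
    return misses


straight_line = list(range(16))
print(count_misses(straight_line, prefetch_depth=0))  # 16
print(count_misses(straight_line, prefetch_depth=4))  # 1

# A program that runs 8 instructions, then jumps far away and runs 8 more.
with_jump = list(range(8)) + list(range(100, 108))
print(count_misses(with_jump, prefetch_depth=4))      # 2
```

The second miss in the last case is the jump itself. A sequential prefetcher has no idea the program is about to leave, which is the gap the jump-aware designs try to close.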

Clock Multiplying

After producing the first generation of 486 chips, Intel increased the separation between the inside and the outside of the processor even further with the introduction of "clock doubled" processors. If you put expensive, fast RAM (with no wait states) into a machine running a 486, the 486 can work even faster because of the cache. The outside of the chip is designed to wait for the RAM while the inside of the chip runs stuff out of the cache. If the RAM is going at the motherboard's full speed, the processor can go even faster than the motherboard can.

The on-chip cache allowed Intel to crank up the processor speed beyond what even the highest-end motherboards could keep up with. A "clock doubled" 66 megahertz (MHz) processor could talk to the rest of the computer at 33 MHz, and work at 66 MHz internally as long as it was executing instructions out of its on-chip cache. When it tried to do something that wasn't in the cache, the processor still had to sit there and do nothing until more data arrived from main RAM, so in reality it didn't run at the full 66 MHz all of the time. But that was the case with slow RAM that had wait states anyway, and on average the faster processor speed was still a big improvement.

A 66 megahertz processor receives 66 million clock signals per second, and is capable of executing 66 million instructions per second if each instruction only takes one clock cycle to complete and it never has to wait around for the RAM to feed it more data. You were wondering when I'd get around to explaining that part, weren't you?
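For anyone who wants to play with the numbers, the same arithmetic fits in one small Python function. The stall_fraction parameter (the share of clock cycles spent waiting on RAM) is a stand-in introduced for this sketch, and the 25% figure below is picked out of thin air rather than measured.

```python
def instructions_per_second(clock_mhz, cycles_per_instruction=1.0, stall_fraction=0.0):
    """Rough instruction throughput for a processor.

    clock_mhz:              clock rate in megahertz (millions of cycles per second)
    cycles_per_instruction: average clock cycles each instruction takes
    stall_fraction:         share of cycles wasted waiting for memory (0.0 to 1.0)
    """
    useful_cycles = clock_mhz * 1e6 * (1.0 - stall_fraction)
    return useful_cycles / cycles_per_instruction


# Best case: one cycle per instruction, never waiting on RAM.
print(instructions_per_second(66))                              # 66000000.0

# Same chip, but a quarter of its cycles are spent waiting (made-up figure).
print(instructions_per_second(66, stall_fraction=0.25))         # 49500000.0

# Same chip, no waiting, but instructions average two cycles each.
print(instructions_per_second(66, cycles_per_instruction=2.0))  # 33000000.0
```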

Intel didn't just clock-double 486 processors. It clock-tripled some as well, fitting a 75 MHz 486 chip (486/75) in a 25 MHz motherboard, and a 486/100 in a 33 MHz motherboard. It got a bit confusing, actually, since a "100 MHz 486" could be either a clock-tripled chip that fit in a 33 MHz motherboard, or a clock-doubled chip that went in a more expensive 50 MHz motherboard. You had to check to make sure that the motherboard and the chip you were buying could work together at full speed.
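That check is just multiplication: motherboard speed times the chip's multiplier should land on the chip's rated speed. A small Python sketch follows. The function name is invented, and the 1 MHz tolerance is an assumption added so that 3 x 33 = 99 counts as "100 MHz", since the speeds quoted above are rounded.

```python
def runs_at_full_speed(chip_mhz, multiplier, board_mhz, tolerance_mhz=1.0):
    """True if board speed x multiplier matches the chip's rated speed.

    tolerance_mhz allows for rounded marketing numbers (3 x 33 = 99, sold as 100).
    """
    return abs(board_mhz * multiplier - chip_mhz) <= tolerance_mhz


print(runs_at_full_speed(75, multiplier=3, board_mhz=25))   # True
print(runs_at_full_speed(100, multiplier=3, board_mhz=33))  # True
print(runs_at_full_speed(100, multiplier=2, board_mhz=50))  # True

# The mismatch to watch out for: a clock-doubled "100 MHz" chip on a 33 MHz board
# would only reach 66 MHz inside.
print(runs_at_full_speed(100, multiplier=2, board_mhz=33))  # False
```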

Cache was introduced to compensate for slow RAM, and when the RAM caught up with the motherboard's capabilities, the processor could shoot ahead of the motherboard with clock doubling and clock tripling. But there's still a limit to how much faster the CPU can be than the RAM it's interacting with. A larger cache and a more clever prefetch unit help, but a clock-tripled system still spends more time waiting for something to do than a clock-doubled system does. There's no point making a processor faster if it's just going to waste those extra clock cycles waiting.
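A back-of-the-envelope model in Python shows why. It assumes every instruction takes one internal clock cycle, that a fixed fraction of instructions miss the cache, and that each miss costs a fixed number of motherboard clock cycles no matter how fast the inside of the chip runs. The 10% miss rate and two-cycle penalty are invented for illustration, not measured from any real system.

```python
def waiting_share(multiplier, miss_rate, miss_penalty_bus_cycles):
    """Fraction of total time the processor spends waiting on main memory.

    All times are measured in motherboard (bus) clock cycles.
    """
    execute_time = 1.0 / multiplier                     # shrinks as the core gets faster
    waiting_time = miss_rate * miss_penalty_bus_cycles  # does NOT shrink with the core
    return waiting_time / (execute_time + waiting_time)


for multiplier in (1, 2, 3):
    share = waiting_share(multiplier, miss_rate=0.1, miss_penalty_bus_cycles=2)
    print(f"{multiplier}x core clock: {share:.1%} of the time spent waiting")
```

The waiting time per instruction stays put while the execution time shrinks, so each step up in multiplier buys less than the one before it.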

So the 486 design reached its limits, and a new approach was needed to progress further. Tomorrow: the Pentium.

Elsewhere in Fooldom tonight, a couple of Fools are dueling on Disney (NYSE: DIS), a company that many used to consider the premier Rule Maker in the media world. Check out this week's Dueling Fools to see if the Mickey/ABC/ESPN kingdom still has any dominance left in it.
