AMD's Athlon Part II

Although the K7 is a completely new 7th generation processor, much of the technology incorporated in the design has fairly deep roots. AMD has simply put all of these different technologies together and streamlined the way they interact to produce the K7. Generally, increasing things like the depth of the execution units and the number of registers is enough in itself - if you can pull it off - to increase performance, but that is only partially the case for AMD's new baby.

Branch Prediction
At 2048 entries, the K7's Branch Prediction Table is smaller than the one in the K6 family. I have not yet been able to ascertain why this is so, but I expect that the larger 12-entry return stack makes for a faster turnover after incorrect predictions. A Branch Prediction Table is a sort of history table which stores an entry for each conditional branch executed by the CPU while running the current application. The K7 checks each branch it encounters against this table and makes its best guess as to which way the branch will go.
The K6, with its short 6-stage pipeline, had few problems with incorrect branch predictions, as they cost only 4 clock cycles. A branch prediction miss on the more deeply pipelined (10-stage) K7 will cost more than 4 clock cycles, which is why it seems unusual for AMD to implement such a small branch prediction table. Prediction rates of 90% to 95% are critical to make sure that such a deeply pipelined, superscalar CPU does not waste clock cycles! I will be updating this preview to review status once I get hold of a K7, and perhaps I will know more then.
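To make the history-table idea a little more concrete, here is a minimal sketch in Python. It is not AMD's design: the 2-bit saturating counters, the simple modulo indexing, and the 10-cycle miss cost used in the example are all assumptions of mine for illustration. Only the 2048-entry size and the 4-cycle K6 penalty come from the text above.

```python
class BranchHistoryTable:
    """Toy history table: one 2-bit saturating counter per entry (assumed scheme)."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # every entry starts at "weakly not taken"

    def _index(self, branch_address):
        # Assumed indexing: low bits of the branch address pick the entry.
        return branch_address % self.entries

    def predict(self, branch_address):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        # Nudge the counter toward the real outcome, saturating at 0 and 3.
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def simulate_loop_branch(table, branch_address, iterations):
    """A loop branch: taken on every pass except the last. Returns the hit rate."""
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1
        if table.predict(branch_address) == taken:
            correct += 1
        table.update(branch_address, taken)
    return correct / iterations


def average_penalty(hit_rate, miss_cycles):
    """Average cycles lost per branch for a given hit rate and miss cost."""
    return (1.0 - hit_rate) * miss_cycles


if __name__ == "__main__":
    table = BranchHistoryTable()
    rate = simulate_loop_branch(table, branch_address=0x4010, iterations=100)
    print("hit rate:", rate)  # misses only the first and last pass
    print("K6-style 4-cycle miss at 90%:", average_penalty(0.90, 4))
    print("hypothetical 10-cycle miss at 90%:", average_penalty(0.90, 10))
```

The last two lines show why the hit rate matters more as the pipeline gets deeper: at the same 90% hit rate, the average cost per branch grows in step with the miss penalty.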

Universal x86 Decoders
Decoders translate the variable-length, complex x86 instructions into small, fixed-length RISC-like operations. While the PII/III and K7 each have three decode units, all three on the K7 are full, universal x86 decoders. The PII/III is limited here in that only one of its three decode units is a full decoder. The other two can decode only simple x86 instructions. This means the K7 should sustain a higher, more fluid decode rate.
x86 instructions are handled in two ways. Simple instructions of 1-15 bytes in length, which are the most common, follow what AMD calls DirectPath, which is streamlined for fast execution. For the few complex instructions, the K7's VectorPath is used. Either way, the x86 instructions are converted into simpler MacroOps. The decoding pipelines can dispatch as many as 3 MacroOps to the execution unit schedulers at a time. Each MacroOp consists of one or two operations (Ops). These Ops are then issued to the execution units.
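The difference between three full decoders and one full decoder plus two simple ones is easy to see with a toy model. The sketch below is my own simplification, not a description of either chip's real decode logic: it assumes strictly in-order decoding, treats every instruction as either "simple" or "complex", and stalls to the next cycle whenever a complex instruction lands on a slot that cannot handle it.

```python
def decode_cycles(stream, full_slots):
    """Count cycles needed to decode `stream` in order using three decode slots.

    stream     -- list of "simple" / "complex" labels in program order
    full_slots -- set of slot numbers (0..2) that can decode complex instructions
    """
    if 0 not in full_slots:
        raise ValueError("slot 0 must be a full decoder in this toy model")
    cycles = 0
    pos = 0
    while pos < len(stream):
        cycles += 1
        for slot in range(3):
            if pos >= len(stream):
                break
            if stream[pos] == "complex" and slot not in full_slots:
                break  # in-order decode: wait for the next cycle
            pos += 1
    return cycles


if __name__ == "__main__":
    # Hypothetical instruction mix, chosen only to show the effect.
    stream = ["simple", "complex", "complex", "simple", "complex", "simple"]
    print("three full decoders:", decode_cycles(stream, {0, 1, 2}))
    print("one full, two simple:", decode_cycles(stream, {0}))
```

With an all-simple stream both arrangements decode three instructions per cycle. The gap only opens up when complex instructions show up, which is where the more fluid decode rate would come from.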

Integer Execution Units
Along with its 3 FPUs, which we covered in Part One, the K7 provides three integer execution units and three address generation units, for a total of nine execution units supporting the flow of these decoded MacroOps through the processor. Together they can sustain up to 2.5 instructions executed per cycle - outperforming the PIII's rate of 2 to 2.1. To further aid this process, the K7 uses a 15-slot instruction scheduler. This is needed for out-of-order execution. When an execution unit becomes available, it can be fed an instruction out of program order, which eliminates wait states while the preceding instructions finish executing - provided, that is, there are no dependencies between the instructions. The K7's integer units are also capable of speculative execution. As with branch prediction, the processor makes its "best guess" and starts executing instructions before it knows for certain that they will be needed. Speculative execution can be an invaluable time saver provided the guess is correct, and since the guess is correct better than 90% of the time, on average, it aids execution significantly.
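Here is a rough sketch of why a scheduler that can pick instructions out of order keeps the execution units busier. It is a toy model under my own assumptions: three identical units, a made-up four-instruction program with made-up latencies, and a result that becomes usable in the cycle its latency expires. Only the 15-slot window size is taken from the text above.

```python
def schedule(instrs, units=3, window=15, out_of_order=True):
    """Return the cycle in which the last result becomes available.

    instrs -- list of (name, latency, deps) in program order, where deps is a
              set of names of earlier instructions whose results are needed
    """
    finish = {}            # name -> cycle when its result is available
    pending = list(instrs)
    cycle = 0
    while pending:
        issued = 0
        for ins in pending[:window]:   # only the scheduler window is visible
            if issued == units:
                break
            name, latency, deps = ins
            ready = all(d in finish and finish[d] <= cycle for d in deps)
            if ready:
                finish[name] = cycle + latency
                pending.remove(ins)
                issued += 1
            elif not out_of_order:
                break      # in-order issue: a stalled instruction blocks the rest
        cycle += 1
    return max(finish.values()) if finish else 0


if __name__ == "__main__":
    # Hypothetical program: b waits on a slow a, while c and d are independent of a.
    program = [
        ("a", 3, set()),
        ("b", 1, {"a"}),
        ("c", 1, set()),
        ("d", 1, {"c"}),
    ]
    print("in order:    ", schedule(program, out_of_order=False))
    print("out of order:", schedule(program, out_of_order=True))
```

In the in-order run, c and d sit behind b even though nothing stops them from running. The out-of-order run slips them in while a is still executing.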

0.25 micron Process Fabrication and Die Size
The first K7 chips will still be produced at AMD's Austin, Texas-based Fab 25, which AMD promises will shift over to a 0.18 micron process in the second half of '99, but AMD's new Dresden Fab 30 in Germany is set to produce 0.18 micron K7s.

AMD showed off a 600MHz 0.25 micron CPU at CeBit, demonstrating that their 0.25 micron process could handle the higher frequencies. But don't expect AMD to produce 0.25 micron parts for long. Moving the K7 from the 0.25 micron process, with its die size of 184mm², to the 0.18 micron process will reduce the die to a much smaller 104mm². AMD also hopes to move from aluminum interconnects to faster, cooler copper technology later this year, which should provide stability at speeds of 1GHz and maybe even higher. Yep, that's right folks, 1 GIGAHERTZ!
At 184mm², the 0.25 micron K7 die is bigger than the PIII's, and die size can have a direct bearing on the raw speed of the CPU. Larger dies require more power to run efficiently, and power increases tend to increase heat, which tends to decrease performance.
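As a quick sanity check on the shrink figures, die area scales roughly with the square of the feature size if everything shrinks uniformly. That uniform-shrink rule is a textbook idealization and my assumption here, not something AMD has stated. The two die sizes and process sizes are the ones quoted above.

```python
def ideal_shrunk_area(area_mm2, old_process_um, new_process_um):
    """Area after an idealized uniform shrink: scales with the square of feature size."""
    return area_mm2 * (new_process_um / old_process_um) ** 2


if __name__ == "__main__":
    ideal = ideal_shrunk_area(184.0, 0.25, 0.18)
    quoted = 104.0
    print("ideal 0.18 micron area: %.1f mm^2" % ideal)
    print("quoted 0.18 micron area: %.1f mm^2" % quoted)
    print("quoted die as a fraction of the 0.25 micron die: %.2f" % (quoted / 184.0))
```

The quoted 104mm² comes out a little larger than the idealized figure, which is plausible since not every structure on a chip shrinks by the full amount.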

The large die size presents a few logistical problems as well. Increased die size means fewer chips per wafer and lower yields, which will no doubt result in an initially higher cost to the consumer. Once the Dresden fab ramps up, prices should drop as production increases. This should hopefully put the K7 into the mainstream buyer's market before Christmas.
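To put rough numbers on "fewer chips per wafer and lower yields", the sketch below uses a common back-of-the-envelope dies-per-wafer approximation and a simple Poisson yield model. The 200mm wafer diameter and the defect density of 0.5 defects per square centimeter are assumptions of mine, not AMD figures, so treat the output as an illustration of the trend rather than real production numbers.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation: wafer area / die area, minus a term for edge loss."""
    radius = wafer_diameter_mm / 2.0
    whole = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(whole - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a simple Poisson defect model."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


if __name__ == "__main__":
    WAFER_MM = 200.0       # assumed wafer diameter
    DEFECT_DENSITY = 0.5   # hypothetical defects per cm^2
    for area in (184.0, 104.0):
        dies = gross_dies_per_wafer(WAFER_MM, area)
        good = dies * poisson_yield(area, DEFECT_DENSITY)
        print("%.0f mm^2 die: %d gross dies, about %.0f good dies" % (area, dies, good))
```

The smaller die wins twice: more candidates fit on each wafer, and each one is less likely to catch a defect.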