AMD's Athlon

The New King Of The Hill???

Introduction
June is rapidly approaching and with it the expected release of a whole new processor architecture from AMD. The seventh-generation x86 processor, aptly named the K7, promises to outperform, clock for clock, virtually every x86 processor currently available by a fairly significant margin, and will do so without requiring consumers to wait for enhanced software to make use of its new architecture. In this, the first of three articles on the K7, we will endeavor, on a simple level, to unlock some of the exciting new changes to the x86 CPU as designed by Advanced Micro Devices, and see how they can work together to make the fastest consumer CPU come to life.

Features Found in the AMD K7:

Slot-A Architecture
As the editor of The Super 7 Hardware Guide I was somewhat distressed to see AMD moving away from my beloved Socket 7 platform, but from both a technical and a cost standpoint the move makes sense for AMD. The first reports that came out on the platform suggested that the K7's initial release would be limited to the high-end server market, with a huge on-cartridge cache and a very high price tag. This would require fine tuning to extreme tolerances, and that is more readily accomplished on a slot-type architecture. This same slot-type architecture also facilitates the scaling down of the processor to meet the demand of the middle and lower end of the high-performance CPU market, such as the extreme gamer and mid-size business server arenas. It will also ease the pressure on system board manufacturers, in that it is so similar to Slot-1 that much of the tooling has already been perfected for the Intel Slot-1 mainboards.
The first system boards released to support the Slot-A architecture will use core logic designed by AMD, but both VIA and ALI have vowed support, and new chipsets should make their way to market before the end of the year. As far as I have been able to ascertain, the first K7s released will operate at a frequency of 500MHz and be equipped with an on-cartridge L2 cache of 512KB running at 1/2 core speed for the high-end server market and 1/3 core speed for the high-end consumer market.

128KB L1 Cache
As most of you know, cache is nothing more than high-speed memory that is located closer to your CPU for faster access to frequently used data. The first place your CPU goes for data is the L1 cache. If the data the CPU is looking for cannot be found in the L1 cache, or cannot be retrieved in the current clock cycle, it looks for it in the L2 cache. If the data cannot be found there, or retrieved during the current clock cycle, the CPU has to acquire it from the slower system DRAM - although in the case of the K6-III there is also an L3 cache on the mainboard itself, which operates at the external CPU frequency, most often 100MHz. Double the size of the L1 cache on any currently produced x86 processor, the K7's 128KB L1 cache is divided into a 64KB instruction cache and a 64KB data cache. On paper this could increase performance by as much as 20% - note the drastic increase in performance brought by the introduction of 128KB of L2 cache on the Celeron A - but a more realistic expectation would be nearer 8-10%, even with a substantial 2048-entry branch prediction table, as the L1 cache on the K6-2 and K6-III is already 64KB.
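To see why a larger L1 cache helps, the lookup order described above (L1, then L2, then system DRAM) can be turned into a rough average-access-time estimate. The sketch below is only an illustration: the hit rates and cycle counts are made-up placeholders, not measured K7 or K6 figures, and the function name is hypothetical.

```python
def average_access_cycles(levels, memory_cycles):
    """Estimate the average cost of one memory access, in clock cycles.

    levels: list of (hit_rate, latency_cycles) tuples, ordered from the
            L1 cache outward. All values are illustrative assumptions.
    memory_cycles: cost of going all the way out to system DRAM.
    """
    expected = 0.0
    reach = 1.0  # fraction of requests that get as far as this level
    for hit_rate, latency in levels:
        expected += reach * latency   # every request reaching a level pays its latency
        reach *= (1.0 - hit_rate)     # only the misses continue outward
    expected += reach * memory_cycles
    return expected


# Hypothetical numbers: same L2 and DRAM, but a better L1 hit rate
# standing in for a larger L1 cache.
smaller_l1 = average_access_cycles([(0.90, 1), (0.50, 10)], memory_cycles=100)
larger_l1 = average_access_cycles([(0.95, 1), (0.50, 10)], memory_cycles=100)
print(smaller_l1, larger_l1)
```

With these placeholder values the average cost drops noticeably when the L1 hit rate rises, which is the effect a bigger L1 is meant to deliver. How much the hit rate really improves depends on the software being run.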

512KB L2 Cache
Like the Pentium II/III from Intel, the K7 will have its L2 cache on the PCB that the Slot-A cartridge is built upon. While this cache can range in size from 512KB all the way up to a whopping 8MB, first releases will no doubt have the smaller 512KB in the form of a pair of 256KB SRAMs flanking the processor core. The price of SRAMs increases dramatically with their rated operating frequency, so it's a good bet that the consumer version of the K7 will be outfitted with cheaper -5ns SRAMs operating at 1/3 the CPU core frequency, while the server-oriented K7s will carry -3.3ns SRAMs capable of running at 250/300MHz. While the 1/2 core speed SRAMs may seem to offer a significant performance increase, the SRAMs operate in burst mode, and the read latency of each burst could mean that reads complete at almost the same rate on either part.
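The SRAM ratings and cache dividers above can be checked with some quick arithmetic. This is a minimal sketch that assumes the -5ns and -3.3ns ratings are minimum cycle times, so the top frequency is simply 1000 divided by the rating. The function names are hypothetical.

```python
def max_frequency_mhz(cycle_time_ns):
    """Highest clock an SRAM can run at, treating its rating as a cycle time."""
    return 1000.0 / cycle_time_ns


def l2_clock_mhz(core_mhz, divider):
    """L2 cache clock for a given core clock and divider (2 for 1/2, 3 for 1/3)."""
    return core_mhz / divider


def sram_is_fast_enough(cycle_time_ns, core_mhz, divider):
    """True if the SRAM rating covers the L2 clock the divider produces."""
    return max_frequency_mhz(cycle_time_ns) >= l2_clock_mhz(core_mhz, divider)


# A 500MHz core with the two dividers and two SRAM grades from the text.
for rating_ns in (5, 3.3):
    for divider in (3, 2):
        ok = sram_is_fast_enough(rating_ns, 500, divider)
        print(f"-{rating_ns}ns SRAM at 1/{divider} of 500MHz: {'ok' if ok else 'too slow'}")
```

Under that assumption a -5ns part tops out at 200MHz, enough for 1/3 of a 500MHz core but not for 1/2, while a -3.3ns part has room for the 250MHz that a 1/2 divider needs.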

Alpha EV6 Bus Architecture
Implementing the Alpha EV6 bus on the K7 is a radical move for an x86 processor. Until now, x86 processors have generally used a bus protocol whereby the CPU, L2 cache, core logic (chipset) and system memory were all interconnected on a circular bus circuit, which was reliable and efficient as long as everything was operating at the same frequency. While the clock divider found in VIA's MVP3 Super 7 chipset worked admirably, making it possible to run standard SDRAM in conjunction with the 100MHz front side bus, it nonetheless reduced CPU efficiency, as calls for and reads of data from main memory were slightly bottlenecked. The EV6 bus, on the other hand, can operate efficiently at speeds ranging from 40 to 400MHz, as the CPU is essentially taken out of the bus loop and centered between the L2 cache and core logic on a huge data pipeline that in the K7 will operate at 200MHz. As there is currently no PC200 SDRAM, the initial release K7s will use a core logic clock divider, no doubt similar to that implemented in the MVP3, to run on a 100MHz memory bus until new memory technology falls into place. This may take some time, as AMD hasn't yet gained the market strength to direct which memory technologies gain market acceptance and will, unfortunately, have to follow Intel's lead on this matter. This may create somewhat of a game of catch-up for AMD, but the company seems to have set itself up to cover as many bases as possible so that the jump to a new memory platform will go as smoothly as possible.
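The gap between a 200MHz processor bus and a 100MHz memory bus is easier to appreciate as peak bandwidth. The sketch below assumes a 64-bit data path on both sides and one transfer per clock. Neither assumption is stated in the article for the EV6 bus, so treat the results as rough illustrations only.

```python
def peak_bandwidth_mb_s(clock_mhz, width_bits):
    """Peak transfer rate in MB/s, assuming one transfer per clock."""
    return clock_mhz * width_bits / 8


def memory_clock_mhz(bus_mhz, divider):
    """Memory clock produced by a chipset clock divider."""
    return bus_mhz / divider


BUS_WIDTH_BITS = 64  # assumed width, not taken from the article

cpu_side = peak_bandwidth_mb_s(200, BUS_WIDTH_BITS)
memory_side = peak_bandwidth_mb_s(memory_clock_mhz(200, 2), BUS_WIDTH_BITS)
print(f"CPU-to-chipset peak: {cpu_side:.0f} MB/s")
print(f"Chipset-to-memory peak: {memory_side:.0f} MB/s")
```

On these assumptions the memory side can supply only half of what the processor bus can carry, which is why faster memory matters so much to the K7.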

Multi-Processing
The K7's use of the EV6 protocol also opens a door for AMD into the world of multi-processor systems, an area that Intel has had a lock on for the last several years. The EV6 bus implements a point-to-point topology. This means that if there are multiple processors within the system, each gets a dedicated connection to the chipset. In Intel-based multi-processor systems, the processors must share a single bus to the chipset. This is the technology that Intel has refused to license, and so it has pushed other x86 CPU manufacturers away from developing multi-processor systems. This technology of Intel's is also limited to the use of only four processors within a given system. AMD's development of the EV6 protocol for x86 processors opens systems up to use as many as 16 processors, provided the memory bus architecture can support it. Even though each processor gets a dedicated connection to the chipset, it must share a system bus from the chipset to main memory. This should step up development of larger and faster memory types than current 64-bit access SDRAM. You can be assured that AMD will be developing for a variety of new memory types like Rambus DRDRAM or DDR SDRAM.
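A toy model makes the difference between the two topologies concrete. It is a deliberately simplified sketch: it splits bandwidth evenly among processors, ignores arbitration and cache effects, and uses hypothetical bandwidth figures rather than real chipset specifications.

```python
def shared_bus_per_cpu(bus_mb_s, cpus):
    """Shared bus: all processors divide one connection to the chipset."""
    return bus_mb_s / cpus


def point_to_point_per_cpu(link_mb_s, memory_mb_s, cpus):
    """Point-to-point: each processor has its own link to the chipset,
    but the bus from the chipset to main memory is still shared."""
    return min(link_mb_s, memory_mb_s / cpus)


# Hypothetical figures in MB/s, chosen only to show the trend.
for cpus in (1, 2, 4):
    shared = shared_bus_per_cpu(800, cpus)
    dedicated = point_to_point_per_cpu(1600, 3200, cpus)
    print(f"{cpus} CPU(s): shared bus {shared:.0f}, point-to-point {dedicated:.0f}")
```

In this model the dedicated links stop helping once the memory bus is divided too thinly, which matches the caveat that the memory architecture has to keep up.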

Floating Point Power
The constant criticism of AMD's poor floating point performance may become a thing of the past with the release of the K7. The clever development of their 3DNow! SIMD instruction set, though quite successful, was seen by many as a cosmetic, software-enhanced cover-up for poorly implemented floating point execution. The K7 is likely to change all of that with its implementation of three fully pipelined and superscalar floating point units within the processor core.
Pipelining is a term we have all heard but many don't fully understand. Simply put, it is a technique used in processing where the processor begins executing a second instruction before the first has been completed. That way several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into sections, and each section can execute its operation concurrently with the other sections. When a section completes an operation, it passes the result to the next section in the pipeline and fetches the next operation from the preceding section. The final results of each instruction emerge at the end of the pipeline in rapid-fire succession.
The three FPUs on the K7 are each slightly different. The first FPU is referred to as FMUL and is responsible for multiplication and for complex operations such as square root and division on floating point data. Complex operations cannot be intrinsically pipelined, but since they don't stall the multiplication process completely, another unit within the processor core, called the Scheduler, tries to align the multiplication calls waiting in the K7's large buffer so that they can be calculated more efficiently during stalls in the processing of a division instruction.
The second FPU, completely independent of the others, does addition and subtraction calculations and is referred to as the FADD unit. This is also the floating point unit responsible for the execution of the 3DNow! instruction set.
The third and last FPU is referred to as FSTORE and handles MMX and FPU stores for data to be calculated. It also handles some special complex instructions that you need to be a mathematician to understand. The three FPUs are all independent and can operate in parallel, even performing out-of-order execution. Latencies on the K7's FPUs are balanced out to match those of the Pentium II, and the K7 should show a marked increase in throughput during floating point intensive applications like 3D gaming, CAD and voice recognition.
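The pipelining idea described in this section can be put into numbers with a small sketch. It assumes an idealized pipeline in which every stage takes one clock and nothing ever stalls. The stage counts used here are arbitrary examples, not the depth of the K7's actual FPU pipelines.

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction must finish every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages):
    """A new instruction enters each clock; the first needs 'stages' clocks
    to emerge, and each later one completes one clock after the previous."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def schedule(instructions, stages):
    """Return, for each clock, which instruction occupies each stage
    (None means the stage is empty on that clock)."""
    table = []
    for cycle in range(pipelined_cycles(instructions, stages)):
        row = []
        for stage in range(stages):
            instr = cycle - stage  # instruction that entered 'stage' clocks ago
            row.append(instr if 0 <= instr < instructions else None)
        table.append(row)
    return table


print(unpipelined_cycles(10, 4), pipelined_cycles(10, 4))
for cycle, row in enumerate(schedule(3, 2)):
    print(cycle, row)
```

Once the pipeline is full, one result emerges per clock, so the advantage over the unpipelined case grows with the number of instructions. Stalls, such as those caused by a division in FMUL, eat into that ideal figure.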


In Part Two we'll take a look at the new decoders, integer execution units and branch prediction table, tie everything together and offer an idea of the performance you may expect out of the new K7 processor. Until then, think about how the technology described above will change the way your system processes data. Unlike the PIII, which requires application software to unlock the potential of the CPU's performance, the K7 reflects AMD's effort to increase performance through design, by expanding and intermixing existing technologies, so that simply increasing the clock speed of the K7 becomes one of the least relevant enhancements to the way we process data.