Friday, July 31, 2009

Intel's i7: Codename Nehalem

It is hard to describe Intel’s release of the Core and Core2 series of processors as anything but a slam dunk, one that has made the “blue” company the undisputed performance leader in the desktop segment. To top it off, with the migration to the P1266 45 nm process node and its hafnium-based high-k transistors, Intel has also set new standards with respect to power consumption. One of the beauties, however, in the wonderful world of processors is that there is no standing still. The Red Queen rules, and you have to run as fast as you can just to stay where you are.

Enter the next generation CPU - Codename Nehalem.

Between Legacy and Whodunnit-First

Any almost picture-perfect product puts the thumbscrews on the designers and engineers envisioning the next generation, especially since nobody wants to repeat the NetBurst scenario (which, in fairness, was not too bad at the time). At the same time, it never hurts to take a big step back and look at past accomplishments and, needless to say, at what the competition has achieved. Sometimes the borders are not exactly clear-cut; a case in point is HyperThreading, which was originally conceived as a feature of the Alpha EV8 architecture back in the last millennium but was refined and brought into production by Intel as part of the Pentium 4 feature set. Integrated memory controllers were designed by Intel into some processors that never saw the light of day, but AMD was the first company to actually bring the feature to market. This list could go on and on, but the actual point we are trying to make is that there is no point in accusing any one company of plagiarism. The entire process is called technological development and progress, and without cross-fertilization we wouldn’t even have computers to begin with. But I digress.

Nehalem Under The Hood

The Front Side Bus: Legacy Beyond Words

In all their beauty and power, Intel’s processors have had one Achilles’ heel, namely the (A)GTL+ bus, which, with the inception of the Pentium II and its back-side cache bus, became known as the front side bus or FSB. The back side of the cartridge, along with the cache, is long gone, and therefore, by definition, there can’t be a front side either, but who cares: the acronym FSB has persistently been abused and has become a colloquialism for the connection between the CPU and the system logic, particularly the memory controller hub. In detail, the FSB is a 64-bit, bidirectional, point-to-point connection between the CPU and the NorthBridge, with additional command and status lines to indicate, for example, a cache hit.
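To put a number on that 64-bit link, here is a minimal back-of-the-envelope sketch. The quad-pumped data bus (four transfers per bus clock, introduced with the Pentium 4) and the 8-byte width are facts of the architecture; the 333 MHz base clock is simply an illustrative figure corresponding to a "1333 MT/s" FSB, so adjust it for whatever bus speed you have in mind.

```c
#include <stdio.h>

/* Theoretical FSB peak bandwidth: 64-bit data bus, quad-pumped.
 * The 333 MHz base clock is illustrative (a "1333 MT/s" FSB). */
int main(void)
{
    const double bus_clock_mhz     = 333.0;  /* base bus clock (assumed) */
    const double transfers_per_clk = 4.0;    /* quad-pumped data bus     */
    const double bus_width_bytes   = 8.0;    /* 64-bit data path         */

    double mt_per_s = bus_clock_mhz * transfers_per_clk;    /* MT/s */
    double gb_per_s = mt_per_s * bus_width_bytes / 1000.0;  /* GB/s */

    printf("FSB: %.0f MT/s -> %.2f GB/s theoretical peak\n",
           mt_per_s, gb_per_s);
    return 0;
}
```

That works out to roughly 10.7 GB/s of theoretical peak, a number worth keeping in mind for the efficiency discussion below.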

With a single processor, the FSB (to stick with this atrocious misnomer) certainly suffices, at least up to a certain frequency, and since the frequency race of single-core CPUs is over, we may never know where the real limit would have been. Add a second CPU, and all of a sudden both have to share the same bus for memory accesses and for communication with the rest of the system, including the DMA traffic of any busmaster and the associated cache snooping. In the case of the P4, we measured as many as 400 CPU cycles going to waste on a single memory access by a network card or other busmastering peripheral, during which no other traffic could be handled by the FSB. Any dirty cache line likewise forces the data to be written back from the first- or second-level cache to main memory in order to be accessible to the second CPU.
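To get a feel for what those 400 wasted cycles mean, a rough sketch: the 400-cycle figure is the measurement quoted above, but the 3.0 GHz core clock and the DMA transaction rate are purely assumed for illustration, not measured values.

```c
#include <stdio.h>

/* Rough cost of the snoop stalls described above. The ~400-cycle
 * figure is from the article's P4 measurement; the 3.0 GHz clock
 * and the DMA rate are assumptions for illustration only. */
int main(void)
{
    const double cpu_ghz      = 3.0;       /* assumed core clock       */
    const double stall_cycles = 400.0;     /* per busmaster access     */
    const double dma_per_sec  = 250000.0;  /* assumed NIC DMA rate     */

    double stall_ns  = stall_cycles / cpu_ghz;          /* ns per access */
    double lost_frac = stall_ns * 1e-9 * dma_per_sec;   /* bus time lost */

    printf("%.0f ns stalled per DMA access; %.1f%% of FSB time lost "
           "at %.0f accesses/s\n",
           stall_ns, lost_frac * 100.0, dma_per_sec);
    return 0;
}
```

Even at this modest assumed DMA rate, a few percent of the bus simply evaporates before the CPUs have issued a single memory request of their own.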

Arguably, in the case of the Core2-based dual-core processors, the sharing of the L2 cache between the two cores on each die greatly remedied the situation, but even “intelligent die pairing” does not solve the issue of cache coherency between four cores on two separate dies. Even if they share the same package, the two dual-core processors still need to communicate with each other over the shared FSB. Along these lines, it only takes a look at the extremely poor memory usage efficiency of any quad-core CPU, capped at some 65% of the theoretical maximum despite the use of prefetching, buffering, and every other trick known to man, to appreciate the end-of-life symptoms of the architecture. At the time, a number of optimizations such as “Smart Memory Access” technology were implemented to ameliorate the issues at hand, but even then, and despite the fact that the Core2 in its various iterations delivered best-in-class performance, it was foreseeable that future revisions of the architecture would need to rely on entirely different strategies.
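Combining the theoretical peak from the earlier sketch with the ~65% efficiency ceiling quoted above shows just how little bandwidth each core actually gets. Again, the 1333 MT/s bus speed is an illustrative assumption; the 65% cap is the figure from the text.

```c
#include <stdio.h>

/* What the ~65% efficiency ceiling means for a quad-core on a
 * shared FSB. The 1333 MT/s bus is an illustrative assumption;
 * the 65% cap is the figure quoted in the text. */
int main(void)
{
    const double peak_gb_s  = 1333.0 * 8.0 / 1000.0; /* 64-bit, 1333 MT/s */
    const double efficiency = 0.65;                  /* cap from the text */
    const int    cores      = 4;

    double usable = peak_gb_s * efficiency;
    printf("%.2f GB/s peak -> %.2f GB/s usable, ~%.2f GB/s per core\n",
           peak_gb_s, usable, usable / cores);
    return 0;
}
```

Under these assumptions, each core is left with well under 2 GB/s of effective memory bandwidth, which goes a long way toward explaining why the next architecture had to abandon the shared bus altogether.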
