The second key to minimizing the branch mis-predict penalty lies with Intel's Dynamic Execution Engine, which keeps the Arithmetic Logic Units busy with instructions to execute. As opposed to the Pentium III, which only provided 42 instructions from which the execution units could choose, the Pentium 4 offers 126, increasing the probability that the data needed after a cache miss will be available immediately rather than having to wait to fetch it from memory. As processor frequency ramps upwards, this becomes increasingly important since system memory speed does not scale with the processor.
In addition to providing a larger window of instructions for the execution engine to choose from, enhanced branch prediction has also been provided to further reduce the number of mis-predictions. Intel estimates that mis-predictions occur about 33% less often than with the P6's branch prediction, thanks to an enhanced prediction algorithm and a 4KB branch target buffer that stores details on the history of past branches.
If you have yet to pick up on a recurring theme for the Pentium 4, here's a clue - execution. In order to further compensate for the lower IPC of the NetBurst Architecture, Intel has clocked the Arithmetic Logic Units at twice the frequency of the processor core. So, on a 1.5GHz Pentium 4, the ALUs are screaming along at 3GHz, with a latency of half a core clock cycle.
We estimate that as processor speeds increase, the integer performance of the Pentium 4 will improve, since the speed of the ALUs (which most significantly impact integer performance) escalates twice as fast.
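The double-clocked ALU arithmetic above can be sketched in a few lines. This is only an illustration of the numbers quoted in the text; the function names and the `multiplier` parameter are our own and do not come from Intel.

```python
# Illustrative sketch of the double-clocked ALU figures.
# Function names and the multiplier parameter are hypothetical.

def alu_clock_ghz(core_ghz, multiplier=2):
    """ALU frequency, assuming the ALUs run at `multiplier` times the core clock."""
    return core_ghz * multiplier


def cycle_time_ns(freq_ghz):
    """Duration of one clock cycle in nanoseconds at the given frequency."""
    return 1.0 / freq_ghz


if __name__ == "__main__":
    core = 1.5  # GHz, the launch-speed Pentium 4 used as the example in the text
    alu = alu_clock_ghz(core)
    print(f"core clock: {core} GHz, cycle time {cycle_time_ns(core):.3f} ns")
    print(f"ALU clock:  {alu} GHz, cycle time {cycle_time_ns(alu):.3f} ns")
```

Every 100MHz added to the core clock adds 200MHz to the ALU clock, which is why the ALU speed climbs twice as fast as the rated processor speed.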
One of the most dramatic additions to the NetBurst architecture is a quad-pumped 100MHz system bus, delivering the equivalent of 3.2GB/s of bandwidth. The idea behind the accelerated 64-bit bus is to match the bandwidth of the dual RDRAM channels, which also provide 3.2GB/s of theoretical bandwidth.
Of course, the signaling scheme put in place by Intel could not be 100% efficient, so there is also a buffer to help facilitate sustained 400MHz data transfers. With such a high-speed bus in place, the Pentium 4 is able to push more than three times as much data as the Pentium III (which is limited to 1.06GB/s on a 133MHz bus). For the sake of comparison, AMD's 760 chipset armed with PC2100 memory is able to push a theoretical 2.1GB/s - something we do not expect to see changed in the near future, since AMD's current roadmap shows the 266MHz bus in place for at least another 18 months or so.
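The bandwidth figures quoted here all follow from the same formula: clock rate times transfers per clock times bus width. The sketch below reproduces them. The helper name and its parameters are our own, and the RDRAM and PC2100 parameters (a 16-bit channel clocked at 400MHz with two transfers per clock, and a 64-bit bus at 133MHz with two transfers per clock) are assumptions based on the PC800 and PC2100 specifications rather than figures stated above.

```python
# Illustrative peak-bandwidth calculator; names and parameters are hypothetical.

def peak_bandwidth_gbs(clock_mhz, transfers_per_clock, width_bits, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# Pentium 4 system bus: 100MHz, quad-pumped, 64 bits wide
p4_bus = peak_bandwidth_gbs(100, 4, 64)

# Dual RDRAM channels (assumed PC800: 400MHz clock, 2 transfers/clock, 16 bits each)
rdram = peak_bandwidth_gbs(400, 2, 16, channels=2)

# Pentium III bus: 133MHz, one transfer per clock, 64 bits wide
p3_bus = peak_bandwidth_gbs(133, 1, 64)

# PC2100 DDR (assumed: 133MHz clock, 2 transfers/clock, 64 bits wide)
pc2100 = peak_bandwidth_gbs(133, 2, 64)

if __name__ == "__main__":
    for name, value in [("Pentium 4 bus", p4_bus), ("dual RDRAM", rdram),
                        ("Pentium III bus", p3_bus), ("PC2100 DDR", pc2100)]:
        print(f"{name:16s} {value:.2f} GB/s")
    print(f"Pentium 4 vs. Pentium III: {p4_bus / p3_bus:.2f}x")
```

These are theoretical peaks only; as noted above, no signaling scheme is 100% efficient, so sustained throughput will be lower.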
Now, for the first time since the launch of the Athlon, AMD will have to play catch-up to Intel's micro-architecture design. There are, of course, positive and negative aspects to each technology. For example, a single stick of RDRAM is still more than twice as expensive as PC133 memory (and remember that RDRAM modules have to be installed in pairs as well). While DDR memory is, for the most part, unavailable, it should weigh in at about half the price of RDRAM when it becomes more widely available next month. On the other hand, it still appears that the dual RDRAM channels will provide more performance than PC2100, despite the latency inherent in the Rambus architecture.
One interesting observation is the balance between the CPU, memory, and AGP busses. Whereas the i815 offered an even 1.06GB/s all around, the i850 boasts 3.2GB/s from the memory to the memory controller hub and from there to the CPU. Meanwhile, AGP bandwidth remains at 1.06GB/s, so we can only guess that Intel is working on a revision of the AGP specification to bring AGP bandwidth up to speed.