Those who have owned Pentium processors have also used Intel's P5 microarchitecture, which employed a 5-stage pipeline. When the Pentium Pro debuted, based on the P6 architecture, the branch prediction/recovery pipeline doubled to 10 stages. Now that the Pentium 4 is upon us, the pipeline depth has doubled once again to 20 stages and been given a confusing new marketing moniker: the NetBurst microarchitecture.
Quite simply, the deeper pipeline allows for greater clock-speed scalability, which has allowed Intel to debut the Pentium 4 at 1.4 and 1.5GHz using the same 0.18-micron etching process as the Pentium III. As yields improve and the 0.13-micron process is brought online, these frequencies will likely ramp very quickly, reaching 2GHz by the middle of next year.
Not all things are peachy in the land of the 20-stage pipeline, however. By doubling the depth of the branch prediction pipe, the penalty associated with mis-predictions is greatly increased - rather than flushing 10 stages' worth of speculatively executed instructions, the Pentium 4 has to flush 20 and restart execution down the correct program branch. The recovery time on the 20-stage pipe is much longer than on the 10-stage pipe, resulting in a lower average number of instructions successfully executed per clock cycle (IPC).
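The cost of a deeper pipe can be sketched with a simple back-of-the-envelope model. The numbers below (base IPC, branch frequency, mispredict rate) are illustrative assumptions, not Intel's figures; the only point is that the stall charged per mispredict grows with pipeline depth, dragging down average IPC:

```python
# Toy performance model (illustrative numbers, not Intel's): effective IPC
# once branch-mispredict stalls are charged. A mispredict costs roughly the
# pipeline depth in wasted cycles, so a deeper pipe pays more per miss.

def effective_ipc(base_ipc, branch_freq, mispredict_rate, pipeline_depth):
    """Average instructions per cycle after mispredict stalls.

    Cycles per instruction = 1/base_ipc plus the expected stall cycles
    from each instruction's chance of being a mispredicted branch.
    """
    stall_per_instr = branch_freq * mispredict_rate * pipeline_depth
    cpi = 1.0 / base_ipc + stall_per_instr
    return 1.0 / cpi

# Same hypothetical workload (20% branches, 5% mispredicted), two depths:
ten    = effective_ipc(base_ipc=3.0, branch_freq=0.20,
                       mispredict_rate=0.05, pipeline_depth=10)
twenty = effective_ipc(base_ipc=3.0, branch_freq=0.20,
                       mispredict_rate=0.05, pipeline_depth=20)
print(f"10-stage: {ten:.2f} IPC, 20-stage: {twenty:.2f} IPC")
# -> 10-stage: 2.31 IPC, 20-stage: 1.88 IPC
```

Under these made-up parameters, doubling the pipe depth alone knocks roughly 20% off the average IPC - which is exactly the gap Intel's trace cache and execution engine are meant to claw back.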
To compensate for the lower IPC, Intel has implemented a couple of features that greatly reduce the inherent mis-predict penalty: the Execution Trace Cache and the Dynamic Execution Engine.
Level 1 cache is normally split between instruction and data caches, both of which are 16KB on the Pentium III. This go 'round, Intel has decreased the data cache to 8KB and has re-implemented the instruction cache to store decoded micro-ops along the path of program execution, so that the results of program branches are integrated into the same cache line. Fetch and decode latency is avoided because the execution engine can retrieve decoded operations from the cache directly, rather than fetching and decoding commonly used instructions over and over again. In addition, instructions that are never executed do not get stored in the cache, making the Execution Trace Cache more efficient than previous instruction-cache implementations.
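The payoff of caching micro-ops rather than raw instructions can be sketched as follows. This is a minimal toy model, not Intel's actual design: the `decode` stand-in, the two-micro-ops-per-instruction assumption, and the trace keying are all inventions for illustration. The point is that a hot loop's predicted path is decoded once, then served as ready-to-execute micro-ops on every subsequent fetch:

```python
# Minimal sketch (not Intel's implementation) of the trace-cache idea:
# decoded micro-ops are cached along the predicted execution path, so hot
# code is run through the slow x86 decoder once, not on every iteration.

def decode(instr):
    """Stand-in for the x86 decoder: split one instruction into micro-ops."""
    return [f"{instr}.uop{i}" for i in range(2)]  # pretend: 2 uops per instr

class TraceCache:
    def __init__(self):
        self.lines = {}    # predicted path -> its decoded micro-op line
        self.decodes = 0   # trips through the slow decoder

    def fetch(self, trace):
        """Fetch a trace: a tuple of instructions along the predicted path."""
        if trace not in self.lines:       # miss: decode the whole path and
            self.decodes += 1             # store it as a single cache line
            uops = []
            for instr in trace:
                uops.extend(decode(instr))
            self.lines[trace] = uops
        return self.lines[trace]          # hit: micro-ops, decoder bypassed

tc = TraceCache()
loop_body = ("add", "cmp", "jne")         # a hot loop's predicted path
for _ in range(1000):                     # fetched 1000 times...
    uops = tc.fetch(loop_body)
print(tc.decodes)                         # ...but decoded only once -> 1
```

Note that the trace, not the individual instruction, is the unit of storage here, which mirrors the article's point about branch results being folded into the same cache line.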