SharkyForums.Com - Print: P4 Article

    P4 Article
    By Marsolin January 11, 2001, 12:03 PM

    http://www.systemlogic.net/articles/01/1/intel/

    I would like to hear what people think about this article. I have been looking into the implications of the caching decisions made on the P4 and appreciated seeing someone else's take. It seems that the cache line width, like much of the rest of the design, will benefit high-performance (future) applications to the possible detriment of current applications.

    By Galen_of_Edgewood January 11, 2001, 01:08 PM

    quote:Originally posted by Marsolin:
    http://www.systemlogic.net/articles/01/1/intel/

    I would like to hear what people think about this article. I have been looking into the implications of the caching decisions made on the P4 and appreciated seeing someone else's take. It seems that the cache line width, like much of the rest of the design, will benefit high-performance (future) applications to the possible detriment of current applications.

    That's pretty much what I took from the article. I'm waiting on Arcadian's take on it.

    Reading the first half of the article, I was beginning to see the pluses of the P4, and the fact that it seems to have been designed with the future in mind.

    Then I started reading the second half. Well, let's just say my opinion of Intel hasn't exactly been lifted any.

    They push MHz, but at the same time they don't. I don't like hypocrisy....

    By Marsolin January 11, 2001, 01:19 PM

    I think the contradiction in messages stems from two areas. One: the target audience for Itanium is servers, which should mean the purchaser is more informed and more likely to look past clock speed. Two: Itanium was originally supposed to come out long before the P4 and Athlon pushed clock speeds to such a high level, and Intel is trying to put fingers in the dam until McKinley comes along with GHz+ speeds.

    I think the second reason carries the biggest weight.

    On another note, shouldn't you change your name to Aragorn with that sig?

    By Arcadian January 11, 2001, 04:57 PM

    I'd love to put my $.02 into this conversation. I must admit, I was both pleasantly surprised at the detail of the article, and particularly unimpressed with the conclusions. I will spend the time, however, to address each of the points individually in a number of replies so that you don't get bored with just one. Read the replies that interest you, or read them all. But there are a few things I want to say, and my first point will be in my next post. Enjoy.

    By Arcadian January 11, 2001, 05:24 PM

    Point #1: Pentium 4 Small L1 Data Cache.

    Let me first say that the author is in error on one point. The cacheline size for the Pentium 4 processor is actually 128 bytes in length, not 64 bytes. This information can be found in this document:
    http://developer.intel.com/design/pentium4/manuals/245470.htm

    I can see where the author would be mistaken, since the P7 bus does transfer cachelines 64 bytes at a time, and since bus transactions can complete out of order, the first 64 bytes can arrive later than the second 64 bytes. Eventually, though, every memory request replaces 128 bytes in the L2 cache, and then transfers that same amount into the L1 cache in regular inclusive fashion.

    The author makes a good point that the small 8KB of L1 does allow for a high miss rate in applications with erratic memory accesses, but given large data arrays, large cacheline sizes reduce the number of transactions to memory, which would otherwise severely slow down a computer.

    In order to understand Intel's decision to use a smaller, faster cache, you have to envision the needs of tomorrow's computing. Ask yourself what programs you might be using a year or two from now. Do you think word processors need to be much faster? Do you envision Internet Explorer needing more speed in computing an HTML web page (assuming no embedded media)? Of course not.

    Intel designed the Pentium 4 to process large quantities of streaming data. Intel envisioned the Internet as supplying the majority of data that your programs will crunch, and given the limited amount of bandwidth on even high speed Internet connections, data has to be designed to be streamable in small, compact quantities.

    The Pentium III and Athlon processors are much better suited to today's applications, which have erratic calculations, because many of today's compilers were not designed to take special consideration of how data is presented. The idea was for the Pentium 4 to be "as fast" as those processors in those kinds of applications. However, Intel wants the Pentium 4 to be the processor for the next 5 years, minimum. Therefore, it needs to be designed differently.
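
    To make the contrast concrete, here is a minimal C sketch of my own (an illustration, not from the article) of the two access patterns: a streaming loop uses every byte of each 128-byte line it pulls in, while a pointer-chasing loop pays for a whole line to use only a few bytes of it.

        #include <stddef.h>

        /* Streaming access: sequential, so all 128 bytes of every cache
           line fetched get used, and large lines mean fewer memory
           transactions for the same amount of data. */
        long sum_array(const int *a, size_t n) {
            long s = 0;
            for (size_t i = 0; i < n; i++)
                s += a[i];
            return s;
        }

        /* "Erratic" access: each node can land on a different cache line,
           so a full line fill is paid to read just a few bytes. */
        struct node { int value; struct node *next; };

        long sum_list(const struct node *p) {
            long s = 0;
            for (; p != NULL; p = p->next)
                s += p->value;
            return s;
        }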

    Right now we can't say for certain that Intel made a mistake in their goals, since we haven't found many applications that take advantage of the Pentium 4 processor's different approach. However, if we think of the kinds of programs that would do well, we can see that the future may hold a place for that technology.

    Consider the benchmark called SPEC2000. This benchmark is particularly good at showing a processor's abilities given optimized code. In fact, Intel uses this benchmark as a base when they make decisions about what new features to implement. Not surprisingly, the Pentium 4 measures up to nearly the performance of the high-end Alpha processor, and soundly kicks the butt of any major x86 competitor. So why doesn't it perform well on the applications that we want?

    It's because certain design decisions had to be made as to what the Pentium 4 could and could not have at the end of the design period. Limited die space dictated that certain sacrifices needed to be made. Indeed, the L1 cache is said to be one of these sacrifices.

    Intel could have made their processor much faster than the competition, but as software needs changed, they would have had to start the process all over again. This redesign might have been necessary two years down the road, when competitor AMD launches their K8 series of processors. Certainly, a different processor would be necessary to compete with that. Instead of going the route of the follower, though, Intel chose the route of the leader, and made the decision to include advances that others would not have used.

    The Pentium 4 is a different approach because Intel believes that the market will need it for future applications that work on streaming data rather than large chunks of unpredictable data. For this to happen, though, compilers need to change, and programmers need to code their applications differently.

    Most people are under the impression that Intel cannot change the way programmers write their programs, but I believe Intel's vision is one where they don't have to. As I mentioned before, the Internet requires that programs run over phone or cable wires be designed to allow for a low-bandwidth connection. This kind of programming would benefit the Pentium 4 by nature. Right now we don't have those kinds of programs, but Intel is currently trying to encourage that kind of programming style, and to increase parallelism from a code standpoint.

    Whether they succeed remains to be seen, and whether the Pentium 4 finds more programs to separate it from the competition is also uncertain. However, the scalability that Intel has in mind for the Pentium 4 suggests that they are willing to settle for lower performance now for higher performance in the future.

    I'll continue in my next post for different reasons why the Trace Cache may help.

    By Arcadian January 11, 2001, 05:50 PM

    2. The Pentium 4 Trace Cache

    The author of the article seems to underestimate the size of the trace cache, and its importance to the Pentium 4 architecture.

    The Netburst architecture is the world's first hyper-pipelined technology, or in other words, the first >20-stage pipeline. In order to support such a pipeline, certain legacy stages indicative of x86 architecture needed to be eliminated.

    The trace cache's main function is to eliminate the need for a decoding stage as part of the main pipeline. In case you don't know, decoding is the process of turning full processor instructions into smaller macro-ops (mops, as the author calls them). Macro-ops can be parallelized much more easily than full instructions, and the RISC-like engine in modern out-of-order execution (OOOE) processors requires them. Macro-ops are in turn composed of one or more micro-ops. The Pentium 4 trace cache can hold up to 12,000 micro-ops.
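
    As a rough illustration of decoding (my own simplified sketch; the real decodings are Intel-internal), one x86 instruction that both reads memory and computes has to be split into simpler operations before the out-of-order core can schedule them independently:

        /* One x86 instruction:      add eax, [ebx]
           is split by the decoder into RISC-like operations, roughly:
               op1:  tmp <- load [ebx]     (memory unit)
               op2:  eax <- eax + tmp      (ALU)
           In C terms, the same split looks like this: */
        int add_from_memory(int eax, const int *ebx) {
            int tmp = *ebx;    /* op1: the load */
            return eax + tmp;  /* op2: the add  */
        }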

    It is tough to compare this to the cache sizes of other processors. But suffice it to say that you can fit a good portion of a program or critical loop into 12,000 micro-op slots.

    The main thing to realize, though, is the order of these micro-ops. They are actually in program order, and can take branches and jumps into account. The Pentium 4 branch predictor is very elegant, and is the best in the industry, so very rarely is there a miss in the trace cache. However, when misses do occur, the Pentium 4 must go through several stages of decoding to get fresh new instructions from the L2 cache or memory.

    This does take a long time, but remember two things: one, that this happens in parallel with many other workings inside the processor, and two, that there are several execution pipelines fed from deep out-of-order queues that can keep the processor busy while more instructions are being grabbed from cache.

    That reminds me of two new points I need to bring up. The first is related to queues, and the second is related to cache bandwidth.

    The author brings up the fact that the trace cache can only supply three micro-ops per cycle, while the Pentium 4's two double-clocked ALUs can execute four. Many have falsely claimed that this mismatch is a caveat of the Pentium 4 design, but it is a myth. In fact, deep queues sit between the trace cache and the ALUs. Since throughput varies at both the trace cache and the execution units, these queues exist specifically to provide data whenever either component cannot keep up with the other. It is because of these queues that the trace cache is not a bottleneck here, and I wanted to clarify that.
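
    A toy model of my own (the stall pattern and all numbers are invented) shows why such a queue hides the 3-wide/4-wide mismatch: the ALUs cannot sustain their 4-per-cycle peak through stalls, so on average a 3-per-cycle front end keeps the queue from draining.

        #include <stdio.h>

        /* Each cycle the trace cache deposits up to 3 micro-ops into a
           queue; the ALUs drain up to 4 when work is available, but stall
           on some cycles (here, every 4th) the way real code does. */
        int main(void) {
            int queued = 0, executed = 0;
            for (int cycle = 0; cycle < 1000; cycle++) {
                queued += 3;                            /* front end */
                int drain = (cycle % 4 == 3) ? 0 : 4;   /* back end  */
                if (drain > queued) drain = queued;
                queued -= drain;
                executed += drain;
            }
            /* Prints ~3.00: the queue lets a 3-wide supply feed a
               4-wide peak consumer with almost no loss. */
            printf("average micro-ops per cycle: %.2f\n", executed / 1000.0);
            return 0;
        }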

    The second thing, about cache bandwidth, is another topic of annoyance to me, since many do not mention the fact that both the L1 and L2 caches have 48GB/s of data transfer bandwidth. This allows the caches to fill quickly in the case of a miss. It also keeps the processor from waiting for new instructions or data when the information is not immediately available.

    The Athlon processor has a pitiful cache bandwidth compared to the Pentium 4, so in many cases the Pentium 4 should easily out-maneuver it. The only reason it doesn't is simply because Intel severely underestimated just how chaotic current programs are.

    As I mentioned in the previous post, today's apps were not compiled with data management in mind. Since the Pentium 4 is very sensitive to even the smallest amount of unoptimized code, we frequently see a processor clocked 300MHz lower easily outperform the Pentium 4. Will this change as more programs get optimizations? You bet!

    We have already seen a number of cases (and I made posts about these a couple of months ago when the Pentium 4 launched) of optimized programs using Intel's compilers that receive a 400% increase in performance. As crazy as that sounds, it is true that easy optimizations can turn a slouchy Pentium 4 into a bullet train.

    These same optimizations help other product lines as well, which proves that the general fault lies with today's applications, which are not programmed to take advantage of the way processors work. However, other processor lines, such as the Pentium III and Athlon, do not get nearly the full benefit from these optimizations that the Pentium 4 does.

    Though it will take a while for these kinds of applications to arrive, their coming is inevitable (remember, in Intel's vision the Internet is becoming the main tool for data transfer in the computers of the future). Programmers will have to learn to program smarter, and the Pentium 4 will reap the benefits.

    My next topic will include more about crippling the Pentium 4.

    By Marsolin January 11, 2001, 06:29 PM

    Thanks for the posts. It's a good thing you broke them down. I'd hate for everyone to have to read one long post.

    Back to topic. I also noticed the 64 byte mistake, but forgot to mention anything. I'm glad to hear that I wasn't imagining things.

    I'm also glad that you pointed out that optimized code benefits other processors as well. I've heard many people argue against companies optimizing for the P4 because it only benefits the P4, but this is definitely untrue. The benchmarks that Tom's Hardware did with the optimized MPEG4 codec bear that out. The P4 saw the greatest increase in performance, but some people from AMD claimed it would take them some time to optimize the code any further for the Athlon. All of us would benefit from optimized code.

    By Arcadian January 11, 2001, 06:56 PM

    3. On Die Multi-Processing and Rambus

    Both these topics will be short, so I thought I'd bunch them together.

    Towards the end of the trace cache discussion, the author raises the possibility that the Pentium 4 might allow for future multiprocessing techniques using a single die. However, the author is quite skeptical that this could ever come to pass.

    I disagree with this, because from my point of view, the Pentium 4 looks in many ways like it was designed with multiprocessing in mind. The author points out the most obvious sign, which is the write-through nature of the caches, indicating that cache sharing may go on between multiple logical processors on the same die.

    I myself have entertained the notion that Jackson Technology may be some kind of multithreading technique. I still believe this for many reasons. There are several different kinds of multithreading, the easiest of which to implement is coarse-grained multithreading. The Pentium 4 actually lends itself to this kind of approach in more ways than one.

    For one thing, it has been shown that today's programs perform badly on the Pentium 4, and as I discussed earlier, it is probably because of the erratic nature of current applications. These kinds of apps cause a lot of cache misses, which creates a lot of situations where the CPU stalls waiting for data. Even though there is plenty of hardware to reduce such waiting times, it apparently isn't enough, since the Pentium 4 does not perform quite as well as some would hope.

    Coarse-grained multithreading would help to alleviate this predicament. The way this kind of multithreading works is that when one execution pipeline in the processor gets a cache miss and must wait an extended time for data, a second program thread can run independently with different data until the first thread receives the information it needs. This could possibly be a revolutionary way to deal with one of the fundamental banes of the x86 instruction set. I don't think the author gives it nearly enough credit.

    As I see it, the Pentium 4 lends itself to multithreading much more than any other x86 microarchitecture. Since Intel is highly academic in their new approaches to microprocessor design (look at the trace cache), I don't see how we can discount multithreading from being in the near future for the Pentium 4, possibly in the Pentium 4 Xeon called Foster.

    As for Rambus, this has been a hot topic of discussion for more than a year. Ever since it first launched with minimal gains for astronomical prices, the world has given Rambus the cold shoulder. Now, in light of DDR memory giving equally small amounts of performance, critics everywhere are giving Rambus a second look.

    Clearly Rambus was just as much ahead of its time as the Pentium 4, and programs are simply not mature enough to use the extra bandwidth provided by Rambus memory. This can just as easily be seen in current DDR benchmarks, where DDR technology gives 0-5% gains on most tests. Both RDRAM and DDR SDRAM were supposed to offer new levels of performance through better memory bandwidth, but it is clear that applications are just not ready to use it.

    Because of lower customer demand for Rambus, though, Intel has clearly shifted its business direction, and whether or not Rambus gives the Pentium 4 extra performance, it is clear that Intel does not support it in the same way it used to.

    My opinion, though, is that a shift away from Rambus will not hurt Pentium 4 performance under current applications, but applications of the future may indeed feel an impact. Can these impacts be lessened through newer SDRAM technologies? I don't know, because RDRAM and SDRAM certainly have their pros and cons. Each technology has the ability to improve performance in different areas, and I think that the author of the article is writing off the Pentium 4 before further tests can be conducted.

    Certainly from today's standpoint, the move to SDRAM should not affect performance in a big way, mostly because applications are still built around yesterday's technology. As programs get optimized in the future, they will no doubt be optimized around the technology available at the time, and if that technology is Pentium 4 with SDRAM, then those configurations will get the highest performance.

    Keep in mind that SDRAM is capable of reaching the bandwidth that RDRAM can reach, and the technology around that interface is still improving. Rambus' success could just as easily fall prey to political issues that prevent it from becoming the dominant PC memory technology. Whether or not this happens, I think that Intel is braced to flow with the market in either direction. We'll have to see how things turn out, but my opinion is that SDRAM will not kill the Pentium 4.

    I still plan on writing a few more responses, so stay tuned.

    By Arcadian January 11, 2001, 07:21 PM

    4. Pentium 4 and Clock Frequency

    No one will argue that the Pentium 4 was built for clock speed, though many argue that this was Intel's main consideration. I wish to disagree with that statement, as I have already shown that the Pentium 4 has much more technology than just clock frequency. However, in this post I want to concentrate on clock frequency, and expand on Intel's plan to leave other processor families in the dust.

    First, to get high speeds, you have to design a processor with plenty of small, optimized pipeline stages. The Pentium 4 Netburst architecture has at least 20 stages (there are more, but 20 is the minimum a single piece of data will run through in its course through the processor). As the author mentions, 2 stages of the Netburst architecture do absolutely nothing. This is completely correct.

    In an Intel presentation, these pipeline stages were labeled "Drive", and their main use is the following. Many engineers know that the worst obstacle to get around in a microprocessor is the wires, and the fact that each stage of the processor needs to connect cleanly and quickly to the next stage. Routing wires is important to making a fast-clocking processor, and since the Pentium 4 has so many pipeline stages, it is especially important to route the wires in a clean fashion.

    However, because of the sheer physical size of the chip, some wires had to be particularly long. In order to route the processor properly, Intel had to create pipeline stages just so that the data can move from one end of a wire to the other. Remember, computers are limited by that awfully slow speed of light. The Drive pipeline stages are integral to getting the Pentium 4 to faster speeds.

    In addition, the Pentium 4 was built with timing as a special consideration, so that multi-GHz speeds were possible even on a 180nm process (also called .18u). While the Pentium III tops out at 1GHz, the Pentium 4 was introduced at 1.5GHz. The author points this out, and I want to underscore it.

    Another nice thing about the Pentium 4's launch speed is that it happened to have a lot of headroom. Overclockers know that a good gauge of how much headroom a given processor has is how far it will overclock.

    Overclockers have known for a long time that AMD's Duron processor can easily reach 1GHz levels, simply because it can be overclocked that high with simple air cooling. In addition, it should be no surprise that Intel's Celeron can overclock to the same speeds as the Pentium III, since they both share the same die. With Intel's current Pentium III stepping, that is roughly 1.1GHz.

    With the Pentium 4, however, overclockers have shown that it can easily be overclocked to 1.7GHz or higher. Some overclockers agree that Pentium 4 overclocking is limited by the Rambus memory, which is already clocked so high by nature. Intel processors are overclocked by adjusting the front side bus speed, which affects the memory proportionately, and the Pentium 4 seems to have a decent ceiling for higher speeds.

    So what is preventing Intel from clocking the Pentium 4 higher? Well, I certainly believe they intend to go for the sky with the Pentium 4, but they are currently testing market acceptance at slower speeds, while making sure their current Pentium III lines do not get ignored by the majority of the market. This is especially crucial, since Pentium 4 sales are limited by the short supply of RDRAM modules in the market. Intel cannot easily shift their millions of processors in volume over to a new technology until there is hardware to support it.

    Intel has even claimed that 2.0GHz is possible given the current manufacturing process, and given the headroom in the early Pentium 4 chips, I wouldn't doubt that this is possible.

    Even more interesting, however, is Intel's new P860 process later this year, which shrinks transistors down to 130nm design rules. I am eager to see what kind of clock speeds the Pentium 4 can reach then. Since clock speed is a factor in performance, it looks like Intel has the Pentium 4 covered for a while to come.

    I will continue to write more responses, so get ready to catch the next one.

    By Arcadian January 11, 2001, 07:51 PM

    5. Itanium a joke... I think NOT

    Oh where, oh where do I begin? I think my main problem with this article, even more than the Pentium 4 discussion, was the downplaying of Itanium.

    There are certainly a number of reasons to be concerned about Itanium. It has been delayed for a number of years, it is being kept suspiciously secret by Intel, and the few known facts seem to suggest that it is not all it's cracked up to be.

    OK... I will let the author rant about these. However, the author continuously brings up a number of myths that I just feel obligated to dispel. In addition, the author continues to draw unnecessary conclusions from his view of the subject matter. This may just be my take, but let's explore a few things that the author has decided not to mention.

    First of all, I want to talk about the Pentium III Xeon. That's right, the Pentium III Xeon. Unbeknownst to many people on the board, or at least I would imagine so, the Pentium III Xeon has enjoyed a rare spotlight that few processors have shared.

    For those that don't know, the Pentium III Xeon is very similar to a regular Pentium III, except for a few major differences. First and most obvious are the addition of greater multiprocessing capabilities and a rather sizable on-die L2 cache. While the Pentium III has 256KB of L2 and can support 2-way multiprocessing, the Pentium III Xeon supports up to 2MB of L2 cache (all on-die), and up to 4-way glueless multiprocessing. By glueless, I mean on a single bus. One of Intel's chipsets, called the ProFusion, supports up to 8 processors, but that is through additional "glue" logic.

    One more difference, and this may be more important to some buyers than others, is price. Intel has successfully enjoyed a healthy markup on the Pentium III Xeon, and in some cases the large-cache processors can cost up to 10 times as much as a regular Pentium III processor. Why the heck is this?

    This is because only through large on-die caches can a multiprocessor system scale well in performance, and many businesses need a powerful system in a small amount of space. So rather than buy 4 Pentium III systems, they can buy one 4-way Pentium III Xeon system, which oftentimes can still outperform the 4 Pentium III systems at the same clock frequency. How is that possible? Well, to make an already long story shorter, let's just say that it is.

    Multiprocessing is crucial for a lot of businesses, and they will pay the hefty premiums for a Windows-based system that can run their business reliably for a fraction of the total cost of ownership (TCO) of a similar RISC-based system. Intel makes a good portion of its revenue from server processors like the Xeon, but one market they can't seem to penetrate is the high-performance, high-availability, large-storage "back end" that RISC-based CPU manufacturers like Sun and IBM have enjoyed for many years.

    In order to get into this segment, Intel needs 64-bit memory addressing, and that's where IA-64 comes in. However, instead of settling for a 64-bit x86 solution, Intel has realized that x86 can only go so far before it hits a performance plateau.

    IA-64 promises to be a new generation in performance, but it still needs to get off the ground. It needs support, and it needs software. Intel has been working very hard in these areas, and that is one reason why Merced, the first version of the IA-64 processor, has been delayed so many times. Still, software takes time to develop, and platforms take time to build, so IA-64 is trying to make a good first impression while many consider it long overdue.

    Some people claim that EPIC (the instruction set for IA-64 processors) may be the last new instruction set ever created. This is because the cost involved in putting out such a technology in the modern processor economy is too great, and anyone short of Intel would not be able to accomplish such a feat. All other current architectures have enjoyed years of optimization, so any new architecture, even one that can outperform the others given similar levels of optimization, can easily fail: support for such an advancement takes so long that the loss of immediate gains may in fact doom the technology.

    This is the reason why so many are pessimistic about Itanium, because they look at the costs, and the competition, and wonder if Merced will be too little, too late.

    I don't think this is true, though, and I have plenty of reasons why. So far I have illustrated the success of the Pentium III Xeon, and given you the background behind Itanium and EPIC. But, since this post is already getting long, I will create one more to continue my discussion of why Itanium is better than the author of the article puts it out to be.

    By Arcadian January 11, 2001, 08:32 PM

    6. Itanium Continued

    I have taken a lot of time out to give my opinion on the article in question. I would like to continue, since it gives me satisfaction to put my opinions on these message boards. I hope some people are reading this, since a lot went into it, but at the least, it gets my thoughts written down, and I always like discussing computer architecture.

    On the topic of Itanium, though, I have already introduced you to the idea that the Pentium III Xeon has enjoyed amazing success in the server market, even at its current high markup in price. The fact is that Itanium will occupy the former price points of the Pentium III Xeon, and sit on top of the places where the Xeon previously stood.

    As far as price is concerned, and because Itanium will be occupying a number of segments currently occupied by the Pentium III Xeon, I find it unfair that the author compares Itanium to the Pentium 4 and other RISC architectures. It is true that the Itanium competes with other RISC architectures, but for that matter, so does the Pentium III Xeon. And in many cases, the Pentium III Xeon has proven superior. It is also true that Itanium does not share some of the technical advances of the Pentium 4, but is that reason to condemn it? The Pentium III Xeon does not share those advantages either, yet it sells quite well. Therefore, I am going to compare the Itanium to the Pentium III Xeon, and this should be a fair comparison, since the Xeon does hold a small share of the "back end" market.

    First, I want to talk clock speed. The author makes a big spiel about Itanium's misfortune at having a measly 800MHz maximum clock frequency. Compared to the Pentium III Xeon, however, this is an upgrade. That's because although the Pentium III comes in flavors up to 1.0GHz, the Xeon with the large L2 cache only clocks up to 700MHz. Remember that the Pentium III Xeon still sells quite well, and in fact outperforms many RISC designs. Therefore, on clock frequency alone the Itanium is an upgrade from the Pentium III Xeon, and clock frequency had better not be used against it.

    To put things in a different perspective, let's consider Itanium's major RISC competition. The Alpha processor, which some consider the fastest processor in the world, tops out at 866MHz, and even those are in extremely short supply. The next largest speed grade is 733MHz. The PA-RISC processor tops out at 600MHz, and the more common frequencies are even lower. IBM's RS64 chip tops out at 800MHz as well. Finally, Sun's UltraSparc tops out at 900MHz, but more common frequencies include 800MHz and 700MHz. Therefore, from a frequency perspective, Itanium is not far behind, and in many cases it is ahead!

    Second, I want to talk about bus bandwidth. Both Itanium and the Pentium III Xeon use a 64-bit bus to memory. The Pentium III Xeon, however, uses the older P6 bus technology that runs at 100MHz. Therefore, it has 800MB/s of front side bus bandwidth. The Itanium has a 133MHz double data rate (DDR) bus, for an equivalent 266MHz. The author condemns this technology by comparing it to the Pentium 4. While the Pentium 4 enjoys copious amounts of bus bandwidth, the Itanium is no slouch. It happens to have a bus bandwidth of 2.1GB/s, which is about 2.5 times that of the Pentium III Xeon bus. Again, compared to the Xeon, the Itanium seems faster in many respects. Therefore, why wouldn't it be successful in a similar market segment as the Xeon?
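
    For reference, here is the arithmetic behind those figures (my own restatement of the numbers above): bus bandwidth is just width in bytes times clock times transfers per clock.

        #include <stdio.h>

        int main(void) {
            /* 64-bit (8-byte) bus, 100MHz, one transfer per clock */
            double xeon_mb_s    = 8 * 100.0 * 1;
            /* 64-bit (8-byte) bus, 133MHz, double data rate */
            double itanium_mb_s = 8 * 133.0 * 2;
            printf("Pentium III Xeon bus: %4.0f MB/s\n", xeon_mb_s);           /* 800 MB/s  */
            printf("Itanium bus:          %4.1f GB/s\n", itanium_mb_s / 1000); /* ~2.1 GB/s */
            return 0;
        }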

    The author also seems to suggest that the 2.1GB/s may not be enough for 16-way configurations, as have been planned for Itanium. Well, the Pentium III Xeon has been successful with 800MB/s of glueless 4-way support, and to my knowledge, the Itanium will also support only 4 processors in a glueless configuration. For 16-way configurations, I believe that OEMs will design their own solutions with larger front side bus bandwidths. Therefore, the conclusion made by the author is not really effective in condemning the Itanium processor.

    The author has also argued that Intel's secretive nature regarding SPEC scores suggests that they are sub-par. Of all the idiotic conclusions... I believe the author is really reaching here. Intel is secretive about a lot of things. SPEC scores give competitors an advantage, since they reveal the performance of the Itanium chip, so it is to Intel's advantage to keep them secret. To assume that the secrecy is the result of a failed product is an illogical conclusion.

    I can see where the author makes this mistake, though. It is a highly regarded opinion in the engineering community that the compiler technology necessary to give optimal performance to EPIC-based computers is not possible with today's programming. While that may be a highly regarded opinion, I believe that is all it is. I don't think Intel would push so far forward with the IA-64 architecture if they didn't think they could program for it.

    Surely, without software, hardware is useless. And in the case of Itanium, bad compilers = bad performance. However, I believe it is too early to make that assumption, and Intel has already shown that their compiler technology is very advanced. Their VTune analyzer profiles code so that it can be optimized. Additionally, their version 5.0 compiler gives serious gains in performance for Pentium 4 processors, as well as others.

    Intel has always been on the cutting edge of compiler technology, and there is no reason to believe otherwise for Itanium. Where Itanium and the Pentium 4 differ, however, is that Itanium is a new architecture, and Intel still enjoys the luxury of making things up as they go along. Intel can set compiler standards early, for example, to make sure that developers get the most performance out of an Itanium chip.

    By architecture alone, the Itanium should run circles around the Pentium III Xeon. Memory bandwidth and front side bus bandwidth will also ensure that the Itanium gets plenty of data to churn through. I have heard that Intel will probably be going with standard PC133 memory for Itanium, so we already know it will be inexpensive and allow for large memory configurations.

    By interleaving the PC133 memory, Intel can get the peak bandwidths of DDR or Rambus memory, while still maintaining the low latency of standard SDRAM. The author makes the mistake of assuming that Intel will pair the wrong memory technology with Itanium, but I don't think this will be the case.

    The author also speaks about EPIC being an "in order" architecture. While this is true, the author should not make the comparisons he does, since EPIC behaves very differently than a CISC "in order" architecture. With EPIC, instructions are highly parallelized, and have no branches (this is because of predication and speculation, two hardware techniques I won't go into unless I get a request). With CISC, instructions are not very parallelized, and load/store instructions are rampant, which can cause many cache misses. Also, CISC instructions tend to allow for a lot of branches.
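
    To give the flavor of predication anyway, here is a minimal C sketch of my own (a compiler-level analogy, not actual IA-64 code): the branch is replaced by computing both outcomes and selecting one with a predicate, so there is nothing to mispredict.

        /* Branchy version: the CPU must predict which path is taken,
           and a misprediction costs a pipeline flush. */
        int max_branchy(int a, int b) {
            if (a > b)
                return a;
            return b;
        }

        /* Predicated version: both outcomes are available and a 0/1
           predicate selects one. x86 compilers can emit a conditional
           move for this; IA-64 generalizes the idea with predicate
           registers guarding whole instructions. */
        int max_predicated(int a, int b) {
            int p = (a > b);            /* the predicate */
            return p * a + (1 - p) * b; /* no branch at all */
        }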

    Since EPIC is so different, it is unfair of the author to condemn its "in order" nature. Performance will probably turn out to keep up with current RISC designs, even with the first-generation Merced design. The future McKinley design offers even greater performance through a more experienced and well-thought-out EPIC IA-64 design. Therefore, I am thinking that Itanium should be fairly successful at first, and even more successful as time goes on.

    Of course it will still take a while before enough software and platforms support Itanium, but OEMs are already very excited. Since Itanium is a joint effort between Intel and HP, expect a number of HP Itanium systems. Also, Compaq, IBM, Silicon Graphics, Unisys, NEC, Dell, and others have already announced such support. With all it has going for it, it is unlikely that Intel will snatch defeat from the jaws of victory.

    This is all I'm writing for now, but I would love to continue if more people wish to hear more. Thanks for reading this, and I welcome any comments.

    By Humus January 12, 2001, 08:52 AM

    Well, I didn't have the time to read everything, but I have a few comments on the stuff I read.
    I don't think for a moment that future applications will be more "streaming". Rather the opposite: more branches, more pointer jumping, more non-linear memory accesses.
    Simply because performance has become less and less important. Most programmers don't know and don't care much about the internal structure of the different chips, and believing that programmers will change their programming style because of a new processor architecture is too naive. It simply won't happen.

    By Arcadian January 12, 2001, 10:58 AM

    quote:Originally posted by Humus:
    Well, I didn't have the time to read everything, but I have a few comments on the stuff I read.
    I don't think for a moment that future applications will be more "streaming". Rather the opposite: more branches, more pointer jumping, more non-linear memory accesses.
    Simply because performance has become less and less important. Most programmers don't know and don't care much about the internal structure of the different chips, and believing that programmers will change their programming style because of a new processor architecture is too naive. It simply won't happen.

    I don't believe I was suggesting that at all. It seems that you must have missed an important thing that I wrote, since you claim you didn't read everything. I was talking about Intel's vision when designing the Pentium 4.

    What I believe is that they are planning on the future of applications as being Internet based. In other words, the executable of the application will be on your hard drive, but the data will travel over the Internet. Such applications already exist. Examples are Real Audio/Video, Winamp's Shoutcast, and those VRML viewers of yesteryear. I believe Intel's vision is that these kinds of applications will become so popular, that they will be the majority of applications in the future.

    It doesn't take a lot of imagination to picture graphics or 3D applications running plugins or getting mesh data from online sources, and then computing that data as it is received. You may also be using your Office applications on a web page; such apps already exist (it's just a matter of time before Microsoft follows suit). Therefore, it's not hard to imagine that Intel's vision could become a reality... could.

    Based on the assumption that the above is true, I reasoned as follows. In order to optimize applications around the Internet, you need to take into account that the data you receive is coming in over a connection with limited bandwidth. Programmers wishing to write such applications must write them in a streaming manner. They simply have to allow data to come in small, compact amounts, and the applications have to quickly compute each packet in this streaming fashion. Since these applications would really favor the Pentium 4, and those applications would very likely be the future, I made the assumption that programmers would be optimizing programs on their own, without much convincing from Intel.
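
    As a concrete illustration, here is a minimal C sketch of my own of that streaming style (the packet size and the checksum task are invented for the example): the program touches each small packet once, sequentially, as it arrives, instead of buffering the whole payload first.

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            uint8_t packet[1024];   /* small, compact working set */
            uint64_t checksum = 0;
            size_t n;
            /* Consume the input in small packets and process each one
               immediately: the hot data stays resident in cache, and the
               access pattern is linear, exactly what a deep, streaming-
               oriented pipeline is built for. */
            while ((n = fread(packet, 1, sizeof packet, stdin)) > 0)
                for (size_t i = 0; i < n; i++)
                    checksum += packet[i];
            printf("checksum: %llu\n", (unsigned long long)checksum);
            return 0;
        }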

    However, Intel isn't going to sit back and wait for this revolution to take place. They have made compilers and performance analyzers to help developers make programs like these, and they will continue to provide the means to make this transition easier.

    To summarize: through both the direction of software and Intel's help, applications will change to favor the Pentium 4. Put better, though, the Pentium 4 was designed to take advantage of future applications, not future applications changed by Intel to support the Pentium 4. Hope this helps.

    By Humus January 12, 2001, 06:04 PM

    Don't know if you misunderstood me to have misunderstood you... but anyway, I didn't suggest that you thought what I mentioned in my post; rather, I was talking about Intel's strategy when designing the P4.
    Anyway, sure, the "webisation" (or whatever you should call it) of applications is a very real trend (one which I partially find a little disturbing). But the CPU is not the bottleneck in this, and probably will not be in the foreseeable future. Sure, MPEG4 decoding is a heavy task, and viewing high-quality DivX movies fullscreen takes quite a lot of CPU power and needs something like a GHz machine to be smooth. But most of the media streaming over the net is not, and probably will not be, very CPU intensive. You cannot keep compressing video streams toward infinity with ever more complex algorithms. Sooner or later it won't matter anymore, since the speed of the net will be sufficient and there will be no need to compress further. I'm on a 10Mbit line myself; streaming MPEG4 movies would be no problem. In just a few years almost everyone will have broadband connections (at least in Sweden, if the government's plans come through, and they are well on their way).
    Also, the "webisation" will not affect all applications. Most will probably not get more than a web-update function, where download speed is all that matters.
    Anyway, the P4 perhaps running faster on future applications is a good thing, but it shouldn't be paid for with lower performance on today's applications. And a lot of people are still using old compilers.

    By Marsolin January 12, 2001, 07:28 PM

    Is slightly slower performance than an Athlon 1.2GHz on some of today's programs really a problem? They aren't slow, just not the fastest available, so most people will still see a performance improvement. I think we need to consider why people upgrade, and that's what Intel tried to do. Obviously some people disagree with that, but I think that is simply taking a short-sighted view by focusing too much on current benchmarks.

    The true success (or failure) of the design will emerge when benchmarks appear that grind the 1 GHz processors to a halt. Many benchmarks out there don't even really stress a processor these days.

    Those tasks that make us sit at our computers and wait are what will cause most people to go out and buy a new system. Streaming, video rendering, and compression are a few types of applications that do cause people to upgrade.

    By Marsolin January 12, 2001, 07:40 PM

    Itanium Bandwidth

    I agree with Arcadian about the author's statements regarding Itanium's system bus bandwidth. Compared to the current P4 implementation it is slightly slow, but it will improve. You also have to take the larger cache amounts into account when considering bandwidth needs. If you have four processors with 4MB of L3 cache each (the most Itanium supports), each processor's demand on bus bandwidth must decrease compared to a chip with only 256kB of L2 cache.

    Memory bandwidth will also not be a problem, despite the use of single data rate SDRAM. The peak memory bandwidth on Intel's reference systems is either 2.1 or 4.2 GB/sec, depending upon the implementation. As those get upgraded to DDR channels for McKinley (as most people expect), the situation will only get better. I'm not sure how this compares to Itanium's RISC competitors, but it will certainly outstrip all x86 rivals that I'm aware of.

    By Arcadian January 12, 2001, 07:57 PM

    quote:Originally posted by Marsolin:
    Is slightly slower performance than an Athlon 1.2GHz on some of today's programs really a problem? They aren't slow, just not the fastest available, so most people will still see a performance improvement. I think we need to consider why people upgrade, and that's what Intel tried to do. Obviously some people disagree with that, but I think that is simply taking a short-sighted view by focusing too much on current benchmarks.

    The true success (or failure) of the design will emerge when benchmarks appear that grind the 1 GHz processors to a halt. Many benchmarks out there don't even really stress a processor these days.

    Those tasks that make us sit at our computers and wait are what will cause most people to go out and buy a new system. Streaming, video rendering, and compression are a few types of applications that do cause people to upgrade.

    Agreed. The Pentium 4 doesn't make sense if you were buying a computer today, and it will likely only sell well in retail (where most buyers don't know the difference between a hard drive and a video card). Later on, though, we'll get to see if Intel's crystal ball was correct, and whether they did indeed design a worthy 7th-generation microprocessor. It's pretty clear that Intel is easing the Pentium 4 in slowly, so as to not flood the market. I'm interested in seeing how things develop, though.

    By Elxman January 12, 2001, 10:17 PM

    Phew, dang, all that reading. I must say that it is very informative, and unlike most sites condemning the P4 for relatively sub-par performance, your posts do give more information as to why that is. And I agree with you that programming should not be done in an erratic way. I'll just refer to my own experience: doing one subject of homework at once is easier than doing half of one subject and another half of some other subject.
    Although the CPUs can process info much faster, you see the point. I think the P4 is kinda like the Radeon in a way: innovative features that will most likely be used in future apps.

    By Xcom_Cheetah January 13, 2001, 02:30 PM

    Such a piece of engineering... how can it be possible? There can be some wrong decisions, but not on this level. Intel knows that at least their next 5 years of growth and revenue are heavily dependent on this P4 architecture, so how could they make such a flop of a processor (according to some reviewing sites)? But there are a couple of points I would like to get clear on (if anybody can help):
    Firstly, why didn't Intel include another FPU? I mean, AMD has 3, so I think Intel should have gone with 2.
    Secondly, what is the branch prediction ratio of the P4? I think I read in Ace's Hardware's review of the P4 that it is not up to the claims Intel is making, and is not better than the PIII. Is it true?
    Thirdly, what will be the possible additions/subtractions in the P4 architecture when it makes its transition to the .13 micron process?
    Lastly, won't P4 performance suffer if it is paired with DDR SDRAM, since it is specifically built around RDRAM technology?
    Also:
    The following two links show comparisons between the P4, Athlon, and Itanium: http://www.aceshardware.com/Spades/read_news.php?post_id=20000322 http://www.netlib.org/atlas/atlas-comm/msg00187.html
    In one comparison the P4 shows quite a performance, and in the other it is the totally opposite picture. Can you please explain it? Also, which one is nearer to real-world application performance?

    By Arcadian January 13, 2001, 03:08 PM

    quote:Originally posted by Xcom_Cheetah:
    Such a piece of engineering... how can it be possible? There can be some wrong decisions, but not on this level. Intel knows that at least their next 5 years of growth and revenue are heavily dependent on this P4 architecture, so how could they make such a flop of a processor (according to some reviewing sites)? But there are a couple of points I would like to get clear on (if anybody can help):

    Well, I'll be happy to answer your questions. Thanks for responding to this, too. I know that the Pentium 4 has plenty of caveats right now, but I think it will become more compelling in the future. Instead of reiterating what I said in my longer posts, though, let me just answer your questions.

    quote:Firstly, why didn't Intel include another FPU? I mean, AMD has 3, so I think Intel should have gone with 2.

    I believe I read somewhere that the Intel engineers in charge of the Pentium 4 essentially ran out of space, and decided to cut back on some features in order to have a die size that they were comfortable with.

    The thing is, though, that SSE optimization can increase performance well beyond what adding more FPU pipelines would. I'm sure that Intel was in the position of choosing between SSE performance and more FPU pipelines. The decision in the end was to make SSE higher performance, so that they could save room by eliminating the second FPU pipeline. Whether this was a good or bad decision depends on when optimized applications arrive for the Pentium 4.

    quote:Secondly, what is the branch prediction ratio of the P4? I think I read in Ace's Hardware's review of the P4 that it is not up to the claims Intel is making, and is not better than the PIII. Is it true?

    Absolutely not! The Pentium 4 branch predictor is the best of any processor that exists. However, the penalty for mispredicted branches also happens to be larger than on any other processor, and that cancels out most of the benefit of the excellent branch predictor. Overall, I believe the Pentium 4 is slightly worse than the Pentium III on average at branch handling, whose net cost depends on the product of the misprediction rate and the misprediction penalty.
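
    To show what that product means, here is a small worked example with invented numbers (neither figure is an official one): a better predictor paired with a longer pipeline can still lose on the average cost per branch.

        #include <stdio.h>

        int main(void) {
            /* average stall per branch = (1 - accuracy) * flush penalty */
            double piii = (1.0 - 0.92) * 10.0; /* say 92% accurate, ~10-cycle flush */
            double p4   = (1.0 - 0.95) * 20.0; /* say 95% accurate, ~20-cycle flush */
            printf("PIII: %.2f cycles lost per branch\n", piii); /* 0.80 */
            printf("P4:   %.2f cycles lost per branch\n", p4);   /* 1.00 */
            return 0;
        }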

    quote:Thirdly, what will be the possible additions/subtractions in the P4 architecture when it makes its transition to the .13 micron process?

    Intel can certainly add a lot more to the Pentium 4 with smaller design rules, but don't expect this to happen overnight. The fact is that each new design change requires massive validation and debug efforts. There is no doubt that Intel is probably working on a design that offers more logic in the Pentium 4 at smaller design rules such as .13u, but it will probably not be available this year.

    Intel will be releasing a .13u version of the Pentium 4, code-named Northwood, later this year. But besides offering much higher clock frequencies, I don't believe it will be much different than the current Pentium 4. I believe it may include more L2 cache, though, which should increase performance by 5-10% at most on some applications.

    quote:Lastly, won't P4 performance suffer if it is paired with DDR SDRAM, since it is specifically built around RDRAM technology?

    Again, this will depend on the application. I believe that DDR SDRAM has its own advantages and disadvantages, and performance will differ depending on the programming. Personally, I don't think getting rid of RDRAM will impact Pentium 4 performance in a big way.

    quote:Also: The following two links show comparisons between the P4, Athlon, and Itanium: http://www.aceshardware.com/Spades/read_news.php?post_id=20000322 http://www.netlib.org/atlas/atlas-comm/msg00187.html
    In one comparison the P4 shows quite a performance, and in the other it is the totally opposite picture. Can you please explain it? Also, which one is nearer to real-world application performance?

    I actually just posted a topic discussing the second link. It is part of the Itanium preview topic. Check out what I had to say about it there.

    Also, any feedback and comments are appreciated.

    By Xcom_Cheetah January 15, 2001, 08:16 AM

    Thanks for the reply... but one thing remains unresolved: what is the actual branch prediction rate? The PIII has 92%, the AMD Thunderbird around 93-94%, and the best was the K6-2 at 95%. So is the P4's branch prediction equal to or greater than 95%?

    Lastly, a little off-topic... how far can AMD stretch the Athlon core with every bit of tweaking? Will 2.0GHz be the limit? And how far can the P4 core go? Anywhere near 10GHz, or not?

    By Arcadian January 15, 2001, 11:54 AM

    quote:Originally posted by Xcom_Cheetah:
    Thanks for the reply... but one thing remains unresolved: what is the actual branch prediction rate? The PIII has 92%, the AMD Thunderbird around 93-94%, and the best was the K6-2 at 95%. So is the P4's branch prediction equal to or greater than 95%?

    Any numbers that you read about are essentially meaningless. Branch prediction depends heavily on the sort of code used, and this can vary widely depending on the application. The only standardized way to compare branch prediction is through compiled versions of SPEC2000. SPEC2000 is an industry-wide benchmark that companies use to evaluate their performance on a level playing field. Although SPEC2000 (like any benchmark) is hardly perfect, it does give a good idea of the performance of various processor functions. Using SPEC2000 to measure branch prediction, Intel claims that the Pentium 4 can outdo any other processor on the market. More specific information is simply not available.

    quote:Originally posted by Xcom_Cheetah:
    Lastly, a little off-topic... how far can AMD stretch the Athlon core with every bit of tweaking? Will 2.0GHz be the limit? And how far can the P4 core go? Anywhere near 10GHz, or not?

    I believe the Thunderbird, as we know it today, has hit its maximum clock speed at 1.2GHz. Further tweaking will probably get a little more out of it, and Palomino may in fact get a little more. AMD claims that they can achieve 1.7GHz in 2001 before they move to their .13u process. However, I believe that even 1.7GHz cannot be achieved without a "Half shrink" to .15u. These are just my perceptions, and I could be wrong, though.

    By awa64 January 15, 2001, 12:03 PM

    1.7GHz? I doubt that with the current process that's possible. If it was, you'd need HUGE heatsinks to do it.

    By Marsolin January 15, 2001, 12:19 PM

    quote:Originally posted by Arcadian:
    Intel will be releasing a .13u version of the Pentium 4, code-named Northwood, later this year. But besides offering much higher clock frequencies, I don't believe it will be much different than the current Pentium 4. I believe it may include more L2 cache, though, which should increase performance by 5-10% at most on some applications.

    You are correct here, Arcadian. The only major change to Northwood will be a doubling of the cache to 512kB. The other changes, like lower power consumption, will just be a result of the die shrink.

    Tualatin was also originally supposed to feature a doubled cache, but it has since been dropped back to 256kB because it won't be sold as a performance part.

    By Conrad Song January 15, 2001, 12:33 PM

    quote:Originally posted by Xcom_Cheetah:
    firstly why intel didn;t included another FPU.. i mean AMD has 3 so intel i think should have gone with 2..??

    You know, this is pure marketing hype. Look carefully, and the Athlon's 3 FPU units consist of: one LD/ST, one FADD w/ MMX, one FMUL w/ MMX. Compare this to the Pentium 4's two: one LD/ST, one FADD/FMUL.

    By m538 January 15, 2001, 08:35 PM

    1. Wow, two stages of the pipeline do nothing. They just kick charges from one side of the chip to the other. When Intel introduced Slot 1, people started talking as if CPU progress had gone wrong, but when AMD also introduced Slot A, things became normal. I personally think that eventually all processors will be on slots (for example, four CPUs on a horizontally placed mainboard) dropped into a cooling liquid.
    Now you are nervous about a 10% performance waste; a year later you will be talking about the "drive percentage".

    2. As I see it, Intel JUST didn't complete such a revolutionary product (the Pentium 4) with an appropriate cache size. Now they probably pay money to people around the world to write defensive articles. "The Pentium 4 was designed for the programs of the future, for optimized code..." More likely, stream programming suffers from the tiny cache less than the other 90% of programming does. I just love these words originally posted by Humus:

    I don't think for a moment that future applications will be more "streaming". Rather the opposite: more branches, more pointer jumping, more non-linear memory accesses. Simply because performance has become less and less important. Most programmers don't know and don't care much about the internal structure of the different chips.

    More multithreading too, and I am very skeptical that Jackson technology will solve the multithreading of the FUTURE. Real multiprocessor solutions aren't for the mainstream market. Furthermore, it seems that price-for-price the Athlon (1.6GHz, for example) will slightly outperform the Pentium 4 (1.4GHz, for example) with modern erratic programming, and it would not be surprising if the Athlon slightly outperformed the Pentium 4 in every streaming task except those designed for SSE2.

    I suppose the most successful part of the Pentium 4 system is... the RDRAM memory. It alone keeps the whole system from being overwhelmed by modern programs, especially Windows. But with higher-clocked processors, RDRAM will not do the job that cache must do. Where is the 1MB of level 3 cache? Where is the 16K of level 1 cache? Intel doesn't like caches, but programmers do. And I can say why: optimized code and bug-free code are mutually exclusive. It isn't only legacy that makes Windows huge; it's also a style of programming that theoretically ensures your code has no bugs, and you can be sure that style reduces the number of bugs many times over. OK, an MPEG4 decoder will fit within 12,000 mops this year, but I must say one thing in case you don't know: there are many other useful, CPU-hungry applications that never will. And one of them (coming soon) is erratic just like humans: it is AI.

    By Thoth January 16, 2001, 03:49 AM

    It sounds to me like programming for the P4 could be more work than it's worth. Maybe Intel made the P4 too soon.

    By Arcadian January 16, 2001, 10:25 AM

    quote:Originally posted by Thoth:
    It sounds to me like programming for the P4 could be more work than it's worth. Maybe Intel made the P4 too soon.

    More work than it's worth? Actually, there are compilers that do it for you. Intel sells them. You simply recompile your code, and you can get as much as 3x the performance. Isn't that worth it?
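
    To make that concrete, here is a hypothetical sketch (mine, not from Intel's docs; the function name and numbers are made up) of the kind of streaming loop that sees the big recompile wins:

    code:
    #include <stdio.h>

    /* A hypothetical "streaming" loop: linear passes over arrays, no
     * branches, no pointer chasing -- exactly the shape an SSE2-aware
     * vectorizing compiler can turn into packed double-precision ops. */
    void daxpy(double *dst, const double *a, const double *b, double k, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = k * a[i] + b[i];
    }

    int main(void)
    {
        double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, d[4];
        daxpy(d, a, b, 2.0, 4);
        printf("%g %g %g %g\n", d[0], d[1], d[2], d[3]);  /* 6 7 8 9 */
        return 0;
    }

    Branchy, pointer-heavy code won't reshape into packed instructions like this, which is why the gains vary so much from program to program.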

    By Humus January 16, 2001, 10:27 AM

    quote:Originally posted by Arcadian:
    More work than it's worth? Actually, there are compilers that do it for you. Intel sells them. You simply recompile your code, and you can get as much as 3x the performance. Isn't that worth it?

    *cough* 3x in cases where the code contains a lot of parallelism etc. In general you get less. But it's probably the best compiler.

    BTW, congrats on your 1024th post, Arcadian!

    By Arcadian January 16, 2001, 11:10 AM

    quote:Originally posted by Humus:
    *cough* 3x in cases where the code contains a lot of parallelism etc. In general you get less. But it's probably the best compiler.

    BTW, congrats on your 1024th post, Arcadian!

    Well, be that as it may, if a program gains even 20-30% in performance, then it would be worth recompiling, given the minimal amount of time and money it takes. From there, the gains can only get more significant.

    BTW, thanks for the congratulations.

    By Xcom_Cheetah January 16, 2001, 02:01 PM

    Btw, can you tell me what the advantage of the double-pumped ALU is? Does it just give 4 ALUs in the size of 2 ALUs, or does it give some extra performance boost..?? Actually, I think this is the least-discussed point of the P4 (at least I haven't read much about it..), and if it does give some kind of extra performance, roughly how much is it..??
    Secondly, wouldn't it be better to increase the L1 cache rather than the L2 cache.. although I think adding L2 cache is a difficult thing, still, 8K of L1 cache is just too little... isn't it..??

    By Marsolin January 16, 2001, 02:11 PM

    quote:Originally posted by Xcom_Cheetah:
    Btw, can you tell me what the advantage of the double-pumped ALU is? Does it just give 4 ALUs in the size of 2 ALUs, or does it give some extra performance boost..?? Actually, I think this is the least-discussed point of the P4 (at least I haven't read much about it..), and if it does give some kind of extra performance, roughly how much is it..??

    I think you have it right that it gives more effective ALUs. The two fast ALUs are clocked at twice the core frequency, so each can complete two simple operations per core cycle. It really helps to save die size without sacrificing speed.
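
    As a rough illustration (my own arithmetic, with an example clock speed, not a measured figure):

    code:
    #include <stdio.h>

    /* Back-of-the-envelope throughput of double-pumped ALUs: each fast
     * ALU runs at twice the core clock, i.e. two simple ops per core
     * cycle, so 2 of them behave like 4 single-pumped ALUs. */
    int main(void)
    {
        double core_hz   = 1.5e9;   /* example: 1.5 GHz core */
        int    fast_alus = 2;       /* the P4's double-pumped ALUs */
        double ops = core_hz * 2.0 * fast_alus;
        printf("up to %.0f billion simple ALU ops/s\n", ops / 1e9);  /* 6 */
        return 0;
    }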

    By Marsolin January 16, 2001, 02:17 PM

    quote:Originally posted by Xcom_Cheetah:
    Secondly, wouldn't it be better to increase the L1 cache rather than the L2 cache.. although I think adding L2 cache is a difficult thing, still, 8K of L1 cache is just too little... isn't it..??

    The L1 data cache is as small as it is because of latency. In order to keep the latency to 2 clocks, Intel had to make the L1 data cache small. They must have felt that the tradeoff was worth it.

    By Arcadian January 16, 2001, 04:20 PM

    quote:Originally posted by Marsolin:
    The L1 data cache is as small as it is because of latency. In order to keep the latency to 2 clocks, Intel had to make the L1 data cache small. They must have felt that the tradeoff was worth it.

    Also, the L2 cache in the Pentium 4 is only slightly slower than the L1 cache of most other modern CPUs. Since the L2-to-L1 bandwidth is very large, the L2 effectively feeds the L1 continuously. Obviously, this isn't as fast as having all L1 cache, but overall it helps to maximize performance versus die size and allows for a lot of growth headroom.
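
    To put a rough number on "very large" (my own back-of-the-envelope figure, assuming a 256-bit transfer path between L2 and L1 clocked at core speed):

    code:
    #include <stdio.h>

    /* Back-of-the-envelope L2->L1 bandwidth, assuming a 256-bit (32-byte)
     * transfer path clocked at the core frequency -- illustrative only. */
    int main(void)
    {
        double core_hz   = 1.4e9;   /* example: 1.4 GHz Pentium 4 */
        double bus_bytes = 32.0;    /* 256 bits per transfer */
        printf("%.1f GB/s\n", core_hz * bus_bytes / 1e9);   /* ~44.8 */
        return 0;
    }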

    By Moridin January 16, 2001, 04:52 PM

    quote:Originally posted by Arcadian:
    Also, the L2 cache in the Pentium 4 is only slightly slower than the L1 cache of most other modern CPUs. Since the L2-to-L1 bandwidth is very large, the L2 effectively feeds the L1 continuously. Obviously, this isn't as fast as having all L1 cache, but overall it helps to maximize performance versus die size and allows for a lot of growth headroom.

    I was going to say its L2 is faster than the L1 of all non-x86 CPUs except some of the faster Alphas. But then I thought of a few more exceptions.

    By Xcom_Cheetah January 17, 2001, 08:57 AM

    What does associative mean when used with L1 cache..? I mean, sometimes it's written that the L1 is 4-way associative or 8-way associative.. can anyone please explain it to me..??

    thanks

    By Angelus January 17, 2001, 09:13 AM

    quote:Originally posted by Xcom_Cheetah:
    What does associative mean when used with L1 cache..? I mean, sometimes it's written that the L1 is 4-way associative or 8-way associative.. can anyone please explain it to me..?? thanks

    You can find it on the same site as the P4 article: http://www.systemlogic.net/articles/00/10/cache/

    By Arcadian January 17, 2001, 11:13 AM

    quote:Originally posted by Xcom_Cheetah:
    What does associative mean when used with L1 cache..? I mean, sometimes it's written that the L1 is 4-way associative or 8-way associative.. can anyone please explain it to me..??

    thanks

    The easiest way to explain this is that associativity changes the way the space in the cache is devoted to data. Higher associativity means higher hit rates (better), but also higher latencies (worse), because of the higher complexity.

    A 1-way associative cache, also known as direct mapped, means that when you read a cacheline from memory that maps to the same set as a line already in the cache, it evicts the line that was already there.

    Higher associativity keeps the previous data around. 4-way associativity, for example, means that 4 cachelines are kept in each set at one time, and reading a 5th cacheline evicts the cacheline that hasn't been used for the longest amount of time (in a well-designed cache).

    Another variant is a fully associative cache, which never evicts a line unless the cache is completely full. Every cacheline in memory maps to the same (single) set, so you get the best hit rates. Fully associative caches only work well when the cache is extremely small, because a lookup has to search through the entire cache to find data, which means very high latencies.

    If I've explained this confusingly, let me know and I'll try again, but I hope this helps.
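
    If seeing it in code helps, here's a toy sketch (mine, not any real CPU's design) of one 4-way set evicting by least-recently-used:

    code:
    #include <stdio.h>

    #define WAYS 4

    /* Toy model of ONE 4-way cache set with LRU replacement.  age[i]
     * counts accesses since way i was last touched; the miss victim
     * is the way with the largest age. */
    static unsigned long tags[WAYS];
    static int valid[WAYS];
    static int age[WAYS];

    /* Returns 1 on a hit, 0 on a miss (which fills the LRU way). */
    int access_set(unsigned long tag)
    {
        int i, victim = 0;

        for (i = 0; i < WAYS; i++)
            age[i]++;                          /* everything gets older */

        for (i = 0; i < WAYS; i++)
            if (valid[i] && tags[i] == tag) {
                age[i] = 0;                    /* hit: most recently used */
                return 1;
            }

        for (i = 1; i < WAYS; i++)             /* miss: pick the oldest way */
            if (age[i] > age[victim])
                victim = i;
        tags[victim] = tag;
        valid[victim] = 1;
        age[victim] = 0;
        return 0;
    }

    int main(void)
    {
        int r1 = access_set(0x12);   /* miss: set was empty */
        int r2 = access_set(0x12);   /* hit: same tag again */
        int r3 = access_set(0x34);   /* miss: fills another way */
        printf("%d %d %d\n", r1, r2, r3);   /* prints: 0 1 0 */
        return 0;
    }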

    By Conrad Song January 17, 2001, 11:14 PM

    quote:Originally posted by Xcom_Cheetah:
    What does associative mean when used with L1 cache..? I mean, sometimes it's written that the L1 is 4-way associative or 8-way associative.. can anyone please explain it to me..??

    thanks

    Associativity determines which cache blocks a given memory address is allowed to be cached in. Let's get the terminology down:

    Each cache is divided into fixed-size chunks called blocks. Blocks are grouped, based on the associativity, into sets. Thus, for a 4-way associative cache, there are 4 cache blocks per cache set.

    The rule is: for any memory address, the data at that address is only allowed to be cached in one cache set. That is, only one cache set can potentially hold the contents of a given memory address.

    Therefore, if I'm looking for address 0x12ff340, there is only one cache set that I need to search to see if it resides in the cache. If my cache is 4-way associative, then I have to search through 4 blocks to see if it resides in the cache. In an 8-way associative cache, I have to search through 8 blocks.

    Typically, because of temporal locality, the smaller the set-associativity, the lower the cache hit rate. However, higher associativity adds more transistors and takes longer to search for a cache hit.

    Note that the law of diminishing returns applies here, so going from 1-way to 2-way shows a much bigger improvement than going from 4-way to 8-way... usually...
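
    In code, the lookup I'm describing, picking the one set an address can live in, looks roughly like this (my own sketch; the sizes are invented for illustration: an 8KB cache with 64-byte blocks and 4 ways, so 128 blocks in 32 sets):

    code:
    #include <stdio.h>

    #define BLOCK_SIZE 64   /* invented: 64-byte blocks */
    #define NUM_SETS   32   /* invented: 8KB / 64B / 4 ways = 32 sets */

    int main(void)
    {
        unsigned long addr  = 0x12ff340UL;        /* the example address above */
        unsigned long block = addr / BLOCK_SIZE;  /* which memory block */
        unsigned long set   = block % NUM_SETS;   /* the ONE set to search */
        unsigned long tag   = block / NUM_SETS;   /* identifies the block in that set */

        printf("set %lu, tag 0x%lx\n", set, tag);
        /* The lookup compares 'tag' against the tags stored in that set:
         * 4 comparisons for 4-way, 8 for 8-way, and so on. */
        return 0;
    }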

