Original Link: https://www.anandtech.com/show/661
Intel Pentium 4 1.4GHz & 1.5GHz
by Anand Lal Shimpi on November 20, 2000 12:54 AM EST - Posted in CPUs
History has shown us that every 3 – 4 years Intel releases a new microprocessor architecture. The longest Intel has gone between micro-architecture generations was the four years between the release of the 8086 in 1978 and the 286 in 1982. If you look at the fact that the last micro-architecture Intel released was back in 1995 with the Pentium Pro (P6 micro-architecture) we are slightly overdue for a new micro-architecture from Intel.
We’ve been hearing about this new micro-architecture and the first processor to use it for quite some time now. We were first introduced to what was then called the Willamette core 9 months ago at the Spring 2000 Intel Developer Forum. Already running at 1.5GHz back then, this new architecture and the accompanying processor could be just what Intel needed to get back on track.
It’s amazing how quickly the industry can turn from being dominated almost completely by a single CPU manufacturer to a point where the underdog is now in a position to lead the market into the 21st century. Over the past 12 – 18 months we have seen this very situation occur right in front of our own eyes. Intel, a manufacturer never associated with delays or processor shortages, and AMD, a manufacturer that was associated with sub-par performance and an inability to deliver on time, essentially switched roles in the past year alone.
Now AMD is at a point where they are being taken very seriously by the industry and Intel is in a position where they have to fight to regain a lot of lost ground. AMD’s weapon to get to the top has been their Athlon and a focus on the mainstream and performance market segments (the Duron has yet to break into the retail value market segment). Intel’s Pentium III used to be its flagship; unfortunately, it is stuck at the 1GHz mark until the end of the first half of 2001. This paves the way for Intel’s new micro-architecture, what they like to call the NetBurst Architecture, and the first IA-32 processor to make use of it, the Pentium 4.
Today Intel is introducing the first two members of the Pentium 4 family, the 1.4GHz and 1.5GHz parts, which not only mark the first two x86 CPUs that make use of the NetBurst Architecture but they are currently the two highest clocked x86 CPUs available. And today, our job is to explain the performance of the Pentium 4 and give you the thumbs up or the thumbs down you’re here for.
Let’s get started.
NetBurst: Architecture for the future?
A constant theme in the marketing surrounding Intel’s Pentium 4 is that its architecture is paving the way for the next generation of computing. This is true in more ways than one.
From the Intel marketing side this is telling you not to pay attention to the Pentium 4’s performance today, but pay attention to its performance down the road.
From our perspective this brings up the first advantage of the Pentium 4 over its predecessor, the ability to ramp up in clock speed. In fact, this is quite possibly one of the only ways the Pentium 4 will succeed, not at the “low” clock speeds it is being launched at, but down the road at higher frequencies which it should be able to attain reasonably well, at least when compared to the Pentium III it is replacing.
From the perspective of you, the consumer, this should raise a flag. If the NetBurst architecture is supposed to provide a path to the future of computing, does it make sense to adopt the architecture today? You don’t buy a CPU based on what its performance will be 6 – 9 months from now, you buy it for its useable performance today and with the hope that it will last you beyond that 6 - 9 month period.
Keep that in mind as we take a look at the Pentium 4, since we will be taking two stances on the CPU: one analyzing its forward-looking performance, and one from the position of the consumer with money to spend now on a system. The former will help to give us a clue as to whether or not the Pentium 4 stands a chance as time goes on, and the latter gives us the recommendation you all are here for in the first place.
The pieces of the NetBurst Pie
As we diagrammed in our August 2000 article on Intel’s NetBurst Architecture, the micro-architecture is composed of a handful of buzzwords that represent more complex features of the technology.
The NetBurst micro-architecture is comprised of, according to Intel, the four following new features: Hyper Pipelined Technology, Rapid Execution Engine, Execution Trace Cache and a 400MHz system bus.
In addition to those four new features, Intel is boasting four improvements over the P6 micro-architecture that NetBurst is replacing. These improvements, once again according to Intel, are as follows: Advanced Dynamic Execution, Advanced Transfer Cache, Enhanced Floating Point & Multimedia Unit, and Streaming SIMD Extensions 2. We’re going to tackle each one individually and give you the heads up as to what these terms really mean and their impact, both positive and negative, on the performance of the Pentium 4.
Hyper Pipelined Technology
The problem Intel faced with the P6 micro-architecture was that they needed to increase the clock speed of their P6 based processors; however, without performing another die shrink on the core of the Pentium III, they were already hitting the limits of that core. This was evident from the problems Intel encountered when attempting to produce a 1.13GHz Pentium III. Unfortunately, with their 0.13-micron fabrication process still months away from being ready to use on mass produced processors, Intel needed something else to help keep the clock speed up.
We have seen this trend throughout history. The Pentium Classic and the Pentium MMX, both based on the P5 micro-architecture, maxed out at 233MHz in desktop configurations and 266MHz in mobile setups. The Pentium Pro, a P6 processor, ended up reaching a 200MHz ceiling; however, moving the L2 cache off-die, later followed by a die shrink, allowed its successor, the Pentium II (a P6 processor as well), to reach clock speeds as high as 450MHz.
It’s easy to recommend that a CPU manufacturer just employ a die shrink in order to increase clock speed, however it’s something easier said than done. Introducing a new fabrication process is quite expensive (those manufacturing plants cost billions to construct, you can imagine the cost on introducing a new manufacturing process) and it’s often times not a viable solution to the clock speed issue since it takes quite a bit of time to bring a new process up to speed and actually get the yields high enough to make it a profitable move.
This brings up that other method of increasing clock speed that we alluded to before. Instead of shrinking the die, why not make the CPU do less? If you make a CPU do less per clock, it’s able to ramp up to higher overall clock speeds. And theoretically, if you can get the numbers to work out, the sacrifice you make in terms of the amount the CPU can do per clock is more than made up for by the fact that this CPU of yours can now run at much higher clock speeds.
Doing “less” per clock is an oversimplified way of stating that you should increase the number of stages in the processor’s pipeline. The deeper the pipeline, the more stages an instruction must go through before reaching the end of the pipeline, thus you’re accomplishing less per clock. The original Pentium featured a fairly short pipeline by today’s standards, composed of “only” five stages. The Pentium Pro and later on the Pentium II/III made use of the P6 micro-architecture, whose pipeline featured twice as many stages, for a total of 10. The NetBurst micro-architecture that the Pentium 4 is based around doubles the length of the pipeline yet again, making it 20 stages deep.
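To put rough numbers on the idea, here is a toy model of our own, not anything Intel has published. It assumes the total logic delay of an instruction is split evenly across the stages and that every stage adds a small fixed latch overhead; the 10ns and 0.1ns figures and the function name are placeholders chosen purely for illustration.

```python
def max_clock_mhz(stages, total_logic_ns=10.0, latch_overhead_ns=0.1):
    """Rough upper bound on clock speed for a pipeline of the given depth.

    total_logic_ns and latch_overhead_ns are hypothetical values.
    """
    # Each stage handles an equal slice of the logic plus a fixed overhead.
    stage_delay_ns = total_logic_ns / stages + latch_overhead_ns
    # A period in nanoseconds converts to MHz as 1000 / period.
    return 1000.0 / stage_delay_ns


for depth in (5, 10, 20):
    print(f"{depth:2d} stages: ~{max_clock_mhz(depth):.0f} MHz in this toy model")
```

The point is only the shape of the curve: each stage does less work, so the clock can tick faster, but the fixed overhead per stage means the gains shrink as the pipeline gets deeper.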
This 20-stage pipeline is what Intel is calling their Hyper Pipelined Technology.
Remember how we mentioned that a redeeming quality of the Pentium 4 may be its ability to ramp up to higher clock speeds? The Hyper Pipelined Technology is how Intel is planning on doing just that.
So if increasing the depth of a processor’s pipeline is an easier way of paving the way for higher clock speeds, why not jump to a 100 stage pipeline right away? Just like most everything in life, there is a pretty big downside to this otherwise beautifully painted picture.
Modern day CPUs attempt to increase the efficiency of their pipelines by predicting what they will be asked to do next; this is a simplified explanation of branch prediction. When a processor predicts correctly, everything goes according to plan, but when an incorrect prediction is made the processing cycle must start all over at the beginning of the pipeline. Because of this, a processor with a 10 stage pipeline has a lower penalty for a mis-predicted branch than a processor with a 20 stage pipeline. The longer the pipeline, the further back in the process you have to start over in order to make up for a mis-predicted branch.
This puts Intel in an interesting situation: they have to accept a bigger penalty for a mis-predicted branch in order to reach higher clock speeds. To lessen that penalty, the next two features of Intel’s NetBurst micro-architecture come into play.
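A back-of-the-envelope sketch shows how the penalty scales with pipeline depth. It assumes, as a simplification, that a mis-predicted branch costs one clock per pipeline stage; the 20% branch fraction and 10% mis-prediction rate are made-up workload figures, not measurements of either processor.

```python
def cycles_per_instruction(pipeline_stages, branch_fraction, mispredict_rate,
                           base_cpi=1.0):
    """Average clocks per instruction when a mis-prediction refills the pipeline."""
    # Simplifying assumption: the penalty equals the full pipeline depth.
    penalty_cycles = pipeline_stages
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 10% of those miss.
for stages in (10, 20):
    cpi = cycles_per_instruction(stages, branch_fraction=0.20, mispredict_rate=0.10)
    print(f"{stages}-stage pipeline: {cpi:.2f} clocks per instruction")
```

With identical prediction accuracy, the deeper pipeline pays twice the refill cost per miss, which is why a better branch predictor matters so much more to a 20-stage design.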
Rapid Execution Engine
Just two hours into his presentation at the Spring 2000 IDF we were floored when Albert Yu mentioned that the Pentium 4’s Arithmetic Logic Units (it has 2) would operate at twice the operating frequency of the CPU. Meaning that for the 1.5GHz Pentium 4 that Mr. Yu was demonstrating, the integer units would be running at 3GHz. At a first look this would seem to indicate that the Pentium 4 and all other NetBurst based processors would have the absolute highest level of integer performance.
With the Pentium 4’s integer units running at 3GHz, how could AMD even begin to compete?
Fortunately for AMD, Intel had to double pump (Intel’s term for the 2x clocking of the units) the ALUs in order to deliver integer performance that was at least equal to that of a lower clocked Pentium III. Confused? Let’s take a look at the Hyper Pipelined Technology behind the Pentium 4 again, and this time let’s see how it affects integer performance.
The Pentium 4 has a very advanced branch predictor that can help to avoid any mis-predicted branches that may occur in the later stages of its pipeline. The Pentium 4’s branch predictor is actually much more advanced than the Athlon’s; unfortunately, regardless of how advanced it is, you can’t predict something that is generally unpredictable. This is the case when it comes to integer instructions.
The nature of integer instructions is that predicting branches when dealing with these types of operations is quite difficult. In many cases, when dealing with these integer instructions, as you would when running many business/office level applications, the Pentium 4’s branch predictor will mis-predict a branch, sending the instructions back to the start of the 20 stage pipeline. This penalty is huge compared to what it would be on the Pentium III since it only has a 10-stage pipeline.
Because of this, Intel has been playing down the necessity for high integer performance. If you recall, this is actually the second time they have done this; the first was with the original Celeron, which shipped without an L2 cache and thus performed quite poorly in most integer applications.
While they are correct in stating that performance under Microsoft Word is much less critical than performance under 3D Studio MAX or Quake III Arena for example (since the limiting factor becomes how quickly the user can input data), if you remember from the days of the original Celeron, the business/office user community was quite disappointed in the processor because the benchmarks showed lackluster integer performance.
Since the ALUs are double pumped, as the Pentium 4’s clock speed increases, the integer performance of the processor should begin to distance itself from the Pentium III, since for every 100MHz increase in clock speed the ALUs’ effective operating frequency will increase by 200MHz.
Apparently other portions of the Pentium 4 are also double pumped; combined with the double pumped ALUs, this shows a clear trend towards achieving lower latencies in certain parts of the CPU.
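The arithmetic is simple enough to sketch. The function name and the `pump_factor` parameter are our own labels; the only fact taken from Intel is that the ALUs run at twice the core clock.

```python
def alu_clock_mhz(core_clock_mhz, pump_factor=2):
    """Effective clock of the double-pumped integer ALUs."""
    return core_clock_mhz * pump_factor


for core in (1400, 1500, 1600):
    print(f"core {core} MHz -> ALUs {alu_clock_mhz(core)} MHz")
```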
The Pentium 4’s Cache
We mentioned that there is another “trick” Intel implemented to nullify some of the penalties associated with having a 20-stage pipeline. We just discussed the benefits or rather the necessity of double pumping the Pentium 4’s integer units among other parts of the CPU, now it’s time to talk about another feature of Intel’s NetBurst micro-architecture.
The Pentium 4’s branch target buffer, the area in which the branch predictor gathers the data it uses to predict branches, is eight times as large as that of the Pentium III. This is part of why the Pentium 4 has such a high prediction rate, but even taking that into account, the percentage of mis-predicted branches (as small as it may be) can seriously impact performance.
We mentioned in our article on Intel’s NetBurst micro-architecture that the Pentium 4 will feature a small 8KB L1 data cache. This is exactly half the size of the L1 data cache of the Pentium III (16KB), so why the reduction in size? Smaller caches have lower latencies so in part it was an attempt to decrease the latency of the L1 cache. In comparison, while the Athlon’s 2-way set associative 64KB L1 Data Cache has a better hit rate (larger caches have better hit rates) it has a 50% higher latency (3 clocks vs 2 clocks).
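The latency versus hit rate trade-off can be sketched with the usual average access time formula. The 2 and 3 clock latencies come from the comparison above; the hit rates and the 10 clock miss penalty are hypothetical placeholders, so the output illustrates the trade-off rather than predicting which CPU wins.

```python
def average_access_clocks(hit_latency, hit_rate, miss_penalty):
    """Average L1 access time: every access pays the hit latency, misses pay extra."""
    return hit_latency + (1.0 - hit_rate) * miss_penalty


# Hit rates and miss penalty below are made-up values for illustration only.
small_fast = average_access_clocks(hit_latency=2, hit_rate=0.92, miss_penalty=10)
large_slow = average_access_clocks(hit_latency=3, hit_rate=0.97, miss_penalty=10)

print(f"small, 2-clock cache: {small_fast:.2f} clocks on average")
print(f"large, 3-clock cache: {large_slow:.2f} clocks on average")
```

Change the assumed hit rates or the miss penalty and the result can flip, which is exactly why the speed of the next cache level matters so much to a design with a small L1.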
Unfortunately not all programs can fit in this L1 cache, so the Pentium 4’s L2 cache comes into play and must be fairly low latency for performance sake. We know from the introduction of the Pentium III’s Coppermine core that Intel’s on-die L2 cache is superior to that found on the Athlon’s Thunderbird core. The reason behind this is that the L2 cache has a much wider data path on the Pentium III than on the Athlon (256-bit vs 64-bit on the Thunderbird). With the Pentium 4, the L2 cache subsystem gets even better.
Again, remember that Intel’s goal here is to reduce latency while keeping cache hit rate high. By taking the Pentium III’s L2 cache and allowing it to transfer data on every clock, the Pentium 4’s L2 cache is a lower latency and higher bandwidth L2 cache than the Advanced Transfer Cache found on the Pentium III. At 1.5GHz, the Pentium 4’s L2 cache offers a 48GB/s throughput while a theoretical 1.5GHz Pentium III would only offer 24GB/s of available bandwidth. In comparison, a 1.5GHz Athlon (Thunderbird core) would only have 6GB/s of available bandwidth to its L2 cache because of its 64-bit L2 cache data path.
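The bandwidth figures above fall out of a simple calculation, sketched here. The function and parameter names are ours. The Pentium III transferring on every other clock is implied by the comparison above; the Athlon transferring on every other clock is our assumption, chosen because it reproduces the 6GB/s figure.

```python
def l2_bandwidth_gb_s(clock_ghz, path_bits, clocks_per_transfer=1):
    """Peak L2 bandwidth in GB/s for a given clock, data path width and transfer rate."""
    bytes_per_transfer = path_bits / 8
    return clock_ghz * bytes_per_transfer / clocks_per_transfer


pentium4 = l2_bandwidth_gb_s(1.5, 256)                           # every clock
pentium3 = l2_bandwidth_gb_s(1.5, 256, clocks_per_transfer=2)    # every other clock
athlon = l2_bandwidth_gb_s(1.5, 64, clocks_per_transfer=2)       # assumed every other clock

print(pentium4, pentium3, athlon)
```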
Let’s get back to the issue of dealing with the possibility of a mis-predicted branch. A part of Intel’s NetBurst micro-architecture is the presence of what they’re calling an Execution Trace Cache.
The decoder of any x86 CPU (what takes the fetched instructions and decodes them into a form understandable by the execution units) has one of the highest gate counts out of all of the pieces of logic in the core. This translates into quite a bit of time being spent in the decoding stage when preparing to process an instruction either for the first time or after a branch mis-prediction.
The Execution Trace Cache acts as a middle-man between the decoding stage and the first stage of execution after the decoding has been complete. The trace cache essentially caches decoded micro-ops (the instructions after they have been fetched and decoded, thus ready for execution) so that instead of going through the fetching and decoding process all over again when executing a new instruction, the Pentium 4 can just go straight to the trace cache, retrieve its decoded micro-op and begin execution. On the Pentium 4, the 8-way set associative Trace Cache is said to be able to cache approximately 12K micro-ops.
This helps to hide the penalties associated with a mis-predicted branch later on in the Pentium 4's 20-stage pipeline. Another benefit of the trace cache is that it caches the micro-ops in the predicted path of execution, meaning that if the Pentium 4 fetches 3 instructions from the trace cache they are already presented in their order of execution. This adds potential for an incorrectly predicted path of execution of the cached micro-ops however Intel is confident that these penalties will be minimized because of the prediction algorithms used by the Pentium 4.
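As a very loose illustration of the idea, here is a toy model that caches decoded micro-ops by instruction address. It is our own sketch, not a description of Intel's implementation: the class name, the fake decoder and the example instructions are all hypothetical, and it ignores capacity limits, associativity and the way a real trace cache stores micro-ops along the predicted path.

```python
class ToyTraceCache:
    """Caches decoded micro-ops by instruction address (illustrative only)."""

    def __init__(self):
        self.decoded = {}      # address -> list of micro-ops
        self.decode_count = 0  # how many times the slow decoder ran

    def decode(self, instruction):
        # Stand-in for the expensive x86 decode step.
        self.decode_count += 1
        return [f"uop:{part}" for part in instruction.split()]

    def fetch(self, address, instruction):
        # Hit: reuse micro-ops decoded earlier. Miss: decode once and store.
        if address not in self.decoded:
            self.decoded[address] = self.decode(instruction)
        return self.decoded[address]


cache = ToyTraceCache()
loop_body = [(0x100, "add eax ebx"), (0x104, "cmp eax ecx"), (0x108, "jne 0x100")]

for _ in range(1000):  # run the same three-instruction loop 1000 times
    for address, instruction in loop_body:
        cache.fetch(address, instruction)

# Each of the three instructions went through the decoder only once.
print(cache.decode_count)
```

After the first pass through the loop the decoder sits idle, which is the effect the trace cache is after when the pipeline has to restart following a mis-predicted branch.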
SSE2: The other key to the Pentium 4’s success?
We mentioned earlier that one of the keys to the Pentium 4’s success will be ramping up production of higher clock speed versions of the processor. Since the CPU will depend on its ability to reach a high clock speed in order to remain competitive, there should be a strong focus on that part of the Pentium 4’s development.
There is another trick that Intel has up their sleeves and that is the follow-up to Intel’s Streaming SIMD Extensions (SSE). As you’ll remember from our original Pentium III Review, the acronym SIMD stands for Single Instruction Multiple Data and it is a technology that allows a single instruction to be applied to multiple datasets simultaneously. This comes in handy when dealing with fairly repetitive operations on different sets of data. The example we’ve always quoted is the transformation of a polygon in mathematical space (a list of numbers) to a polygon in 3D space for display on the screen. This process requires quite a bit of matrix mathematics which, to those that have had a basic introduction to matrix math, is highly repetitive. SIMD-FP (Floating Point) extensions can help speed up this transformation process by taking the multiplication, addition and reciprocal functions required by the process and apply them to the multiple datasets involved at once.
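The idea can be sketched in a few lines. Plain Python lists stand in for packed registers here, and each function call stands in for a single SIMD instruction; the four-lane width and the function names are illustrative choices of ours, not a model of any specific instruction.

```python
def packed_mul(a_lanes, b_lanes):
    """One 'instruction': multiply every lane of a by the matching lane of b."""
    return [a * b for a, b in zip(a_lanes, b_lanes)]


def packed_add(a_lanes, b_lanes):
    """One 'instruction': add every lane of a to the matching lane of b."""
    return [a + b for a, b in zip(a_lanes, b_lanes)]


# Scale four x coordinates by 2.0 and shift them by 1.0 with two packed
# operations instead of eight scalar ones.
xs = [1.0, 2.0, 3.0, 4.0]
transformed = packed_add(packed_mul(xs, [2.0] * 4), [1.0] * 4)
print(transformed)
```

A scalar FPU would need a separate multiply and add for each coordinate; the packed version issues the same two operations once for the whole group.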
The first incarnation of SIMD instructions came with the introduction of Intel’s 57 MMX instructions back in 1997 which are essentially a set of SIMD-Int (integer) instructions. The first incarnation of Floating Point SIMD instructions came in the form of AMD’s 3DNow! enhancements with the K6-2 processor. However with the release of the Athlon, 3DNow! became virtually useless since the Athlon had such a powerful FPU in comparison to the K6-X series it was replacing. Intel got on the ball months later with their first SIMD-FP extensions to the x86 instruction set and called them their Streaming SIMD Extensions, or SSE for short.
Intel’s SSE instructions actually proved to impact performance considerably in certain situations. And Intel is following up the success of SSE with the 144 new instructions that make up SSE2. These instructions offer an extension to MMX as well as SSE. SSE2 enables the Pentium 4 to handle two 64-bit SIMD-INT operations and two double precision 64-bit SIMD-FP operations. This is in contrast to the two 32-bit operations MMX and SSE used to be able to handle. The benefit of being able to handle two 64-bit operations through SSE2 is greater performance and, in the case of SIMD-FP instructions, the ability to handle greater precision floating point calculations, which is very important when dealing with more professional level applications.
The problem with SSE2 is that these instructions will require software support in order for them to be taken advantage of, meaning that previous applications will not benefit from them. Luckily, as we have seen with SSE, Intel shouldn’t have a problem drumming up support for SSE2 and the next generation of applications and games should boast SSE2 optimizations.
Another positive note for SSE2 is that AMD will be supporting SSE2 with their upcoming 64-bit processors that are due to be released at the end of next year or the beginning of 2002. This should give the standard even more power behind it and will allow AMD to benefit from the same performance enhancements that will help the Pentium 4.
With the incredible power of SSE2 at the Pentium 4’s fingertips, keep an eye on how gaming performance changes over the next few months. An SSE2 optimized game could gain a hefty performance boost on the Pentium 4, as could just about any other SSE2 optimized application. Luckily for AMD, Intel will do most of the work involved in getting the SSE2 instructions supported; AMD will just have to support the instructions and their next generation processors will be able to reap all the benefits.
The Interface
The Pentium 4 interfaces with the motherboard using a new socket, named Socket-423. We strongly suggest you read our article on Intel’s Desktop CPU & Chipset Roadmap for 2001, where we show that Intel won’t be sticking to the Socket-423 interface for long. In fact, after the first half of 2001 the Pentium 4 will start to be available in both a Socket-423 and a Socket-478 version. The latter is due to replace Socket-423 much like Socket-370 replaced Slot-1. For those of you that currently have Slot-1 CPUs, do you want to be in the same situation with the Pentium 4 as you currently are with your CPU?
In contrast, AMD plans to stick with Socket-A for at least a little while longer. The next generation Athlon and Duron CPUs, according to AMD, will work in current Socket-A boards however you will obviously need a new board with 266MHz (133MHz DDR) FSB support if you want to get the most out of a 266MHz FSB CPU.
Without looking at benchmarks, the short lived nature of the Socket-423 interface may be reason enough to hold off on a Pentium 4 purchase for a few more months if you are set on that path.
Making the Chip
The Pentium 4 is still based on the same 0.18-micron process that the Pentium III has been using since this time last year. Intel is still employing Aluminum interconnects and they will continue to do so until closer to the end of next year.
At IDF Intel let us know that by the end of 2001 all of their CPUs should be using Copper interconnects which could mean that when the Pentium 4 is manufactured on Intel’s 0.13-micron process it’ll get Copper interconnects as well. AMD has been using Copper interconnects for a while now (all of their Dresden manufactured CPUs use Copper), the main benefit is that it helps you attain higher clock speeds and with the Pentium 4 geared towards higher clock speeds, implementing a 0.13-micron fabrication process and Copper interconnects could be critical to its success.
Since the Pentium 4 is based on the aging 0.18-micron process yet has such a complex design (the Hyper Pipelined Technology takes up some pretty expensive real estate) the processor itself ends up being huge. The CPU is composed of 42 million transistors, which is even more than the current Athlon’s 37 million. However, while the Athlon has a 120mm^2 die, the Pentium 4 features a 217mm^2 die, which is roughly 1.8 times the size of the Athlon’s die.
This unfortunately means that the yields on the Pentium 4 could be low since the larger the die the lower the yield. However, as we pointed out in our analysis of Intel’s Desktop CPU & Chipset Roadmap for 2001, with the Pentium 4 scheduled to hit 2GHz in the third quarter of next year and the Pentium 4 itself to become a mainstream processor, Intel seems to be pretty confident in their ability to produce the processor at high enough yields to make the transition to the mainstream market.
To the right you can see the actual Pentium 4 core. Kind of big isn't it?
This incredible die size also means that Intel won’t be able to produce as many Pentium 4 chips per wafer as they did with the Pentium III. This in turn makes the Pentium 4 more expensive to manufacture, however as we pointed out, Intel wants to aggressively ramp up production of the Pentium 4 so they may end up taking a hit on price in order to get it out there.
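To get a feel for the difference, here is a rough estimate using a common dies-per-wafer approximation (wafer area divided by die area, minus an edge-loss term). The 200mm wafer diameter is our assumption, and the result ignores defects entirely, so treat it as a ballpark for gross die count and not a yield figure.

```python
import math


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=200.0):
    """Approximate candidate dies per wafer, ignoring defects.

    wafer_diameter_mm is an assumed value, not a figure from Intel or AMD.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # Partial dies around the wafer edge are lost; this term approximates them.
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


athlon_dies = gross_dies_per_wafer(120)    # 120mm^2 die
pentium4_dies = gross_dies_per_wafer(217)  # 217mm^2 die

print(f"Athlon: ~{athlon_dies} dies, Pentium 4: ~{pentium4_dies} dies per wafer")
```

Under these assumptions the Pentium 4 gets roughly half as many candidate dies out of each wafer, before any are lost to defects.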
The Pentium 4 definitely needs to be on Intel’s 0.13-micron process, next year can’t seem to come quick enough.
With a huge die and already running at 1.5GHz you’d expect the Pentium 4 to produce quite a bit of heat. Interestingly enough it “only” dissipates 52W of heat at 1.5GHz. Compared to the thermal characteristics of AMD’s Athlon, this seems quite low. The Pentium III at 1GHz dissipates 33W and the Athlon at 1GHz dissipates 54W. At 1.4GHz the Athlon (Thunderbird) is supposed to be producing between 68 and 76W of heat. Hopefully the new Athlon core which is due out in the first quarter of 2001 will be much cooler running.
Because of the sheer size of the Pentium 4’s core Intel employed an integrated heat spreader in order to take the concentrated heat being produced and spread it over a larger surface area. This makes it able to dissipate heat in a more effective manner, much like the integrated heat spreaders present on RDRAM RIMMs.
Constructing Mt. Everest: The Pentium 4’s Heatsink
We just finished mentioning that the Pentium 4 produces less heat than a lower clocked Athlon, but Intel refrained from sticking with the conventional cooling methods they employed for all Socket-370 processors and what AMD is using for their Socket-A CPUs. Instead Intel is debuting a new heatsink retention mechanism that will help to avoid the dreaded crushed core syndrome that some Athlon/Duron owners have seen in recent times.
Let’s take a look at what it takes to assemble the Pentium 4’s heatsink:
First the Heatsink Retention Mechanism is screwed into the motherboard and into the case as seen below. If you've ever installed a Xeon, it's very similar to that retention mechanism, except you're dealing with a socketed CPU, not one on a processor card.
Getting both retention bases installed isn't a problem:
Now let's have a look at the heatsink itself:
Of course retail heatsinks won't look exactly like this but they will be similar in size.
The next step is to place the heatsink on the platform after applying a decent amount of thermal compound. There is no need to put any pressure on the heatsink at this stage.
Using the two retention clips, the heatsink clamps down onto the retention base that you installed earlier.
The final step is putting the fan on the whole thing.
And now we have the finished product:
A new CPU requires a new chipset and a new bus
Once known to us only as “Tehama” we now know that the i850 chipset will be the platform the Pentium 4 will run on, and could well be the reason for its slow adoption, at least at first.
The i850 chipset isn’t much different from the little used i840 chipset. The chipset features AGP 4X support, it interfaces with Intel’s ICH2 chip that provides Ultra ATA/100 support and the only real difference between it and the i840 is that it doesn’t support multiple processors and it supports the Pentium 4’s AGTL+ bus.
The difference between the Pentium 4’s AGTL+ bus and the AGTL+ bus used by the Pentium III is that the Pentium 4’s bus is a quad pumped 100MHz bus effectively running at 400MHz. Since it is based off the same bus that we are used to from Intel, it is still a shared protocol meaning that when the multiprocessor version of the Pentium 4 hits (codename Foster) each CPU will share the 3.2GB/s of available bandwidth provided by the 400MHz FSB. AMD on the other hand uses the EV6 bus which uses a point to point protocol that gives each CPU a dedicated 200 – 400MHz path to the North Bridge. Intel’s shared protocol is cheaper for motherboard manufacturers to implement but of course there is a performance reduction that is paid for it. For now this doesn’t matter since the Pentium 4 is strictly a uni-processor part, but it will matter when its SMP counterpart, Foster, is released.
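The 3.2GB/s figure can be reproduced with a short calculation. The 64-bit data bus width is our assumption (it is the width that makes the numbers work out); the 100MHz base clock and four transfers per clock come from the description above.

```python
def bus_bandwidth_gb_s(base_clock_mhz, transfers_per_clock, bus_width_bits):
    """Peak bus bandwidth in GB/s."""
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * (bus_width_bits / 8) / 1e9


# Quad pumped 100MHz bus; the 64-bit width is an assumption.
pentium4_fsb = bus_bandwidth_gb_s(100, 4, 64)
print(f"Pentium 4 FSB: {pentium4_fsb:.1f} GB/s")
```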
Since the i850 has its roots in the i840 chipset, it also happens to share the same dual channel RDRAM memory controller as the i840. Unfortunately this means that the Pentium 4 will have no SDRAM support at its launch. Chances are that VIA will be the first out with an SDRAM and possibly a DDR SDRAM chipset for the Pentium 4, but at least we know that Intel is planning a chipset with support for both standards. The only problem is that they aren’t scheduled to release this chipset until after Q3/2001.
With two RDRAM channels the latency issues surrounding RDRAM are lessened and the amount of available bandwidth doubled; however, it requires that you install RIMMs in pairs. This means that i850 boards will have 4 RIMM slots, however since each RDRAM channel is only 16-bits wide it shouldn’t be too expensive for a motherboard manufacturer to implement. This is still one of the benefits of RDRAM from a layout perspective. If you’ve noticed, most DDR boards are shipping with 3 or sometimes just 2 DIMM slots because of the number of traces required for a 64-bit data path to the North Bridge capable of handling such high transfer rates.
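The same kind of arithmetic shows why two channels are a natural fit for this platform. The sketch assumes PC800 RDRAM performs 800 million transfers per second on each 16-bit channel; the function and parameter names are ours.

```python
def rdram_bandwidth_gb_s(channels, channel_width_bits=16, transfers_mhz=800):
    """Peak RDRAM bandwidth in GB/s (PC800 assumed to mean 800 million transfers/s)."""
    bytes_per_transfer = channel_width_bits / 8
    return channels * transfers_mhz * 1e6 * bytes_per_transfer / 1e9


single_channel = rdram_bandwidth_gb_s(1)
dual_channel = rdram_bandwidth_gb_s(2)
print(f"one channel: {single_channel:.1f} GB/s, two channels: {dual_channel:.1f} GB/s")
```

Under that assumption, two channels together deliver the same 3.2GB/s of peak bandwidth quoted for the Pentium 4’s 400MHz FSB.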
With Intel involved in an agreement with Rambus they are in a difficult position. They have to promote the Pentium 4, however doing so will require that they essentially promote Rambus. There is no question about it, Intel will have to promote the Pentium 4 and in an effort to push sales of the processor and remove the price of RDRAM from the equation Intel will be bundling two 64MB PC800 RDRAM RIMMs with every boxed Pentium 4 processor. This way, Intel can absorb most of the RDRAM price premium and take away one of the obstacles from owning a Pentium 4. This does not apply to OEM Pentium 4 processors as they won’t come with any RDRAM, which leads us to wonder how big of a price difference will exist between retail (boxed) and OEM Pentium 4 CPUs.
Looking at the actual i850 Memory Controller Hub (MCH) you’ll see from the picture below that it looks more like a processor than an Intel chipset.
The chipset also requires the presence of a heatsink, and the one on Intel’s D850GB Pentium 4 board is quite large in comparison to what we’re used to from chipset heatsinks.
As we mentioned in our latest Intel Roadmap article, the biggest downside to the i850 chipset, aside from its RDRAM only support, is that the chipset is currently priced at close to twice as much as any other chipset. While even the most expensive AMD 760 chipset is selling for $39 (North Bridge & South Bridge), the i850 (MCH + ICH2) is going for an expensive $75. This is going to make Pentium 4 motherboards very expensive; you’re looking at a minimum of $200 for an i850 board.
A new chip requires a new…Case & Power Supply?
Along with the Pentium 4 Intel is introducing support and a need for the new ATX 2.03 specification. This specification basically makes room for the mounting holes for the Pentium 4’s heatsink retention mechanism in the case. This unfortunately means that current cases, without modifying the motherboard tray, won’t work with the Pentium 4 with its heatsink attached. If you’ve really got a lot of money invested in your case you can try and make your own mounting holes by lining up your motherboard and making the appropriate marks to drill through on the tray.
The next big change is with the power supply. The ATX 2.03 spec calls for an ATX12V power supply which supports the additional power connector required by i850 boards. This additional +12V power connector allows for additional power to be supplied to the motherboard around the CPU. With CPUs increasing in clock frequency and drawing more and more power, this helps to keep things stable in an area of the motherboard where current draw is at the highest levels.
Benchmarking the Pentium 4: A warning
We mentioned in our review of the AMD 760 chipset that we would be using SPEC CPU2000 in future reviews. Unfortunately, we were waiting on two things from Intel that have yet to arrive, which prevented us from completing our own SPEC tests for this review. The Intel 5.0 compilers, which were just released last Monday, and the appropriate optimization flags for use with the new compilers and the Pentium 4 platform are still not in our hands; when we do get them we will work on bringing you an analysis of SPEC CPU2000 performance of the Pentium 4 and its competitors.
On the subject of SPEC CPU2000, we’d like to bring you a warning. While at Comdex we were informed that some pre-compiled SPEC CPU2000 binaries were apparently distributed to reviewers for use in a SPEC CPU2000 comparison. We honestly hope that these binaries are not used for a cross platform comparison, since they are compiled with Pentium 4 optimizations alone and could paint a very poor and untrue performance picture for the Athlon. SPEC CPU2000 is a very delicate benchmark and is quite compiler dependent. At AnandTech we make it a point to compile the benchmark under the best conditions for all processors, so if you see any SPEC CPU2000 scores published, make absolutely certain that they have been compiled separately for the Pentium 4 and Athlon platforms. Otherwise the benchmarks are essentially invalid. For the Pentium 4 platform the benchmarks should be compiled using Microsoft Visual Studio 6.0 and Intel’s C/C++ and Fortran 5.0 compilers. For the Athlon platform the benchmarks should be compiled using Microsoft Visual Studio 6.0 with Intel’s C/C++ and Fortran 4.5 compilers as well as with Compaq Visual Fortran 6.5. Both AMD and Intel have submitted the appropriate config files to SPEC at www.spec.org.
If a source doesn’t make its config and compiler information available in its publication, be an active reader and question the results.
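The compiler guidance above can be turned into a quick checklist. The following is only a sketch: the toolchain names are copied from the text, while the dictionary layout, the function names, and the idea of passing in a list of disclosed compilers are assumptions made for illustration, not a format used by SPEC.

```python
# Toolchains recommended in the text for each platform.
RECOMMENDED_TOOLCHAINS = {
    "Pentium 4": {
        "Microsoft Visual Studio 6.0",
        "Intel C/C++ 5.0",
        "Intel Fortran 5.0",
    },
    "Athlon": {
        "Microsoft Visual Studio 6.0",
        "Intel C/C++ 4.5",
        "Intel Fortran 4.5",
        "Compaq Visual Fortran 6.5",
    },
}


def missing_toolchains(platform, disclosed):
    """Return recommended toolchains that a published result does not disclose.

    `disclosed` is a hypothetical list of compiler names taken from a review.
    """
    return RECOMMENDED_TOOLCHAINS[platform] - set(disclosed)


def is_comparable(results):
    """True if every platform in `results` discloses all recommended toolchains.

    `results` maps a platform name to the compilers a publication lists.
    """
    return all(not missing_toolchains(p, d) for p, d in results.items())
```

A result set in which the Athlon scores list only the Pentium 4 toolchain would fail this check, which is the situation the warning describes.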
The Test
Windows 98SE / 2000 Test System
Hardware
CPU(s) | Intel Pentium 4 1.5GHz, Intel Pentium 4 1.4GHz | Intel Pentium III 1GHz | AMD Thunderbird 1.2GHz
Motherboard(s) | Intel D850GB | ASUS CUSL2/Intel OR840/Intel VC820 | ASUS A7V/AMD Corona 760 Reference Board
Memory | 256MB PC133 Corsair SDRAM (Micron -7E CAS2)
Hard Drive | IBM Deskstar 30GB 75GXP 7200 RPM Ultra ATA/100
CDROM | Phillips 48X
Video Card(s) | NVIDIA GeForce 2 GTS 32MB DDR (default clock - 200/166 DDR)
Ethernet | Linksys LNE100TX 100Mbit PCI Ethernet Adapter
Software
Operating System | Windows 98 SE
Video Drivers |
Benchmarking Applications
Gaming | Unreal Tournament 4.32 Reverend's Thunder.dem
Productivity | BAPCo SYSMark 2000
Low Level | Linpack, SiSoft Sandra
With a 50% higher clock speed than the rest of the processors compared in the above chart, the Pentium 4 shows us some interesting aspects of its performance. For starters, while the clock speed advantage should have given the Pentium 4 an incredible lead at the start, we can see that it peaks at the same performance level as a 1GHz Athlon, which was one reason for choosing only 1GHz processors to compare here.
Because of the Athlon's larger L1 Data Cache (64KB vs 8KB on the Pentium 4), the Athlon reaches its peak performance much quicker than the Pentium 4 does. The two different architectures obviously yield two different performance curves, and depending on the size of the data set, the Athlon is faster in some cases and the Pentium 4 in others. Judging by these performance graphs, it wouldn't be surprising to see an Athlon at 1.5GHz offer an even greater level of performance than the Pentium 4, considering that the 1GHz Athlon is coming very close to doing so already.
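To make the cache argument concrete, here is a rough sketch of which level of the memory hierarchy would serve a data set of a given size. The L1 data cache sizes come from the comparison above; the 256KB L2 figure for both chips and the function name are assumptions for illustration, and the model ignores details such as associativity or exclusive versus inclusive caches.

```python
KB = 1024

# L1 data cache sizes are from the text; the 256KB L2 sizes are an assumption.
CACHES = {
    "Pentium 4": {"l1d": 8 * KB, "l2": 256 * KB},
    "Athlon": {"l1d": 64 * KB, "l2": 256 * KB},
}


def serving_level(cpu, working_set_bytes):
    """Return the smallest level ("L1", "L2" or "memory") that holds the data set."""
    cache = CACHES[cpu]
    if working_set_bytes <= cache["l1d"]:
        return "L1"
    if working_set_bytes <= cache["l2"]:
        return "L2"
    return "memory"


for size_kb in (4, 32, 128, 1024):
    levels = {cpu: serving_level(cpu, size_kb * KB) for cpu in CACHES}
    print(f"{size_kb:>5}KB data set -> {levels}")
```

In this simplified model a 32KB data set still fits in the Athlon's L1 but has already spilled to L2 on the Pentium 4, which is one way to picture why the two curves have different shapes.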
What is interesting is what happens after the L1 and L2 caches are filled and the data set begins to grow in size. The latter part of the graph is as close to clock speed independent as possible, since it mainly depends on memory and FSB performance. In spite of using the same memory controller as the i840 chipset, the Pentium 4 on the i850 chipset holds a 75% performance advantage over even the fastest AMD 760 DDR platform.
There are a number of possible explanations for this. It could be that the Pentium 4 truly requires RDRAM to shine, or it could be that the Pentium 4's 400MHz bus is giving it a major performance advantage here, among other possibilities. Since the i850 shares the same memory controller as the i840, it is unlikely that RDRAM is the cause of that performance curve, in which case the 400MHz FSB could really be coming in handy, as it is able to give the Pentium 4 the bandwidth it needs. However, it is hard to believe that the AMD 760's 266MHz bus isn't enough to do at least the same.
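For a sense of scale, the peak figures for the two buses can be estimated from their effective clock rates. This is a back-of-the-envelope sketch that assumes both buses are 64 bits wide (the width is an assumption, not something stated here) and ignores protocol overhead; the function name is made up for illustration.

```python
def peak_bus_bandwidth_gb_s(effective_mhz, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (decimal) for a given effective bus rate."""
    bytes_per_transfer = bus_width_bits // 8
    return effective_mhz * 1e6 * bytes_per_transfer / 1e9


p4_bus = peak_bus_bandwidth_gb_s(400)      # Pentium 4 FSB from the text
athlon_bus = peak_bus_bandwidth_gb_s(266)  # AMD 760 bus from the text

print(f"Pentium 4 bus: {p4_bus:.2f} GB/s")
print(f"AMD 760 bus:   {athlon_bus:.2f} GB/s")
print(f"Ratio:         {p4_bus / athlon_bus:.2f}x")
```

Under that assumption the 400MHz bus peaks at about 3.2GB/s against roughly 2.1GB/s for the 266MHz bus, a gap of about 50%. That is smaller than the 75% difference measured, which is consistent with the bus not being the whole story.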
The only remaining explanation points us in the direction of the Pentium 4's architecture itself. As you're sure to see in benchmarks floating around, the Pentium 4 will do quite well in SPECfp_2000. This is partially because of processor-specific compiler optimizations, which show you exactly how much performance is to be gained from taking full advantage of the Pentium 4's architecture, but also because, with some of the benchmark's large data sets, the Pentium 4's highly advanced branch predictor can come in handy. By correctly predicting a large portion of branches, the Pentium 4 should be able to excel at most of the SPECfp_2000 tests that we determined to have large data sets. This Linpack performance curve is indicative of something quite similar happening here: as the data set increases in size, the Athlon and the Pentium III drop down to a lower level of performance while the Pentium 4, with its more advanced branch predictor, rises to the top.
Looking towards the future the Pentium 4's performance curve here could be something to keep in mind, but let's take a look at its performance in today's applications before concluding on that.
Starting out with Quake III Arena we see that the Pentium 4 immediately jumps to the top of the performance charts. Here we have a case where Quake III Arena is benefiting greatly from the Pentium 4's 400MHz FSB. We noticed early on that Quake III Arena is very sensitive to memory and FSB performance. We can rule out memory performance being a major contributor to the Pentium 4's performance advantage here since the Pentium III 1GHz on the i840, with a similar memory subsystem, is unable to step ahead of the competition.
A major factor for the Pentium 4's performance will be the 400MHz bus, especially as games increase in complexity and the amount of data being sent to and from the CPU increases.
At 1024 x 768 x 32, once again, we're limited by the memory bandwidth of our test GeForce2 GTS. This illustrates both the need for a more efficient memory subsystem in today's graphics cards and the fact that you don't need a 1.5GHz processor to run at these high resolutions; a 500 - 600MHz processor will do the job almost as well. This doesn't hold true for all games, especially more CPU intensive games like flight simulators, but for first person shooters like Quake III Arena you've got to face the facts.
In spite of the 12% lead the 1.5GHz Pentium 4 took in the Quake III Arena benchmarks, the 1.2GHz Athlon on the AMD 760 platform manages to take a 5% lead over the 1.5GHz P4. This is the perfect example of how the Pentium 4 needs a higher clock speed in order to distance itself from the competition. At clock speeds close to that of the Athlon, without any SSE2 specific optimizations, the Pentium 4 will almost always come out under the Athlon.
We pointed out in an earlier article that Intel will be releasing a 1.3GHz Pentium 4 early next year. Judging by the performance we see here, you can expect a 1.3GHz Pentium 4 to perform barely faster than a Pentium III clocked at 1GHz.
Once again, as the resolution reaches a certain level, the graphics card becomes the limiting factor. At this point, regardless of your CPU, you're not going to be able to get any more performance out of your system.
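The idea of the graphics card becoming the limiting factor can be pictured with a very simple bottleneck model, where the delivered frame rate is whichever of the CPU or the graphics card is slower. This is a sketch only: the function name and all of the frame rate numbers below are hypothetical and are not measurements from this review.

```python
def delivered_fps(cpu_fps, gpu_fps):
    """Frame rate of a system limited by its slower component."""
    return min(cpu_fps, gpu_fps)


# Hypothetical numbers for illustration only: how many frames per second
# each CPU could prepare, and how many the graphics card could render.
cpu_limits = {"slower CPU": 90, "faster CPU": 150}
gpu_limits = {"low resolution": 200, "high resolution": 70}

for resolution, gpu_fps in gpu_limits.items():
    for cpu, cpu_fps in cpu_limits.items():
        print(f"{resolution:>15}, {cpu}: {delivered_fps(cpu_fps, gpu_fps)} fps")
```

With these made-up numbers the two CPUs are clearly separated at the low resolution but land on the same frame rate at the high resolution, because the graphics card sets the ceiling there.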
Again, the Athlon at 1.2GHz holds a 10% lead over the Pentium 4 at 1.5GHz. Even on a PC133 platform, the 1.2GHz Athlon ends up 2% faster than the Pentium 4.
UnrealTournament is a very texture intensive game, meaning that there is a considerable amount of stress on memory bandwidth, which the AMD 760 DDR platform definitely has. At the same time there is a stress on low latency memory performance, which the AMD 760 platform has, but which the Pentium 4 also has. Remember that the Athlon can do more in a single clock than the Pentium 4, making the 300MHz difference in clock speed between the Pentium 4 and the Athlon compared here mean very little.
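The trade-off between work per clock and clock speed can be sketched with the usual first-order model, in which performance is roughly instructions per clock (IPC) multiplied by clock frequency. The model and the function names are simplifying assumptions for illustration; real results also depend on memory, caches, and the workload.

```python
def relative_performance(ipc, clock_ghz):
    """First-order performance estimate: work per clock times clocks per second."""
    return ipc * clock_ghz


def break_even_ipc_ratio(faster_clock_ghz, slower_clock_ghz):
    """How much more work per clock the lower-clocked CPU needs to draw level."""
    return faster_clock_ghz / slower_clock_ghz


# Clock speeds of the two chips compared in the text.
ratio = break_even_ipc_ratio(1.5, 1.2)
print(f"Athlon 1.2GHz needs {ratio:.2f}x the Pentium 4's work per clock to match 1.5GHz")
```

With the clocks compared here, the Athlon only needs to do about 25% more work per clock to break even; anything beyond that puts it ahead despite the 300MHz deficit.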
Under Expendable, a very memory performance dependent test, the Pentium 4 suffers considerably, falling to the bottom of our performance charts. Even a higher clock speed wouldn't help the Pentium 4 here; it seems like there are some situations in which the Pentium 4 just isn't a high performer.
The SYSMark 2000 benchmark is perfect for illustrating what even a PC133 SDRAM platform could do for the Pentium 4. Notice that the 1.5GHz Pentium 4 is just 3 points lower than a 1GHz Pentium III on an i840 platform; keep in mind that the two platforms have the exact same memory configuration. Now look at the same Pentium III on a PC133 i815 platform. See the huge performance advantage it gets just by moving to PC133 SDRAM instead of the i840's dual channel RDRAM setup? Chances are we'd see a similar boost for the Pentium 4, bringing it much closer in performance to the Athlon; however, we won't be seeing that anytime soon. And for the home/office user, the Pentium 4 would actually be a downgrade in many cases.
The Pentium 4 does much better under Content Creation Winstone 2000 which provides a more realistic performance testing environment than SYSMark 2000. Content Creation Winstone 2000 measures performance while multitasking, switching between office and content creation applications such as MS Word, Photoshop, Dreamweaver, etc... and performing normal tasks in each application before switching to another.
The low latency caches of the Pentium 4 come in handy here, as with these applications you're generally not dealing with huge sets of data.
High End Winstone was once dominated by Intel, before the introduction of the Athlon. Now, even with the Pentium 4, High End Winstone still belongs to the Athlon although at 1.5GHz the Pentium 4 does put up a bit of a fight. Again, we have a situation in which the Pentium 4 needs a much higher clock speed to compete. Couple that with a PC133 or DDR SDRAM platform to run on and the Pentium 4 could become a serious contender, however until then the Pentium 4 at 1.4GHz doesn't even make sense to pursue and at 1.5GHz it's barely able to outperform a 1GHz Athlon priced at almost half of its cost.
Here the Pentium 4 has a ton of catching up to do; it's at the point where it would take much more than a clock speed boost to make it competitive. SSE2 enhancements could most definitely help here, as the Pentium 4 is really hurting.
This is the second real world benchmark in which the Pentium 4 comes out on top, showing that in certain cases its architecture can be a strong performer without an extremely high clock speed or SSE2 optimizations. Part of the performance advantage here could be the memory subsystem because if you look at the i840 platform, which shares the same memory controller as the i850, its performance is quite respectable.
Once again the Pentium 4 returns to its performance level that's just barely faster than the Pentium III and slower than all of the Athlons in this comparison.
Again we get a similar picture painted here: the Pentium 4 still isn't cutting it in today's benchmarks.
Another example of where a higher clocked Pentium 4 could come in handy is in this MedMCAD test where the Pentium 4 at 1.5GHz has the slight advantage and could definitely take a clear lead if clocked even higher. Unfortunately for Intel, it seems as if AMD will always be hovering around the same clock frequency as the Pentium 4, at least for next year, with an Athlon.
The Pentium 4 is truly a bandwidth monster as we discovered with our initial Linpack benchmarks. While the 50% higher clock speed makes up for some of this lead, the majority of it is due to its 400MHz FSB, low latency caches, and excellent branch predictor. Too bad the real world performance of the processor, using today's applications, isn't able to take advantage of this incredible amount of bandwidth.
Again, the Pentium 4 holds an enormous bandwidth advantage over the competition. And again, it's something we don't see duplicated in the real world tests. It will be interesting to see how tomorrow's games and tomorrow's benchmarks view the Pentium 4.
The RC5 performance test is almost exclusively a measure of integer performance. Factors such as memory bandwidth, FSB frequency or cache performance don't affect the score and, as you can see, the Pentium 4 isn't doing well at all. Even with an optimized RC5 client it is doubtful that the Pentium 4 will be able to do much in purely integer tests like this one.
Final Words
So there we have it, the long awaited successor to Intel's P6 micro-architecture is here and the long awaited follow up to Intel's Pentium III brings it to us. The Pentium 4 is an interesting CPU, simply because it has a number of features going for it, yet it seems like there are an equal number of factors working against its success. Let's talk about its strengths first:
The Pentium 4 will thrive as its clock speed increases. Unfortunately for Intel, it seems like the Athlon may be able to compete in terms of clock speed ramp, at least for the time being. While still on a 0.18-micron process AMD should be able to hit at least 1.5GHz with the Athlon. It will take at least a 2GHz Pentium 4, released earlier than what Intel's current roadmap places it at, to compete in terms of overall performance with the Athlon. As the Athlon increases in clock speed, the Pentium 4 will have to match its performance level with an even greater increase in clock speed.
Low latency and high bandwidth are the keywords when it comes to talking about the Pentium 4's caches. The high hit rate L1 cache and the extremely high bandwidth L2 cache will make the Pentium 4 a solid starting ground for any future NetBurst micro-architecture based designs.
Provided that developers take advantage of them, the 144 new SSE2 instructions could yield quite a major performance improvement for the Pentium 4.
The Pentium 4's extremely advanced branch predictor could prove to be very useful in tomorrow's floating point intensive applications. It isn't a surprise that the computing world is moving towards a more 3D environment, and with that comes an influx of more floating point intensive applications. Intel is banking on the Pentium 4's architecture being able to succeed in tomorrow's computing world.
Requiring the use of the ATX 2.03 specification is a step in the right direction, although it will be faced with much opposition since it will require that everyone purchase new cases and new power supplies. The fact of the matter is that today's systems do need more power, and maintaining stability by making sure that today's systems have enough power is worth the extra $100 in the long run.
Unfortunately, in spite of the many good points about the Pentium 4, at least on paper, there is just too much working against it.
For starters, while the Pentium 4 requires a higher clock speed to maintain a performance lead, the fact of the matter is that according to Intel's roadmap the CPU won't hit 2GHz until the third quarter. The next Pentium 4 to hit the streets will be the 1.3GHz Pentium 4, which will offer a very low performance level compared to the competition; it would make more sense to pursue a Pentium III instead of a 1.3GHz P4. If you're thinking about keeping a longer lasting system, don't forget that the Socket-423 interface will begin to be phased out starting at the middle of next year, so the Pentium 4 won't leave you in a much better position than the Pentium III.
While it's a good idea for Intel to attempt to take the price of RDRAM out of the picture by bundling two sticks with each boxed CPU, this isn't a true solution to the problem. The real solution is a DDR SDRAM platform for the Pentium 4, preferably one from Intel (VIA hasn't always had the best memory performance), and it needs one before it's too late. According to Intel, the Brookdale chipset (DDR SDRAM for the Pentium 4) won't be out until the first quarter of 2002; by that time even Dell will be begging for AMD chips if there is no DDR chipset for the Pentium 4. If Intel doesn't come through with one, it seems like it will be up to VIA. Luckily their DDR memory controller is already sampling in Apollo Pro 266 chipsets, so it's mainly a matter of licensing the bus from Intel and implementing it in a North Bridge design.
We mentioned that SSE2 is a benefit that the Pentium 4 holds. At the same time it is a downside, since a lot of the power of the Pentium 4 could come from the proper optimization of applications for SSE2, which we won't see in most applications for some time to come. With AMD also supporting SSE2 by the end of 2001, the 144 instructions should be embraced by the industry, and they will be, but it will take some time.
For today's buyer, the Pentium 4 simply doesn't make sense. It's slower than the competition in just about every area, it's more expensive, it's using an interface that won't be the flagship interface in 6 - 9 months and it requires a considerable investment outside of the price of the CPU itself. Remember that you have to buy a new motherboard, new memory (if you don't get it bundled with a boxed CPU), and a new power supply/case. This is the investment that must be made in order to have a CPU that can't outperform any of today's top performers with the promise that tomorrow's Pentium 4 will be better.
Our recommendation to you? Wait until the Pentium 4 turns out to be a bit more. SSE2 support is still in its infancy, the i850 platform is doomed because of its exclusive RDRAM support, the Socket-423 interface will go away pretty soon and the performance just isn't there. Intel does have the potential to make the Pentium 4 a success, for the reasons we just mentioned and discussed earlier in the article; however, it's far from a success today.