Original Link: https://www.anandtech.com/show/495
The name Intel Developer Forum (IDF) would normally indicate a convention revolving almost exclusively around Intel, but our experience with this year’s Spring IDF was quite the contrary. While it is true that we got quite an interesting look at what Intel has planned for this year, we also got to see a microcosm representing the future of the x86 computing world as a whole, including a few interesting tidbits of information from AMD, the last name you’d expect to hear about at an IDF.
The Spring 2000 IDF was held in sunny Palm Springs, California, though this will most likely be the last time the IDF is held there. The fairly small convention was home to over 3000 developers and a handful of the press for its duration. The forum itself was very well organized, and it took quite a bit of teamwork to get as much coverage of the happenings as possible.
We brought you live coverage of the Forum from the press room in the Wyndham Hotel, where some of the sessions took place. Our coverage was split into four parts. The first dealt with CPUs, particularly the Willamette and Timna processors that were demonstrated at the Forum. The second focused on Serial ATA and USB 2.0, two technologies that were demonstrated (rather poorly, but displayed nonetheless). The third focused on Intel’s workstation/server class Itanium processor, and the fourth offered a summary of the show and some pictures of Itanium based servers.
Now, two weeks after our initial coverage of the show, we are bringing you a complete summary of all that we’ve seen as well as some unique insight into exactly what Intel and AMD have planned for the future. So without further ado, let’s get to Intel’s first announcement, the Willamette.
The Willamette
Grasping the architectural advantages of the Willamette may be a difficult feat to accomplish, but for most of the IDF attendees, learning how to pronounce the code name of the upcoming processor was even more challenging. For starters, the name Willamette (taken from a river near where the CPU was designed) is pronounced Will-AM-ette. But now to the more important things ;)
Willamette will be Intel’s first major architectural change since the Pentium Pro was introduced seemingly ages ago, in 1995. What all is a part of this “major architectural change” that we keep on talking about?
20 Stage Pipeline
By lengthening the pipeline of the Willamette to 20 stages (up from 10 stages on the Pentium III), Intel is able to increase the clock speed of the Willamette. How can a 20 stage pipeline allow for higher clock speeds? With a longer pipeline, less work is done in each stage per clock cycle, so the CPU can operate at a higher frequency. With a shorter pipeline, more (and more complex) work is done per clock cycle, which limits you to lower clock frequencies.
Since the Willamette is geared to be the successor to the Pentium III (which is why most people refer to it as the Pentium IV), it makes perfect sense that Intel would want to debut it at a higher clock speed than the Pentium III. With the current Pentium III Coppermine core set to hit 1GHz around the launch, this clock speed advantage is absolutely necessary; it would be difficult for Intel to demonstrate any purpose for the Willamette if it weren’t clocked noticeably higher than the Pentium III. Remember, since the Willamette probably can’t do as much per clock as the Athlon/Pentium III, it will rely on this increased clock speed to remain competitive.
As the clock speed increases beyond what the Pentium III would be able to accomplish, something that is made possible because of the deeper pipeline, the Willamette’s 20 stage pipeline will truly begin to shine.
In essence, the number of pipeline stages helps to determine the attainable clock speed of a processor: the deeper the pipeline, the higher the clock speed can go.
Modern day CPUs attempt to increase the efficiency of their pipelines by predicting what they will be asked to do next; this is a simplified explanation of the term branch prediction. When a processor predicts correctly, everything goes according to plan, but when an incorrect prediction is made the processing cycle must start all over at the beginning of the pipeline. Because of this, a processor with a 10 stage pipeline has a lower penalty for a mis-predicted branch than a processor with a 20 stage pipeline. The longer the pipeline, the further back in the process you have to start over in order to make up for a mis-predicted branch.
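The trade-off between pipeline depth and mis-prediction penalty can be put into rough numbers. The sketch below is a toy model, not Intel data: it assumes a mis-predicted branch costs one full pipeline refill, and the branch frequency and mis-prediction rate are made-up illustrative values.

```python
def effective_cpi(pipeline_stages, branch_fraction, mispredict_rate, base_cpi=1.0):
    """Average cycles per instruction when each mis-predicted branch
    costs one full pipeline refill (a simplifying assumption)."""
    penalty_cycles = pipeline_stages
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches,
# and 10% of those branches are mis-predicted.
cpi_10_stage = effective_cpi(10, 0.20, 0.10)
cpi_20_stage = effective_cpi(20, 0.20, 0.10)

print(f"10 stage pipeline: {cpi_10_stage:.2f} cycles per instruction")
print(f"20 stage pipeline: {cpi_20_stage:.2f} cycles per instruction")
```

With these made-up numbers the 20 stage design spends about 17% more cycles per instruction, so it would need a clock speed advantage of at least that much just to break even.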
With the Willamette boasting a 20 stage pipeline, the penalty for a mis-predicted branch is much greater than it would be on a Pentium III. So how did Intel attempt to work around this problem?
Trace Cache
The decoder of any x86 CPU (what takes the fetched instructions and decodes them into a form understandable by the execution units) has one of the highest gate counts out of all of the pieces of logic. This translates into quite a bit of time being spent in the decoding stage when preparing to process an instruction either for the first time or after a branch mis-prediction.
This is where the Willamette’s trace cache comes into play. The trace cache acts as a middle man between the decoding stage and the first stage of execution after the decoding has been completed. The trace cache essentially caches decoded micro-ops (the instructions after they have been fetched and decoded, thus ready for execution) so that instead of going through the fetching and decoding process all over again when an instruction is executed again, the Willamette can just go straight to the trace cache, retrieve its decoded micro-op and begin execution.
The trace cache isn’t on the Willamette only to improve performance; it is also there to hide the penalties associated with incorrectly predicting a branch in the Willamette’s deeper 20 stage pipeline. On the Willamette, an incorrectly predicted branch could potentially only send the instruction back to the trace cache, where the fetching/decoding process could be skipped and execution could take place almost immediately. A major downside to the Willamette’s 20 stage pipeline is thus somewhat masked by the trace cache.
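The basic idea of skipping the decoder on repeat visits can be sketched in a few lines. This is only a conceptual toy: the class, the decoder stand-in and the micro-op names are all invented for illustration, and a real trace cache stores micro-ops along the predicted path of execution rather than in a simple per-address lookup.

```python
class ToyTraceCache:
    """Caches decoded micro-ops by instruction address (illustrative only)."""

    def __init__(self, decode):
        self._decode = decode   # the slow fetch/decode step
        self._traces = {}       # address -> list of decoded micro-ops
        self.decode_count = 0   # how many times the decoder was used

    def micro_ops(self, address):
        if address not in self._traces:
            # Miss: pay the full fetch/decode cost once and remember the result.
            self.decode_count += 1
            self._traces[address] = self._decode(address)
        # Hit (or freshly filled): hand back ready-to-execute micro-ops.
        return self._traces[address]


def fake_decode(address):
    # Stand-in for the x86 decoder; the micro-op names are made up.
    return [f"uop_{address:#x}_{i}" for i in range(2)]


cache = ToyTraceCache(fake_decode)
first = cache.micro_ops(0x1000)   # goes through the decoder
again = cache.micro_ops(0x1000)   # served from the cache, decoder skipped
print("decoder invocations:", cache.decode_count)
```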
Another benefit of the trace cache is that it caches the micro-ops in the predicted path of execution, meaning that if the Willamette fetches 3 instructions from the trace cache they are already presented in their order of execution. This adds the potential for an incorrectly predicted path of execution among the cached micro-ops; however, Intel is confident that these penalties will be minimized because of the prediction algorithms used by the Willamette.
Double Pumped ALU & Low Latency Data Cache
This was the big attention getter when we published our first live report from IDF: the Willamette has a double pumped Integer Arithmetic Logic Unit (ALU). The ALU is what actually executes the integer instructions in the Willamette, and by “double pumping” it, Intel is able to make the two physical ALUs of the Willamette produce the benefits of four ALUs each running at the core frequency of the CPU.
The Willamette’s double pumped ALU should make the business application and content creation performance of the processor very difficult to beat since those two areas are relatively non-FPU intensive and would benefit greatly from the double pumped Integer ALU.
The double pumped ALU naturally reduces the latency associated with executing instructions. For example, a single add or subtract would take only 1/2 of a clock cycle on a Willamette. Theoretically, you could execute a total of four simple integer instructions in a single clock cycle through the Willamette’s two physical ALUs courtesy of their double pumped nature.
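Taking the description of two physical ALUs behaving like four at face value, the peak numbers work out as follows. This is a back-of-the-envelope sketch with hypothetical function names, and it only applies to simple integer operations such as add and subtract.

```python
def peak_simple_int_ops_per_clock(physical_alus, pump_factor):
    """Each ALU completes pump_factor simple operations per core clock."""
    return physical_alus * pump_factor


def simple_op_latency_clocks(pump_factor):
    """Latency of one simple operation, measured in core clocks."""
    return 1 / pump_factor


willamette_peak = peak_simple_int_ops_per_clock(physical_alus=2, pump_factor=2)
willamette_latency = simple_op_latency_clocks(pump_factor=2)

print("peak simple integer ops per core clock:", willamette_peak)
print("add/subtract latency in core clocks:", willamette_latency)
```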
On the topic of low latencies, the Willamette will also feature a very low latency data cache which makes up half of the CPU’s L1 cache. The L1 data cache boasts an extremely low 2 clock load latency, which is considerably lower than the L1 data cache latency on the Pentium III.
FPU & SSE2
The Willamette, as we mentioned in our first IDF Report, will have 144 new Streaming SIMD Extensions 2 (SSE2) instructions that could either make or break the FPU performance of the Willamette. Why do we say that?
There isn’t much known about the Willamette’s FPU; as far as we know now, on its own it could be inferior to the Athlon’s FPU. Intel said very little about the Willamette’s FPU at the Spring 2000 IDF and instead focused on what SSE2 could do for the Willamette in terms of floating point performance.
The benefits of SSE2 come from its extensions to MMX and SSE. SSE2 offers 128-bit SIMD-Int and 128-bit SIMD Double Precision FP instructions, the former being an extension of 64-bit MMX and the latter extending SSE’s single precision FP support to double precision. Being able to handle two 64-bit double precision FP operations at once will be very useful in the professional arena, especially in MCAD and 3D visualization applications among others. That only holds if SSE2 is taken advantage of in these particular applications as well as in the drivers on the video card level; if not, then the FPU performance of the Willamette is left at what its FPU can accomplish alone.
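To make the “two 64-bit operations at once” point concrete, here is a small sketch of how many values fit in one 128-bit operand and what a packed add does. It is plain Python emulating the idea, not real SSE2 code, and the helper names are hypothetical.

```python
REGISTER_BITS = 128


def lanes(element_bits, register_bits=REGISTER_BITS):
    """How many elements of a given width fit in one SIMD operand."""
    return register_bits // element_bits


def packed_add(a, b):
    """Emulates one packed add: every lane is added in a single operation."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


print("64-bit double precision lanes:", lanes(64))
print("16-bit integer lanes:", lanes(16))

# Two double precision additions handled by one packed operation.
result = packed_add([1.5, 2.0], [0.5, 4.0])
print(result)
```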
SSE2 also features cache and memory management operations as well as new encryption operations. While those last two features are a bit vague, we should know more about them at this Fall’s IDF.
Willamette Bus & Tehama Chipset
The Willamette features a 64-bit, 3.2GB/s FSB that will most likely operate at 100MHz QDR, or quadruple pumped as Intel likes to call it. This means that the FSB is clocked at 100MHz but transfers 4 times as much data per clock, much like the way AGP 4X operates today.
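The 3.2GB/s figure follows directly from the bus width, the clock and the number of transfers per clock. The short calculation below shows the arithmetic; the Pentium III line is added for comparison and assumes a 64-bit bus with one transfer per clock at 133MHz.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock):
    """Peak bus bandwidth in GB/s (1GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Willamette: 64 bits wide, 100MHz, quad pumped.
willamette_fsb = peak_bandwidth_gb_per_s(64, 100, 4)

# Assumed comparison point: 133MHz FSB Pentium III, one transfer per clock.
pentium3_fsb = peak_bandwidth_gb_per_s(64, 133, 1)

print(f"Willamette FSB:  {willamette_fsb:.2f} GB/s")
print(f"Pentium III FSB: {pentium3_fsb:.2f} GB/s")
```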
The only chipset to support the Willamette at its launch will be the Tehama which boasts exclusive support for RDRAM as a memory type. The chipset will feature a dual channel RDRAM interface, much like that on the i840 chipset, which provides for a maximum of 3.2GB/s of peak memory bandwidth. While the chipset does not officially support SDRAM, should the need arise (translation: if Intel is wrong and the cost of RDRAM doesn’t drop by the release of the Willamette), motherboard manufacturers should theoretically be able to use Memory Translator Hubs on Tehama boards to translate the RDRAM memory requests into SDRAM memory requests.
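The 3.2GB/s memory figure can be reproduced the same way. The per-channel numbers below are assumptions not stated above: each RDRAM channel is taken to be 16 bits wide and to run at 800 million transfers per second (PC800).

```python
def rdram_peak_gb_per_s(channels, channel_width_bits=16, mega_transfers_per_s=800):
    """Peak RDRAM bandwidth in GB/s; defaults assume 16-bit PC800 channels."""
    bytes_per_transfer = channel_width_bits // 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


single_channel = rdram_peak_gb_per_s(channels=1)
tehama_dual_channel = rdram_peak_gb_per_s(channels=2)

print(f"one channel:  {single_channel:.1f} GB/s")
print(f"two channels: {tehama_dual_channel:.1f} GB/s")
```

Under those assumptions the dual channel interface matches the peak bandwidth of the quad pumped FSB, so neither side of the chipset would be starving the other on paper.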
Then again, the performance benefits of RDRAM are supposed to scale quite nicely as CPUs get faster and faster so the Willamette may actually show appreciation for RDRAM while our current 133MHz FSB Pentium IIIs are just happy with PC100/PC133 SDRAM.
The new Celeron & Timna
Intel will shortly be introducing even higher clock speed Celerons manufactured on the 0.18-micron process with the same 128KB of L2 cache we’re used to. Contrary to what we’ve been told in the past, these Celerons will be 66MHz FSB CPUs only, meaning that they will most likely be the next big overclockers for us to enjoy.
If the yields on these 66MHz parts are as high as the yields on the Coppermines, then it shouldn’t be too out of the question to see a Celeron 566 running at 100MHz x 8.5 instead of its default 66MHz x 8.5 setting. Since these CPUs are targeted at the low end market you can expect them to be quite affordable as well.
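The overclocking arithmetic is simply FSB frequency times the fixed clock multiplier. The sketch below assumes the nominal “66MHz” FSB is really 66.67MHz, which is why the stock part is sold as a 566.

```python
def core_clock_mhz(fsb_mhz, multiplier):
    """Core clock is the FSB frequency times the (locked) clock multiplier."""
    return fsb_mhz * multiplier


stock = core_clock_mhz(66.67, 8.5)       # default Celeron 566 setting
overclocked = core_clock_mhz(100, 8.5)   # same multiplier on a 100MHz FSB
gain = overclocked / stock - 1

print(f"stock:       {stock:.0f} MHz")
print(f"overclocked: {overclocked:.0f} MHz")
print(f"increase:    {gain:.0%}")
```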
The continuation of the 66MHz FSB Celeron line may mean a new upgrade option for older BX/LX motherboard owners, but the processors would admittedly have to have BIOS support in order for this wish to come true. Imagine being able to use a board you bought over two years ago with a 700MHz processor…
The Celeron line will eventually welcome the Timna processor which, as we mentioned in Part 1 of our IDF coverage, is the first chip to boast Intel’s “Smart Integration.”
Smart Integration essentially takes the memory controller and the graphics controller and moves them off of the motherboard and actually integrates them onto the CPU. This will definitely be a poor gaming solution compared to the dedicated 3D graphics accelerator solutions that will be available around the release of the Timna, but for the entry level market the Timna should be quite successful.
According to Intel, the Timna’s integrated graphics should be an extension of the i752 graphics core, and not an entirely new design.
If it ever makes it into the notebook market, the Timna should be an interesting solution there as well.
New Mobile Parts
On the mobile side of things we have the 100MHz FSB Celerons, which shouldn’t be much different from their 66MHz desktop counterparts other than the increase in FSB frequency. These 100MHz FSB Celerons boast SSE support, a 0.18-micron fabrication process, and 128KB of on-die L2 cache running at clock speed. They should thus offer some fairly decent competition to the more expensive 256KB mobile Pentium III parts, which share the same features with the addition of an extra 128KB of L2 cache.
The new 100MHz FSB Celerons are available in 400MHz (400A), 450MHz and 500MHz parts. The mobile 400A runs at a 1.35v core voltage and requires 10W of power while the 450/500MHz chips both run at 1.60v which explains the 16.8W power consumption figure for those two. If you’re concerned about battery life, the Celeron 400A is the way to go, otherwise the 450/500MHz parts are excellent alternatives to their more expensive Pentium III counterparts.
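One way to sanity-check these power figures is the usual first-order rule that dynamic power scales with clock frequency times core voltage squared. That rule is an assumption brought in here, not something Intel presented, and it ignores leakage, so treat the output as a rough estimate only.

```python
def scaled_power_watts(base_watts, base_mhz, base_volts, new_mhz, new_volts):
    """First-order estimate: power scales with frequency * voltage^2."""
    return base_watts * (new_mhz / base_mhz) * (new_volts / base_volts) ** 2


# Start from the mobile Celeron 400A: 10W at 400MHz and 1.35v.
estimate_450 = scaled_power_watts(10.0, 400, 1.35, 450, 1.60)
estimate_500 = scaled_power_watts(10.0, 400, 1.35, 500, 1.60)

print(f"estimated 450MHz part: {estimate_450:.1f} W")
print(f"estimated 500MHz part: {estimate_500:.1f} W")
```

The two estimates land on either side of the 16.8W quoted for both parts, which is at least consistent with the higher core voltage being the main reason for the jump from 10W.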
For most users, a 500MHz notebook based on a Celeron 500/100MHz with 128KB of L2 cache would offer a much better value than its 256KB L2 mobile Pentium III counterpart. As of the date of publication, one of the new 500MHz mobile Celerons is approximately 40% cheaper than a 500MHz mobile Pentium III in OEM quantities of 1000. This translates into quite a noticeable difference in the overall cost of laptops based on these two processors, and for most users, the extra 128KB of L2 cache isn’t worth the added cost.
In order to keep the 100MHz FSB mobile Celerons from competing with the mobile Pentium IIIs, Intel will keep the clock speeds of these down to around the 500MHz level while the mobile Pentium III ramps up to 750MHz towards the middle of this year.
Another feature that the mobile Celerons don’t have is Intel’s recently announced SpeedStep technology that allows the dynamic adjustment of a processor’s clock frequency and core voltage depending on whether it is running off of battery power or plugged into a wall outlet. SpeedStep is currently reserved for the mobile Pentium IIIs.
Keep in mind that not all mobile Pentium IIIs feature SpeedStep; only the 600 and 650MHz parts do. The 600/650MHz SpeedStep parts drop to 500MHz when running on battery power and drop their core voltage from 1.60v to 1.35v at the lower frequency. Then we have the Pentium III 450 and 500, which run on a 1.60v core voltage, in addition to two newer Pentium IIIs, a 400 and an updated 500, that run at 1.35v core.
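The SpeedStep behaviour described here boils down to picking one of two operating points based on the power source. The sketch below models just that decision using the frequencies and voltages quoted above; the type and function names are hypothetical and this is not how the actual hardware or driver interface works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    mhz: int
    volts: float


SPEEDSTEP_PARTS = (600, 650)  # the only mobile Pentium IIIs with SpeedStep


def speedstep_point(rated_mhz, on_battery):
    """Return the clock and core voltage a SpeedStep part would run at."""
    if rated_mhz not in SPEEDSTEP_PARTS:
        raise ValueError(f"{rated_mhz}MHz part does not feature SpeedStep")
    if on_battery:
        return OperatingPoint(mhz=500, volts=1.35)
    return OperatingPoint(mhz=rated_mhz, volts=1.60)


plugged_in = speedstep_point(650, on_battery=False)
on_battery = speedstep_point(650, on_battery=True)
print(plugged_in)
print(on_battery)
```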
The updated 500s are essentially SpeedStep 600/650MHz chips that always run at the lower speed. Ah, how nice it would be if we could overclock laptops…
AMD’s Thunderbird
As you’ve all probably heard by now, AMD has had their new Thunderbird core up and running at 1.1GHz for quite some time.
The Thunderbird core improves on the current K75 core by integrating a full 256KB of L2 cache onto the 0.18-micron die itself. Although this brings the L2 cache size down from 512KB, it helps to improve performance because the Thunderbird’s L2 cache will be operating at clock speed instead of some odd divisor of the core clock.
This will help to remove a huge scaling bottleneck with the Athlon and will hopefully prepare the Athlon for a heated battle between itself and Intel’s upcoming Willamette. Plus, by moving the L2 cache onto the die itself AMD will be able to follow in Intel’s footsteps and move to a socketed design which should help to reduce costs.
AMD’s current K75 core will experience at least one more speed boost before the release of the Thunderbird core which will hit the streets around the E3 timeframe in May. Other than moving the L2 cache on-die and cutting it in half, the Thunderbird should be identical to the current Athlons with an obvious improvement in performance.
According to AMD, we should expect to see a 10 – 20% boost in performance across the board by moving the L2 cache on-die and running it at the core clock speed. With the Athlon already keeping up quite well with Intel’s Pentium III, this boost should give it the momentum AMD needs to pull ahead of the Pentium III and prepare for its true match against the Willamette. Luckily, if everything goes according to AMD’s plan, the Thunderbird will never have to be put up against the Willamette.
The current Athlons will disappear shortly after the introduction of the Thunderbird core, and the Thunderbird will spin off a low-cost version of itself known as the Spitfire. The Spitfire will feature 128KB of L2 cache running at clock speed and will only be available in a 462-pin Socket A interface.
The Spitfire will be positioned against the Celeron in the low-end market, while the Thunderbird should be able to square off with the remaining higher clock speed Pentium IIIs, thus paving the way for a new CPU introduction from AMD to compete with Intel’s Willamette in October.
What AMD hopes to combat the Willamette with is a combination of the upcoming Mustang core with a minimum of 512KB of on-die L2 cache as well as the higher clock speed Thunderbird solutions.
Athlon Chipsets
It is going to be up to AMD to release the next major chipset for the Athlon since VIA is going to be focusing on more of the low-cost market with their upcoming Athlon chipsets, including an integrated video solution. While this does leave ALi to help out with providing the Athlon with the next major chipset platform, AMD has already committed to supplying the next two major chipsets so there isn’t a need for immediate support from ALi.
The AMD 760 chipset, the successor to the AMD 750 that was introduced with the Athlon last year, will add DDR SDRAM support to the Athlon platform as well as offer an increase in the Athlon’s FSB frequency to 133MHz DDR (266MHz).
We are on the verge of running out of clock multipliers with the Athlon, as 10.5x is the maximum clock multiplier on all Athlon CPUs available today. AMD’s 1.1GHz demonstration used a modified Athlon PCB to allow for the higher clock multiplier to be used. The introduction of the Thunderbird should change this, but until then companies like Kryotech are working with motherboard manufacturers to try and push the Athlon’s EV6 bus to 133MHz DDR now. Whether or not this will be possible before the AMD 760 chipset is released is up in the air, but it’s interesting nevertheless.
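The multiplier ceiling is easy to quantify. The sketch below assumes the current EV6 bus clock is 100MHz (200MHz effective) and that multipliers come in half steps; both are assumptions not spelled out above, and the function names are hypothetical.

```python
import math

MAX_MULTIPLIER = 10.5  # highest multiplier on Athlons available today


def max_core_mhz(bus_mhz, max_multiplier=MAX_MULTIPLIER):
    """Highest core clock reachable on a given bus clock."""
    return bus_mhz * max_multiplier


def multiplier_needed(target_mhz, bus_mhz, step=0.5):
    """Smallest multiplier (rounded up to the step size) that reaches target_mhz."""
    return math.ceil(target_mhz / bus_mhz / step) * step


ceiling_at_100 = max_core_mhz(100)
needed_at_100 = multiplier_needed(1100, 100)
needed_at_133 = multiplier_needed(1100, 133.33)

print(f"ceiling on a 100MHz bus: {ceiling_at_100:.0f} MHz")
print(f"1.1GHz on a 100MHz bus needs {needed_at_100}x")
print(f"1.1GHz on a 133MHz bus needs {needed_at_133}x")
```

Under these assumptions a 1.1GHz part needs an 11x multiplier on today’s bus, which is past the 10.5x limit, while a 133MHz bus would bring it comfortably back within range.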
An interesting thing to note is that the AMD 760 chipset will be Socket-A only, meaning that by the time the AMD 760 makes its way to market, you better be using a Socket-A processor otherwise you’re stuck with the older generation of chipsets. AMD is hoping to ramp up their Socket-A production very quickly since they are somewhat behind Intel in that respect.
And finally, following the release of the AMD 760 chipset, we should see the AMD 770 chipset, which will be the first dual processor Athlon chipset to hit the market. Currently the lack of any server-level motherboards/chipsets is keeping the Athlon from entering that market; this should hopefully change towards the end of this year and on into 2001.