Original Link: https://www.anandtech.com/show/769



It was less than three years ago that Intel released the Pentium II Xeon processor.  Based on the same core as the Pentium II and Celerons of the day, the Pentium II Xeon was introduced as a high-end workstation/server processor that could pick up where the Pentium Pro left off.

One of the main goals behind the Xeon was to offer a processor powerful enough to handle the most CPU intensive workstation and server tasks while retaining the features of the P6 core that allowed it to perform well on home/office tasks.  The Pentium II Xeon did away with the idea that a specialized work computer could not also be used for your home/gaming applications.  It also helped x86 gain further ground in the multiprocessor workstation market, which had previously been dominated by non-x86 offerings.

The very first Pentium II Xeon had a full speed L2 cache of up to 2MB.  However, because the 0.25-micron Pentium II die was already fairly large, the L2 cache wasn’t on die; rather, it was contained in a separate chip that was connected to the CPU core by an external bus.  The Xeon family has definitely come a long way since its first days in 1998.  With the Pentium III Xeon’s shrink to a 0.18-micron process, the processor core was able to house an on-die L2 cache of up to 2MB, tremendously increasing the cache performance of the platform.

Today Intel is continuing its trend of segmenting its flagship processors by introducing the next-generation Xeon processor, based on the Pentium 4’s Willamette core.  This processor, branded simply as the Intel Xeon processor, is being launched at 1.4GHz, 1.5GHz and 1.7GHz and has a core that is almost identical to the current desktop Pentium 4, with a few minor changes.



The Architecture of the Intel Xeon

The Intel Xeon processor shares the exact same core as the desktop Pentium 4, meaning that every feature the Pentium 4 can boast, the Xeon can boast as well.  Unfortunately, this also means that the same shortcomings that affect the Pentium 4 will affect the Xeon.

We’ve explained the architecture behind the Pentium 4 many times, so here is a brief rundown of all of the major features behind the Pentium 4 and the Xeon:

Hyper Pipelined Technology – The Xeon features a much longer pipeline than either the Pentium III or the Athlon.  This unfortunately means that the Xeon accomplishes less per clock; however, it does pave the way for the Xeon to achieve much higher clock speeds.  The theory is that doing less per clock doesn’t matter if you can hit incredibly high clock speeds, so the higher clock speeds should give the Xeon a performance advantage over its predecessors.  Case in point: the Pentium III was only able to reach 1GHz on its 0.18-micron process while the Xeon is currently at 1.7GHz on the same 0.18-micron process.  And as you’re about to see, there is a clear performance difference between the two.

Improved Branch Prediction – Obviously, with such a long pipeline it is necessary to have an improved Branch Prediction Unit (BPU), which the Xeon does boast.  The BPU is arguably the most advanced in this sector, and branch prediction is something that has held back the Athlon’s performance somewhat.  In any case, the Xeon’s BPU must be solid; otherwise the penalties associated with its Hyper Pipelined Architecture would cripple the P4 beyond repair.

Rapid Execution Engine – Two of the Xeon’s ALUs (Arithmetic Logic Units: they handle Integer operations) are double pumped, meaning they transfer twice as much data per clock, effectively giving them throughput identical to that of ALUs operating at twice the core frequency.  In the case of the 1.7GHz Xeon, this means that the ALUs operate as if they were normal ALUs (not double pumped) clocked at 3.4GHz.  As we have discovered in the past, this is necessary in order to provide the Xeon with respectable performance when running Integer code.  Integer code is generally much more susceptible to mis-predicted branches, so the lower latency, higher effective clock ALUs allow the branch mis-predict penalties associated with the Xeon’s extremely long pipeline to be minimized when dealing with integer operations.

12K micro-op trace cache – This special cache replaces and improves upon the traditional L1 instruction cache.  The 8-way set associative Execution Trace Cache caches micro-ops after they have been decoded and they are also cached in the predicted path of execution.  This helps to hide some of the performance penalties caused by such a long pipeline.

256KB Advanced Transfer Cache – The Xeon’s L2 cache subsystem is quite incredible to say the least.  Not only does the processor have a 256-bit internal pathway to its L2 cache, it is also able to transfer data from the cache once every clock meaning that it has the highest peak cache bandwidth figures of any processor in its class.  At 1.7GHz, the Xeon has a maximum of 54.4GB/s of bandwidth to/from its L2 cache.  In comparison a Pentium III at 1.0GHz can only offer 16GB/s of bandwidth for L2 data transfers and similarly an Athlon at 1.33GHz can only offer 10GB/s of peak bandwidth (the Athlon only has a 64-bit datapath to its L2).

Hardware Prefetch – The Xeon is able to predict what data it will need from main memory before that data is actually requested and fetch it directly into cache; when the request does come, the data is already there.  In the event that the data isn’t needed, this becomes a waste of cache space and also FSB/memory bandwidth.  In either case, Hardware Prefetch is a FSB/memory bandwidth hog; luckily, the next feature of the Xeon architecture helps keep that from being a problem.

Quad Pumped 100MHz FSB + Dual Channel RDRAM – The Xeon has a 100MHz FSB that is quad pumped to offer data bandwidth equivalent to that of a 400MHz FSB, meaning it can transfer at most 3.2GB/s of data to the Xeon.  This bus runs synchronously with the i850’s (P4 chipset) dual channel RDRAM setup, which runs at 400MHz (transferring data on both clock edges, hence PC800) over two 16-bit wide channels for a total of 3.2GB/s of peak memory bandwidth.  While RDRAM was not necessary on the Pentium III platform, when coupled with the Xeon the bandwidth RDRAM offers is very much appreciated.

SSE2 – The Xeon improves on the original 70 SSE instructions with its 144 new SSE2 instructions; however, even under SPEC CPU2000, the performance improvement offered by SSE2 optimizations alone is supposedly around 5%.  With SPEC CPU2000 being a highly synthetic benchmark, it is unlikely that SSE2 would translate into any real world performance gains in today’s applications.  One thing that isn’t being taken into account here is SSE2’s ability to handle two 64-bit SIMD-Int and SIMD-FP (Single Instruction Multiple Data) operations.  This ability isn’t being taken advantage of in SPEC CPU2000 and could prove to be one of SSE2’s greatest assets.
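The peak bandwidth figures quoted in this rundown all come from the same simple arithmetic: bus width times clock speed times transfers per clock.  The short Python sketch below reproduces them as a sanity check.  The function name is our own, and a few of the inputs are assumptions rather than figures stated above: that the Pentium III's 256-bit L2 path moves data every other clock (which is what yields the quoted 16GB/s), that the Xeon's FSB is 64 bits wide, and that the 400MHz RDRAM channels transfer on both clock edges.

```python
def peak_bandwidth_gb_s(width_bits, clock_mhz, transfers_per_clock=1.0, channels=1):
    """Peak bandwidth in GB/s, using 1GB = 1000MB as the quoted figures do."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock * channels / 1000


figures = {
    # L2 caches: the Xeon moves 256 bits on every core clock.
    "Xeon 1.7GHz L2 cache": peak_bandwidth_gb_s(256, 1700),
    # Assumption: the Pentium III moves 256 bits every other clock.
    "Pentium III 1.0GHz L2 cache": peak_bandwidth_gb_s(256, 1000, transfers_per_clock=0.5),
    "Athlon 1.33GHz L2 cache": peak_bandwidth_gb_s(64, 1330),
    # Assumption: the quad pumped FSB is 64 bits wide.
    "Quad pumped 100MHz FSB": peak_bandwidth_gb_s(64, 100, transfers_per_clock=4),
    # Assumption: each 16-bit RDRAM channel transfers twice per 400MHz clock.
    "Dual channel RDRAM": peak_bandwidth_gb_s(16, 400, transfers_per_clock=2, channels=2),
}

for name, gb_s in figures.items():
    print(f"{name}: {gb_s:.1f} GB/s")
```

Under those assumptions the sketch gives 54.4, 16.0 and about 10.6 GB/s for the three L2 caches, and 3.2 GB/s for both the FSB and the dual RDRAM channels.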



Jackson Technology: Not this time around

When we brought you our IDF coverage we had some very strong reasons to believe that the Intel Xeon processor would be the first to feature what is internally known as Jackson Technology.  As we explained before, Jackson technology is supposed to bring Simultaneous Multithreading (SMT) functionality to a processor's core.  To give a brief overview, the limitation of a single processor is that on the hardware level it can only execute a single thread at one time.  The beauty of SMT is that it allows the processor to execute more than a single thread at once.  The theoretical number of instructions a processor can execute in a given clock cycle (IPC) is generally much higher than the processor's actual IPC during real world usage, simply because the processor is not always kept "busy"; a good portion of its execution power is wasted.

Executing multiple threads concurrently on a single processor, at the hardware level, increases the processor's efficiency dramatically.  This is the tangible benefit of SMT, or Jackson technology.

Jackson technology would make perfect sense to debut with a Dual Processor Xeon workstation, since the type of applications a user investing in such a workstation would be running are perfectly geared towards an SMT core.  Unfortunately, as we discovered a few months ago but were unable to share, the Intel Xeon processor being launched today does not have Jackson technology enabled.

However it is still quite clear from the information we received at IDF as well as some confirmation from sources close to Intel that Jackson technology is indeed on the roadmap.  The technology would be a huge step forward for Intel and could potentially offer some very attractive performance figures.  Also remember that Intel has a habit of producing a single core and adapting it for use in all of the major market segments, meaning that there is a very good possibility that Jackson technology, upon its release, could find its way into the desktop Pentium 4 as well as the workstation and server Xeon parts.

Another possibility is that the current Willamette core does have Jackson technology implemented but not necessarily enabled on the core.  There are a number of things Intel could be waiting for before announcing/enabling Jackson support, in particular, software support. 

We are still eagerly anticipating the debut of Jackson technology; unfortunately, we’re going to have to wait a bit longer for it.  Remember that there is a die-shrink coming up by the end of this year, and a Xeon MP (4+ processors) part with an on-die L3 cache coming out next year; both of those launches would be perfect for Jackson technology.  As usual, we will keep you updated on any findings in this area.



A New Package

Historically, Intel's Xeon line of processors (e.g. Pentium II Xeon, Pentium III Xeon) has always used a different interface than its desktop counterparts.  While the Pentium II and Pentium III processors used a 242-pin Slot-1 connector, their Xeon brothers used a 330-pin Slot-2 connector.  Most of the additional pins were used to supply additional power to the chips; with 2MB of L2 cache, the Pentium III Xeon definitely drew more power than its 256KB sibling in the desktop world.

Pentium 4's Socket-423



Xeon's Socket-603

The same change is taking place with the new Xeon.  The Pentium 4 uses a 423-pin socket interface while the new Xeon makes use of a new 603-pin interface.  Although the core of the chip is much smaller than the physical chip itself, the number of pins required to interface with the board requires that the chip be significantly larger than its core.  As you might guess, with a 43% increase in pin count and the same packaging as the Pentium 4, the Xeon would be incredibly large.  However, Intel made not only an interface change with the Xeon but a packaging change as well.

Those who are familiar with the mobile market will know that there are two packages that mobile Pentium III processors are available in: micro Pin Grid Array (PGA) and micro Ball Grid Array (BGA).  The Pentium 4 implements a Pin Grid Array package while the Xeon uses a microPGA interface reminiscent of the microPGA mobile CPUs.  Because of this, the actual size of the Xeon chip is no bigger than that of the Pentium 4; the pins on the bottom of the CPU are simply packed more densely.  The pins are also much shorter than those on the PGA Pentium 4, meaning that if you happen to bend any of the pins, undoing the damage will be much more difficult than it was on a PGA chip.


Pentium 4 (left) vs Xeon (right)

This microPGA interface is much like what the upcoming 0.13-micron Pentium 4s will debut with, although they will have 478 pins vs. 603 pins on the Xeon.  Because of the change in packaging, the interface socket has changed as well.  The Socket-603 is the first major change to an Intel socket since the first Zero Insertion Force (ZIF) sockets were introduced with the 486.



A New Platform

When the original Pentium II Xeon made its launch, it did so alongside two new chipsets for the processor: the i440GX and the i450NX.  The i440GX chipset was nothing more than a server version of the desktop 440BX chipset with support for twice as much memory.  The i450NX chipset was a true server chipset as it supported 64-bit PCI, and quad Pentium II Xeon processors. 

The Intel Xeon processor allows history to repeat itself again as it makes its initial debut with nothing more than a server/workstation version of the desktop i850 chipset: the i860 chipset.



The i860 does offer a few more enhancements over the desktop i850, primarily in its support for up to two 64-bit PCI buses alongside a single 32-bit PCI bus.

The 64-bit PCI buses each require a separate chip to be installed on the motherboard, the PCI 64 Hub (P64H).  There can be up to two of these on the board itself, and each one has a 64-bit bus to the i860 Memory Controller Hub (MCH).  This bus operates at the FSB frequency of the system, meaning that it will offer 800MB/s of bandwidth to and from the 64-bit PCI slots per P64H on the motherboard.  Now if you’ll remember from our previous discussions of 64-bit PCI buses on motherboards, even when operating at 66MHz these buses will only consume 533MB/s at peak bandwidth.  This means that the bus connecting the 64-bit PCI hub to the MCH offers more bandwidth than the slots can consume.  This is the ideal case since there are no bandwidth bottlenecks holding the peak transfer rates of your PCI-64 cards back.
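To put numbers on that headroom, here is a small Python sketch of the arithmetic.  It assumes, as described above, a 64-bit link running at the 100MHz FSB clock, and it uses the nominal 66.67MHz clock for 66MHz PCI; the helper name is ours.

```python
def parallel_bus_mb_s(width_bits, clock_mhz):
    """Peak MB/s of a bus that moves width_bits once per clock."""
    return width_bits / 8 * clock_mhz


hub_link = parallel_bus_mb_s(64, 100)      # P64H to MCH link at the FSB clock
pci64_66 = parallel_bus_mb_s(64, 200 / 3)  # 64-bit PCI at a nominal 66.67MHz

print(f"P64H link:        {hub_link:.0f} MB/s")
print(f"64-bit/66MHz PCI: {pci64_66:.0f} MB/s")
print(f"Headroom:         {hub_link - pci64_66:.0f} MB/s")
```

The link comes out at 800MB/s against roughly 533MB/s for the slots, which is where the claim of no bottleneck between the hub and the MCH comes from.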

The fact that the MCH must be able to connect to two 64-bit PCI hubs via two 64-bit buses means that the i860 MCH is noticeably larger than the i850’s MCH.

Other than the 64-bit PCI support, the i860 is no different than an i850 with support for dual processors.  This means that it also has the same dual channel RDRAM memory bus as the i850, offering 3.2GB/s of memory bandwidth.  This pairs up perfectly with the 3.2GB/s of FSB bandwidth to the dual processors, which is actually the same amount of bandwidth as the single Pentium 4.  Unlike the upcoming 760MP, the i860 implements a shared FSB among the dual processors, meaning that both of the CPUs have to fight for the 3.2GB/s of FSB bandwidth that is available to them.  AMD’s 760MP does implement a point to point bus, meaning that each of the processors installed on a 760MP board gets a full 1.6GB/s to 2.1GB/s of bandwidth.  Now before you immediately assume that the 760MP is better because of this, do remember that the two processors still have to fight for the same 2.1GB/s of memory bandwidth.
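A rough way to compare the two approaches is to look at the worst case, where every CPU is saturating its bus at the same time and any shared resource is split evenly.  The Python sketch below makes that assumption explicit (real bus arbitration is not this tidy, and the helper name is ours), using the peak figures quoted above and the 2.1GB/s end of the 760MP's range.

```python
def per_cpu_share(total_gb_s, cpus):
    """Even split of a shared resource when every CPU is saturating it."""
    return total_gb_s / cpus


# i860: one 3.2GB/s FSB shared by both Xeons, fed by 3.2GB/s of RDRAM.
i860_fsb_per_cpu = per_cpu_share(3.2, 2)
i860_mem_per_cpu = per_cpu_share(3.2, 2)

# 760MP: each Athlon has its own bus, but both draw on 2.1GB/s of memory.
amd_fsb_per_cpu = 2.1
amd_mem_per_cpu = per_cpu_share(2.1, 2)

print(f"i860  per CPU: FSB {i860_fsb_per_cpu:.2f} GB/s, memory {i860_mem_per_cpu:.2f} GB/s")
print(f"760MP per CPU: FSB {amd_fsb_per_cpu:.2f} GB/s, memory {amd_mem_per_cpu:.2f} GB/s")
```

Under this assumption each Xeon is left with 1.6GB/s on both the bus and the memory side, while each Athlon has up to 2.1GB/s of bus bandwidth but only about 1.05GB/s of memory bandwidth behind it.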

Quad Intel Xeon platform running on a Grand Champion HE Reference Board



The other platform that will become available for the Intel Xeon is the upcoming ServerWorks Grand Champion HE, which will debut with the Intel Xeon MP in quad processor configurations.  This chipset will use 4-way interleaved DDR SDRAM offering up to 6.4GB/s of memory bandwidth to the 1, 2 or 4 processors that the chipset supports.  For more information about the Grand Champion HE, see our quick preview of it.



The Motherboards

Interestingly enough, today's Xeon launch is met with motherboards from only two manufacturers: Iwill and Tyan.  Tyan did not have samples ready for review at press time (they have another interesting DP project they're working on) but Iwill managed to get us their DX400-SN.  This board is based on the Intel Maplegrove reference design, although Intel won't be manufacturing and selling a motherboard of their own based on the i860.  In fact, Intel won't be making any Xeon motherboards anytime soon, since they also won't make a ServerWorks based board for the Xeon MP that will debut late this year/early next year.



The board features eight RDRAM RIMM slots which is made possible by using two Memory Repeater Hubs (MRH) that split each of the two RDRAM channels into two more channels that can each support two RIMMs.  This is the only way that the board could gain acceptance in the server market since it isn't uncommon to have servers like this with multiple gigabytes of memory. 



In order to make room for all of these memory slots they are located on a riser card that sticks up out of the board.  This unfortunately means that there is no hope of getting either of these boards (the Tyan is designed the same way) to fit in anything smaller than a 4U or 5U rackmountable case. 

Both the Iwill and the Tyan boards require the use of a WTX power supply which features a longer connector than an ATX power supply as well as a 4 x 2 secondary connector. We tested using a 430W WTX Power Supply.

The Iwill board was actually quite impressive for a board from the smallest motherboard manufacturer in Taiwan.  We had no stability problems with the motherboard in spite of the tests we bombarded it with.



The Quirks of MP

The major problem with evaluating an MP system such as the new Intel Xeon is how to measure performance.  In fact, most of the benchmarks from our usual test suite will not show any noticeable performance increase over a single processor.  Does that mean that you shouldn't concern yourself with this review since going MP won't give you any performance boost?  Not at all; in fact, there is a very good chance that you can improve your performance by going MP.

There are three requirements for enjoying the benefits of multiple processors:

1)      Operating system support - if your OS doesn't support multiple processors then you won't be able to take advantage of your second (or third, fourth, etc.) CPU at all; it will simply go unused.  None of the Windows 9x and ME OSes support MP; however, Windows NT, 2000, XP Professional (the home edition won't support it), Linux, Unix, etc. all support MP operation.

2)      Application support - this isn't actually a requirement but it is ideal for getting the most out of multiple processors.  If your application is specifically designed for use with multiple processors (generically referred to as multithreaded, since each CPU can only handle one thread at a time) then you will generally get a reasonable performance improvement by going MP.  Examples of such applications are the majority of database servers (e.g. Oracle, SQL), and some 3D rendering programs (e.g. 3D Studio MAX).  However, not all "high-end" applications take advantage of having multiple processors; a case in point is PTC's Pro/ENGINEER.

3)      The need for a second CPU - we just mentioned that you don't necessarily need application support to enjoy the benefits of having a second processor, which is true.  If your applications aren't specifically designed to take advantage of multiple CPUs then you have to at least be running more than one application at a time in order to give your second CPU a workout.  This requirement is actually one of the most difficult to describe via benchmarks, but it is arguably one of the most useful for AnandTech readers that aren't running DB servers, since many of them use their systems in this manner.

With those three requirements in mind, we can start to take a look at benchmarking this monster.



The Test

We compared performance using four setups:

AMD Athlon-C 1.2GHz on an AMD 760 motherboard (we chose the 1.2GHz Thunderbird for future comparison)

Dual Intel Pentium III 933 on a VIA Apollo Pro266 motherboard.  As we've already discovered, the Pro266 offers similar performance to the i840 at a much lower cost.  And the Pentium III performs identically to the entry level Pentium III Xeons with 256KB L2 cache.

Dual Intel Xeon 1.7GHz on an Iwill 860 motherboard.

Single Intel Xeon 1.7GHz on an Iwill 860 motherboard.  This configuration performs identically to a single processor Pentium 4 running at 1.7GHz. 

We used 512MB of memory for all of the desktop/workstation tests and 1GB of memory for all of the server tests. 

Windows 98SE / 2000 Test System

Hardware

| Component | Configuration |
|---|---|
| CPU(s) | Intel Pentium III 933MHz x 2; Intel Xeon 1.7GHz x 2; AMD Athlon-C "Thunderbird" 1.2GHz |
| Motherboard(s) | Iwill DVD266-R; Iwill DX400-SN; ASUS A7M266 |
| Memory | 1GB PC2100 Corsair DDR SDRAM; 1GB PC800 Toshiba RDRAM |
| Hard Drive | IBM Deskstar 30GB 75GXP 7200 RPM Ultra ATA/100 |
| CDROM | Philips 48X |
| Video Card(s) | NVIDIA GeForce2 Ultra 64MB DDR (default clock - 250/230 DDR) |
| Ethernet | Linksys LNE100TX 100Mbit PCI Ethernet Adapter |

Software

| Component | Configuration |
|---|---|
| Operating System | Windows 2000 Professional SP2; Windows 2000 Server SP2 |
| Video Drivers | NVIDIA Detonator3 v6.50 @ 1024 x 768 x 16 @ 75Hz; NVIDIA Detonator3 v6.50 @ 1280 x 1024 x 32 (SPECviewperf) @ 75Hz; VIA 4-in-1 4.31V was used for all VIA based boards |



Memory Bandwidth Comparison

As usual, we'll start off our evaluation of the Intel Xeon with some memory bandwidth benchmarks.  The Pentium 4 has already proved to be extremely appreciative of the large amount of memory bandwidth provided to it by the i850 chipset with its dual Rambus channels.  The Intel Xeon should be no different since the CPUs are identical and the available memory bandwidth is identical as well.

The Linpack performance graph provides us with the answer we already knew; as the size of the data being worked on increases past what the CPUs can hold in their caches, the FPU throughput is determined by the amount of available memory bandwidth.

Here we can see exactly what will give the Xeon the edge over the Intel Pentium III (and the Pentium III Xeon) in memory/FSB bandwidth intensive server applications.  We will get into examples of these applications later on, but remember that video encoding isn't the only bandwidth intensive application out there.

The fluctuations in the Dual Xeon graph were not present when only a single processor was used.  Unfortunately we were unable to explain this phenomenon beyond attributing it to the juggling of tasks between the two processors.

Earlier we mentioned that regardless of whether you have a point-to-point bus or a shared FSB, your CPUs are still going to have to contend with one another for the same amount of memory bandwidth through the same bus. 

A good way of illustrating this is by running two instances of Linpack simultaneously on the DP Intel Xeon system and noting the drop in memory bandwidth available to each processor individually.

Looking at the drop in performance after the matrix size exceeds the cache size of the individual processors, you will notice that the performance of the two processes is much lower than the performance of a single process running on the system.  The reason is that the amount of memory bandwidth the i860's dual RDRAM channels are able to deliver still isn't enough to feed the extremely hungry processors.

It is because of this that we see the need for the ServerWorks Grand Champion HE chipset with its 6.4GB/s of memory bandwidth when the Xeon is eventually placed in quad processor systems.



The important thing to take away from the above two bandwidth tests is that the Dual Xeon 1.7GHz has no more memory bandwidth than a single processor Xeon running at the same speed.  This unfortunately means that as clock speeds increase, a dual processor Xeon system will experience memory bandwidth limitations quicker than a single processor Xeon (or Pentium 4) system.

Another interesting thing to note is that the Dual Pentium III setup has the least amount of memory bandwidth of all four configurations; even less than the single processor Athlon.  This combined with the 1GB/s of bandwidth available to its FSB makes the dual processor Pentium III and Pentium III Xeon easily bottlenecked.



You'll remember from our Pentium 4 1.7GHz Review that BAPCo has really come around with their latest benchmark creation, SYSMark 2001.  The reason is that it benchmarks systems in the same manner that many enthusiasts use their computers: with multiple applications running at once.

Keep in mind that the third requirement we listed for allowing a performance boost to be seen from adding a second processor was that more than one application is running at once.  If SYSMark 2001 truly multitasks and stresses the CPUs, then we should see a performance increase from a single Xeon 1.7 to a dual Xeon 1.7. 

The Internet Content Creation suite of SYSMark 2001 revolves mostly around image/video editing as well as publishing.  However what gives the NetBurst architecture (and thus the Pentium 4/Xeon) the edge here is that a video is being encoded in Windows Media Encoder in the background while the benchmark is running.  This puts an extreme amount of stress on the FSB and memory bandwidth subsystems, allowing the Xeon to completely dominate here. 

The metric being displayed above is the average response time of the applications, not including the time spent waiting for user input.  This means that moving from a single processor Xeon 1.7 to a dual processor Xeon 1.7 reduced the average response time by 27%.  While this isn't an incredible improvement, it does offer performance currently unavailable from any single processor system.

The SYSMark 2001 rating obviously mimics the average response time for the Internet Content Creation test.  If the Xeon continued to scale in this benchmark with clock speed as it has thus far (for results see our Pentium 4 1.7GHz Review), then the performance levels offered by this Dual Xeon at 1.7GHz would be greater than that of a single Xeon running at ~2.1GHz. 



The Office Productivity suite is much less of a niche application test than the Internet Content Creation suite and also depends much less on memory/FSB bandwidth.  As we noticed in our Pentium 4 1.7GHz Review, the Athlon was able to rise to the top here quite easily. 

In this case, the move to dual processors reduced the average response time by only 11%.  This indicates that in order to notice a performance boost from dual processors the workload must be stressful enough on the cache, memory and front side buses of the processors to truly saturate the CPUs.  The 11% lead here could have been just as easily gained by moving to a single, faster CPU (when available) for a much lower investment. 
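Since these results are reported as reductions in response time, it can help to convert them into the more familiar "times faster" form.  Below is a minimal Python sketch of that conversion, assuming the workload is otherwise identical between the single and dual processor runs (the function name is ours).

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional drop in run time into a speedup factor.

    A 27% reduction means the new time is 0.73 of the old one, so the
    speedup is 1 / 0.73.
    """
    return 1.0 / (1.0 - reduction)


content_creation = speedup_from_time_reduction(0.27)
office_productivity = speedup_from_time_reduction(0.11)

print(f"Internet Content Creation: {content_creation:.2f}x")
print(f"Office Productivity:       {office_productivity:.2f}x")
```

By this measure the dual Xeon is roughly 1.37x the speed of the single Xeon in Internet Content Creation and roughly 1.12x in Office Productivity.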

What you gain from dual processors really depends on what you are throwing at them.

Again, these results mimic the average response time.  The Pentium 4 (and thus the Xeon) doesn't scale too well in this test, meaning that the 11% improvement provided by going to dual processors here would be very difficult to achieve with higher clocked CPUs.

If this is how you use your computer then you may be better off going AMD on this one.  The Athlon scales much better in this test because of its relatively short pipeline and large caches, allowing it to scale at 69% of the clock speed increase up to 1.33GHz.  Provided that this trend continues uninhibited, the Athlon is clearly the better processor for these sorts of tasks while the Xeon and Pentium 4 take the Internet Content Creation category.



The overall SYSMark 2001 improvement seen by moving to a dual processor Xeon configuration ended up being 24%, however that number can be quite misleading depending on the type of applications you run.  If you are running (and multitasking with) more of the Internet Content Creation applications then moving to a DP Xeon may be of some worth to you.  However, if you find yourself better characterized by the Office Productivity suite then the DP Xeon will be overkill.

One of our returning favorites, Benchmark Studio, rounds off our evaluation of the Xeon as a high-performance desktop solution by showing us that the dual Xeon 1.7 can complete our MP torture test in 83% of the time its single processor counterpart takes.



High-End Workstation Performance

Our next set of tests brings us to Ziff Davis' Dual Processor Inspection Tests that were a part of High-End Winstone 99.  These tests were composed of three applications, MicroStation SE (CAD), Photoshop 4.0 (image editing), and Visual C++ (development), all of which were specifically designed with MP support in mind.  This should give us a good idea of how well a dual Xeon configuration can perform in a situation where the applications that you're running are specifically designed for use with multiple processors.

MicroStation SE is a CAD/modeling package that is extremely FPU intensive.  We have already seen indications that the FPU on the Xeon (and Pentium 4) is quite poor at executing unoptimized x87 code, which is definitely present in the MicroStation SE test.  This allows the Athlon, with its extremely powerful FPU, to dominate not only the single processor Xeon but also the dual processor Xeon and Pentium IIIs.

Even the Dual Pentium III 933 is able to edge out the dual Xeon 1.7 because of the fact that it is stronger at unoptimized x87 calculations which compose most of today's FP intensive applications.  In order for the Xeon to succeed here, applications must take advantage of the SSE2 instructions supported by the Xeon.

The picture changes dramatically in the Photoshop test.  The most noticeable thing is that just moving to dual processors allows the Xeon to enjoy a 46% performance boost.

A 58% increase in performance can be seen under Visual C++.  Quite impressive.

Once again, when looking at the overall picture, the performance improvement the Xeon takes home is on average a little over 26% when using dual processors.  Just as was the case with SYSMark 2001, this figure is watered down considerably by the one test that did poorly on the Xeon.  We can only show you how the setup will perform in the various tests; it is up to you to decide which tests resemble the applications and patterns in which you use your computer.



Image Editing w/ Photoshop

The latest Photoshop 6.0.1 patch supposedly provided enhancements for Intel's NetBurst architecture, making it the perfect addition to our suite of tests.  Unfortunately this patch also caused our Dual Pentium III 933 platform to fail the Polar Coordinates conversion test which kept that system out of the final score.

Time to Complete in Seconds (lower is better)

| Filter/Action | Dual Intel Xeon 1.7GHz | Single Intel Xeon 1.7GHz | Dual Intel Pentium III 933MHz | Single Athlon-C 1.2GHz |
|---|---|---|---|---|
| Rotate 90 | 7.9 | 7.8 | 6.8 | 10.0 |
| Rotate 9 | 10.9 | 13.8 | 10.7 | 13.5 |
| Rotate .9 | 11.1 | 13.0 | 10.3 | 12.4 |
| Gaussian Blur 1 pixel | 6.3 | 6.6 | 5.1 | 7.0 |
| Gaussian Blur 3.7 pixels | 10.5 | 11.8 | 11.4 | 15.2 |
| Gaussian Blur 85 pixels | 10.9 | 12.6 | 12.4 | 16.7 |
| 50%, 1 pixel, 0 level Unsharp Mask | 4.4 | 5.3 | 4.4 | 5.5 |
| 50%, 3.7 pixel, 0 level Unsharp Mask | 10.7 | 12.2 | 11.7 | 16.5 |
| 50%, 10 pixel, 5 level Unsharp Mask | 10.9 | 12.7 | 11.9 | 16.1 |
| Despeckle | 6.7 | 9.5 | 7.1 | 8.1 |
| RGB-CMYK | 26.6 | 26.5 | 26.7 | 21.9 |
| Reduce Size 60% | 2.8 | 3.3 | 3.2 | 4.1 |
| Lens Flare | 12.6 | 16.0 | 14.4 | 17.6 |
| Color Halftone | 30.1 | 30.8 | 3.3 | 19.1 |
| NTSC Colors | 8.6 | 8.4 | 9.4 | 8.5 |
| Accented Edges Brush Strokes | 25.1 | 25.7 | 28.3 | 24.7 |
| Pointillize | 26.2 | 42.1 | 29.7 | 43.6 |
| Water Color | 54.1 | 54.4 | 58.5 | 48.9 |
| Polar Coordinates | 17.3 | 28.1 | Failed | 24.9 |
| Radial Blur | 57.5 | 101.8 | 70.2 | 108.1 |
| Lighting Effects | 6.4 | 7.7 | 8.4 | 15.7 |

Here the dual Xeon manages to complete all of the filters in less than 80% of the time of the single processor Xeon.  Another interesting note worth making is that with the new 6.0.1 patch, the Athlon-C 1.2GHz processor is just about as fast as a Pentium 4/Xeon at 1.7GHz. 
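For anyone who wants to check those two claims against the table, the Python sketch below simply adds up the per-filter times.  The numbers are copied from the table above in row order; the dual Pentium III is left out because of its failed Polar Coordinates run, and the variable names are ours.

```python
# Seconds per filter, in the same row order as the table above.
dual_xeon = [7.9, 10.9, 11.1, 6.3, 10.5, 10.9, 4.4, 10.7, 10.9, 6.7, 26.6,
             2.8, 12.6, 30.1, 8.6, 25.1, 26.2, 54.1, 17.3, 57.5, 6.4]
single_xeon = [7.8, 13.8, 13.0, 6.6, 11.8, 12.6, 5.3, 12.2, 12.7, 9.5, 26.5,
               3.3, 16.0, 30.8, 8.4, 25.7, 42.1, 54.4, 28.1, 101.8, 7.7]
athlon = [10.0, 13.5, 12.4, 7.0, 15.2, 16.7, 5.5, 16.5, 16.1, 8.1, 21.9,
          4.1, 17.6, 19.1, 8.5, 24.7, 43.6, 48.9, 24.9, 108.1, 15.7]

dual_total = sum(dual_xeon)
single_total = sum(single_xeon)
athlon_total = sum(athlon)

print(f"Dual Xeon:   {dual_total:.1f} s")
print(f"Single Xeon: {single_total:.1f} s")
print(f"Athlon-C:    {athlon_total:.1f} s")
print(f"Dual / single Xeon time: {dual_total / single_total:.1%}")
```

The totals come to 357.6 seconds for the dual Xeon, 450.1 seconds for the single Xeon and 458.1 seconds for the Athlon, so the dual Xeon needs about 79% of the single Xeon's time and the Athlon trails the single Xeon by only a few seconds.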



Linux Performance

For Linux benchmarking, we used the classic kernel compilation tests. Kernel compilation tests have the advantage of being able to specify the number of processes to run concurrently as well as being both CPU and memory bandwidth bound. The Xeon, with its quad-pumped FSB, should make much better use of its memory bandwidth than the Pentium III, which hopefully is demonstrated by the compilation times.

This test is easily reproducible on your own machine. We used the latest released Linux kernel, 2.4.4, with default options obtained by running 'make menuconfig' and exiting without changing any values. You could also type 'make config' and hold the return key down to get the same results. To specify the number of concurrent processes make will spawn for compiling, use the -j flag such as: 'make -j 2 vmlinux' which specifies two processes.
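If you want to time the builds yourself, something along the lines of the Python sketch below would do the job.  This is not the script we used; it is a minimal illustration that wraps the 'make -j N vmlinux' command quoted above and measures wall-clock time.  The helper names and the /usr/src/linux path in the usage comment are hypothetical, and you would want to clean the tree between runs so that each build starts from the same state.

```python
import subprocess
import time


def build_command(jobs, target="vmlinux"):
    """Return the make invocation for a given number of concurrent processes."""
    return ["make", "-j", str(jobs), target]


def time_build(jobs, source_dir):
    """Run one build inside source_dir and return the elapsed minutes."""
    start = time.perf_counter()
    subprocess.run(build_command(jobs), cwd=source_dir, check=True)
    return (time.perf_counter() - start) / 60.0


# Example usage (assumes an already configured kernel tree at this
# hypothetical path):
#     for jobs in (1, 2, 3):
#         print(jobs, round(time_build(jobs, "/usr/src/linux"), 2))
```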

To limit the amount of CPU time held up in disk usage, we double-checked Red Hat's disk settings and stepped them up to 32-bit I/O and UltraDMA mode 4. To do this, we used the following hdparm command:

hdparm -c1 -d1 -k1 -X68 /dev/hda
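In that command, -c1 enables 32-bit I/O, -d1 enables DMA, -k1 keeps the settings over a drive reset, and -X sets the transfer mode. The -X value is an offset plus the mode number, which is where 68 comes from for UltraDMA mode 4. The short Python sketch below only illustrates that arithmetic; the helper names are hypothetical and it does not run hdparm.

```python
# Offsets hdparm documents for the -X transfer mode argument.
XFER_OFFSETS = {"pio": 8, "mdma": 32, "udma": 64}


def xfer_mode_value(kind, mode):
    """Return the numeric -X argument for a transfer mode, e.g. UltraDMA mode 4."""
    return XFER_OFFSETS[kind] + mode


def hdparm_command(device, udma_mode):
    """Build (but do not execute) the argument list used in this article."""
    return ["hdparm", "-c1", "-d1", "-k1",
            "-X{}".format(xfer_mode_value("udma", udma_mode)), device]
```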

Note that I/O access will still factor into these tests, but since we used the same drive on every machine, it should not favor any one of them. Also, 512MB of RAM should be enough to ensure that make, gcc and a good number of the core include files stay in Linux's file system cache.

After we finished all the Athlon benchmarks, we went to start on the Xeon scores and ran into a wall. Neither the Red Hat 7.1 install kernel nor their install CDs would boot. Neither would the 2.4.4 kernel. Upgrading to 2.4.4-ac9 finally worked and we managed to test with that. So, note that these scores were obtained using two different kernels.

What is very interesting about this is that the 2.4.4 kernel we compiled was compiled to support Pentium 4 class CPUs. Thus, the kernel already included support for the base Pentium 4. The Xeon does not vary enough to warrant a failure, and the i860 chipset is only a slightly modified i850. Still, this is brand new hardware and maybe it's asking too much to expect older kernels to run on first try.

Linux Kernel Compilation Tests
Compile Time in Minutes (lower is better)

| Processor/Platform | 1 process | 2 processes | 3 processes |
|---|---|---|---|
| Dual Intel Xeon 1.7GHz | 4.12 | 2.465 | 2.467 |
| Dual Intel Pentium III 933MHz | 5.09 | 3.12 | 3.135 |
| Single AMD Athlon-C 1.2GHz | 4.85 | 4.9 | 4.91 |

The Athlon benchmarks are meant to provide a basis for comparison between platforms. As you'll notice, the performance of the single-processor Athlon platform doesn't improve as the number of concurrent processes increases. This is because there is only one CPU to handle the processes, so adding more doesn't help the single CPU out any. Dual AMD vs. Intel benchmarks will have to wait for a future article.

In comparison to the Pentium 4 and the Athlon, the Pentium III is rather FSB and memory bandwidth starved with only 1GB/s available. We actually expected the Xeon to show a larger improvement than the Pentium III when moving from 1-process to 2-process builds. Instead, both took roughly 60% as long as their single-process builds (about 60% for the Xeon and 61% for the Pentium III). A perfectly scaling dual-processor build would take 50% as long, so this tells us that the remaining 10% must be related to inherent inefficiencies associated with SMP, and that a kernel compile must not stress FSB and memory bandwidth enough to show the Pentium III's weakness.
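The scaling figures can be recomputed from the compile times in the table. The sketch below is only a quick arithmetic check in Python; the function name and the "efficiency" term (speedup divided by the number of CPUs) are ours for illustration.

```python
# 1-process and 2-process compile times in minutes, copied from the table.
compile_minutes = {
    "Dual Intel Xeon 1.7GHz": (4.12, 2.465),
    "Dual Intel Pentium III 933MHz": (5.09, 3.12),
}


def scaling(one_proc, two_proc, cpus=2):
    """Return (time ratio, speedup, efficiency) for a dual-CPU build."""
    ratio = two_proc / one_proc      # fraction of the 1-process time
    speedup = one_proc / two_proc    # how many times faster
    efficiency = speedup / cpus      # 1.0 would be perfect scaling
    return ratio, speedup, efficiency


for name, (one, two) in compile_minutes.items():
    ratio, speedup, efficiency = scaling(one, two)
    print("{}: {:.1%} of 1-process time, {:.2f}x speedup, {:.0%} efficiency"
          .format(name, ratio, speedup, efficiency))
```

Both platforms land at a ratio of roughly 0.60 to 0.61, which is why the second CPU buys about a 1.6x to 1.7x speedup rather than 2x.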



The Xeon goes real-world

So far the benchmarks we have shown you have been quite impressive, but still very workstation oriented.  One of the biggest selling points of the Xeon family has been that it makes for some killer servers.  In the past, its large L2 cache (up to 2MB) has made it ideal for database servers, and now the new architecture behind it (NetBurst) could make it even more attractive.

On the desktop side, Intel's NetBurst architecture has been relatively unimpressive.  Part of the reason for this lack of enthusiasm is that the areas where the architecture excels on the desktop today are niche applications such as video encoding.  The problem is that there are relatively few bandwidth intensive applications on the desktop side of things, making the Pentium 4 and Xeon unattractive until they ramp up in clock speed even more. 

On the server side, applications are already extremely bandwidth intensive.  On your desktop there is never a program that requires multiple computers in order to run properly, yet in the server world this is a very common practice; it's called clustering.  In fact, we employ quite a bit of that here at AnandTech.

In order to measure the server performance of the Intel Xeon processor, we put it in a situation that we could actually get some use out of.  Instead of running a slew of synthetic benchmarks to simulate server performance, we put it in the same position as one of the servers that make up our server farm here at AnandTech.

The AnandTech Forums have increased in popularity tremendously over the years.  Today, they are home to close to 56,000 registered users, 600 of whom are usually logged on at any given time.  There are thousands of guests browsing the forums as well who aren't accounted for in those statistics.  The Forums run FuseTalk, a database driven discussion package developed by our own Jason Clark and sold through his company e-Zone Media.

Whenever a message is posted to the AnandTech Forums, FuseTalk intercepts the message and writes it to a database.  Whenever a message or a thread (collection of messages) is read, it is pulled from that database and output dynamically through Cold Fusion.  There are no static HTML files that make up the forums; everything is served dynamically.  This does have its pros and cons; the biggest pro being that the system is quite robust, easy to search through and quite easy to keep running.  The biggest downside to an entirely dynamic, database driven forum is that it puts an incredible amount of load on the database server that is actually serving out all the content.

The AnandTech Forums database is currently almost 3GB in size.  It contains 3.1 million publicly viewable messages organized into 357,000 threads.  Between its 56,000 users there exist a total of almost 1.3 million private messages, all of which are stored in the AnandTech Forums database.

The database server itself is a dual processor Intel Pentium III 800 system with 1.5GB of PC133 SDRAM; however, the processors aren't its strong point.  With a database server it is very easy to become bottlenecked by your I/O, meaning that your CPU and the rest of your system are waiting for data to be read off of your hard disk(s) before they can output anything.  In order to combat this, our database server features a four drive RAID 10 array of Quantum Atlas 10K II hard drives (10,000RPM, 8MB buffers). 

What does this have to do with the Intel Xeon processor and how it performs in a server scenario?  In order to truly test it and provide us with useful data as to how we would like to upgrade our Forums database server, we recorded a "trace" or a snapshot of every transaction handled by the AnandTech Forums Database Server over a period of 30 minutes.  This includes reads and writes to the AnandTech Forums database; basically everything that was posted, replied to, edited, sent or received within that 30 minute time period.

This snapshot was then played back on the test systems as fast as each system could possibly play the trace.  If you've followed our benchmarks in the past, you will know that this is much like a "timedemo" under Quake III Arena.  The only difference between this trace playback and a Quake III Arena demo is that instead of performance being reported in frames per second, it is reported in time to complete.  Since the trace took 30 minutes to record, it should take much less than that to complete; there is no waiting for users to input data, so it runs as fast as it possibly can.
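To make the idea of trace playback concrete, here is a toy Python sketch. It is not our test harness: it uses the standard library's sqlite3 module with an in-memory database purely as a stand-in, and the `messages` table, the sample statements and the `replay` function are all made up for illustration.

```python
import sqlite3
import time


def replay(conn, trace):
    """Execute every recorded statement back to back; return elapsed seconds."""
    start = time.perf_counter()
    for sql, params in trace:
        # No pause between statements: the trace runs as fast as the system allows.
        conn.execute(sql, params).fetchall()
    conn.commit()
    return time.perf_counter() - start


# Hypothetical schema and a four-statement "trace" standing in for 30 minutes of traffic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
trace = [
    ("INSERT INTO messages (body) VALUES (?)", ("first post",)),
    ("SELECT body FROM messages WHERE id = ?", (1,)),
    ("UPDATE messages SET body = ? WHERE id = ?", ("edited", 1)),
    ("SELECT body FROM messages WHERE id = ?", (1,)),
]
elapsed = replay(conn, trace)
print("trace replayed in {:.6f} seconds".format(elapsed))
```

The point is simply that the score is the elapsed time for a fixed list of operations, so a lower number means a faster platform.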

In order to minimize the I/O bottlenecks, the test systems were not only outfitted with four Quantum Atlas 10K hard drives in RAID 0 (offering more write bandwidth than, but similar read bandwidth to, our Forums DB server's four drive RAID 10 array) but were also given 1GB of memory.

During the 30 minute recording there were 105,267 selects, 4,984 updates, 701 inserts and 5 deletes performed on the database.  The names of the tasks describe exactly what they are; selects are reads, updates are reads and writes, inserts are writes and deletes remove data from the database (and are quite rare). 

The first thing to notice is that the test is extremely read intensive, meaning that the I/O bottlenecks aren't as great as they would be if the test were more write intensive.  You can generally read faster than you can write, so this should mean that the test will be more dependent on a fast platform, provided that it isn't I/O bottlenecked from the start. 
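How read intensive? A few lines of Python arithmetic on the operation counts above give the mix. Only the four counts and the 30 minute duration come from the recording; the variable names are ours.

```python
# Operation counts from the 30 minute recording.
ops = {"select": 105267, "update": 4984, "insert": 701, "delete": 5}

total = sum(ops.values())
read_share = ops["select"] / total   # pure reads
ops_per_second = total / (30 * 60)   # average rate while recording

for name, count in ops.items():
    print("{:<7} {:>7}  {:.2%}".format(name, count, count / total))
print("total   {:>7}  ({:.1f} operations per second on average)"
      .format(total, ops_per_second))
```

Selects make up just under 95% of the 110,957 recorded operations, and the live server averaged a little under 62 operations per second over the half hour.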

If your particular database application is more write intensive the performance results should be similar in terms of the standings of the processors, but the performance gap will be decreased provided that the I/O doesn't change. 

The nature of the AnandTech Forums database is that there are very few computationally intensive functions performed on the database; most of the functions are straight reads and writes.  This places the performance dependency on having a fast platform, not necessarily CPUs with powerful integer/floating point units.  If you've read this far then you know that the Xeon has arguably the most robust platform of anything in its class, meaning that it should perform quite well here. 

Now the moment you've been waiting for, does it?  In desktop applications the Pentium 4 at 1.7GHz is sometimes unable to hold much of a lead over the Pentium III it replaced.  Can the NetBurst Architecture of the Xeon prove to be useful in the world of database servers?



Database Server Comparison
Time to run 30 minute trace at full speed (lower is better)

| Processor/Platform | Time to complete |
|---|---|
| Single Intel Xeon 1.7GHz | 22 minutes 31 seconds 532 ms |
| Dual Intel Xeon 1.7GHz | 14 minutes 49 seconds 47 ms |
| Dual Intel Pentium III 933MHz | 22 minutes 34 seconds 625 ms |
| Single AMD Athlon-C 1.2GHz | 18 minutes 6 seconds 437 ms |

There are a few important points to take away from these results. First of all, it is obvious that the incredible FSB and memory bandwidth that the Xeon platform offers is coming in handy quite a bit in this test. Even the single processor Xeon at 1.7GHz is able to complete the test in slightly less time (by about three seconds) than the Dual Pentium III 933.

The Dual Intel Xeon running at 1.7GHz is able to complete the test in roughly 66% of the time of both the single Xeon 1.7 and the dual Pentium III 933.

The Xeon isn't the only kid on the block with a decent amount of bandwidth at its disposal, as the single processor Athlon-C is actually able to beat the single Intel Xeon 1.7, finishing the trace in roughly 20% less time. One can only wonder what a pair of these 1.2GHz processors would be able to accomplish in this test...
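The relative standings are easy to recompute from the table by converting each result to seconds. The Python below is just that arithmetic; the helper name `to_seconds` is ours.

```python
def to_seconds(minutes, seconds, ms):
    """Convert a 'M minutes S seconds X ms' result into seconds."""
    return minutes * 60 + seconds + ms / 1000.0


# Trace playback times copied from the table.
results = {
    "Single Intel Xeon 1.7GHz": to_seconds(22, 31, 532),
    "Dual Intel Xeon 1.7GHz": to_seconds(14, 49, 47),
    "Dual Intel Pentium III 933MHz": to_seconds(22, 34, 625),
    "Single AMD Athlon-C 1.2GHz": to_seconds(18, 6, 437),
}

baseline = results["Single Intel Xeon 1.7GHz"]
for name, secs in sorted(results.items(), key=lambda item: item[1]):
    print("{:<30} {:>9.3f} s  {:.1%} of single Xeon time"
          .format(name, secs, secs / baseline))
```

On these numbers the dual Xeon needs about 66% of the single Xeon's time, and the single Athlon-C about 80%.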

Final Words

We originally wanted to include Web Server benchmarks in this review as well. Unfortunately, our test creation took longer than expected, so you won't get to see the Xeon as a Web Server until our next MP review, which should be coming up soon.

Currently the new Intel Xeon has very little competition. With the incredible amount of FSB and memory bandwidth the platform offers, the Intel Xeon is unstoppable. We have shown that this sort of processing power comes in handy not only in the server arena but also in desktop and high-end workstation environments. It really comes down to the types of applications you run, and whether your usage patterns would be better helped by adding a second processor.

However there are a few hurdles that the Xeon must overcome in order for it to be successful. If you haven't noticed, with only two motherboard manufacturers signed on to produce i860 motherboards it's pretty obvious that very few are taking the Xeon seriously. This isn't without good reason; with support for no more than 4GB of RAM and with RDRAM still priced higher than DDR SDRAM, the i860 platform isn't ideal for the truly high end servers that demand a minimum of 4GB of memory. Luckily this will be taken care of with the ServerWorks Grand Champion HE, leaving the i860 to take the workstation/entry-level server markets.

The real question on everyone's mind is how do the i860 and the Intel Xeon compare to the upcoming 760MP and the Athlon 4? We have been benchmarking that very combination for weeks now and soon enough we will be able to provide you with the definitive answer in many more test scenarios than those we just presented to you.
