Original Link: https://www.anandtech.com/show/1575




Introduction

With the leap in performance that both ATI and NVIDIA made on the desktop earlier this year, we were very excited about diving into workstation performance once boards were available. As usual, the workstation parts trailed their consumer counterparts in hitting the marketplace.

We have also been very keen to see what the new architecture from 3Dlabs has to offer in the form of the Wildcat Realizm 200. This 512MB workstation card is equipped with plenty of processing power and supports VS 2.0 and PS 3.0 level functionality. In bringing a larger pixel shader feature set to the table than ATI, the Realizm and Quadro have an early advantage. Of course, the true test will come in our performance numbers.

Performance testing workstation level hardware has been very tricky in the past, but with the release of SPECviewperf 8.0.1, we were able to get a little help. SPEC really improved the quality of this benchmark over previous versions; in our opinion, it's now more reflective of real world performance than previous versions were. Given the difficulty associated with testing applications manually, we welcomed the inclusion of SPEC in our test suite. SPEC traces are taken from real applications, and the OpenGL commands are issued to the card (without the application itself running). The viewperf test is the only "synthetic" test that we use; all other tests are benchmarked within the application itself.

Today, we will be looking exclusively at the AGP lineup. Part of the decision to focus on AGP first has to do with platform: we were unable to get our hands on a PCI Express flavor of the type of system that we wanted to run for our first workstation graphics review in time for this article. On the AGP side, however, IWill was very happy to provide us with their DK8N motherboard. The board supports 2 Opteron processors, and we wanted to make sure that our test bed had ample CPU power to allow the graphics cards to shine.

Each vendor offers a much more powerful PCI Express version of its card. 3Dlabs goes so far as to offer a multi-chip solution with two GPUs and a third chip, called a vertex/scalability unit, that handles vertex processing and division of labor between the two GPUs. We are naturally very interested in testing performance on the PCI Express workstation side as well, and we plan on doing a follow up to this article that targets just that.

For this article, we will start by looking at the architecture of each workstation GPU. The ATI and NVIDIA parts are based around their desktop parts, but we will give them a proper dissection here as well. We've never taken a look at the architecture of the 3Dlabs Wildcat Realizm part before now, so we'll begin there.




3Dlabs Wildcat Realizm 200 Technology

Fixed function processing may still be the staple of the workstation graphics world, but 3Dlabs isn't going to be left behind in terms of technology. The new Wildcat Realizm GPU supports everything that one would expect from a current generation graphics processor, including VS 2.0 and PS 3.0 level programmability. The hardware itself weighs in at about 150M transistors and is fabbed on a 130nm process. We're taking a look at the top end AGP SKU from 3Dlabs; the highest end offering essentially places two of these GPUs on one board, connected by a vertex/scalability unit, but we'll take a look at that when we explore PCI Express workstations. The pipeline of the Wildcat Realizm 200 looks very much like what we expect a 3D pipeline to look like.



When we start to peel away the layers, we see a very straightforward and powerful architecture. We'll start by explaining everything except the Vertex Array and the Fragment Array, which will get a more detailed investigation shortly.

The Command Processor is responsible for keeping track of command streams coming into the GPU. Of interest is the fact that it supports the NISTAT and NICMD registers for isochronous AGP operation. This allows the card to support requests by applications for a constant stream of data at guaranteed intervals. Cards without isochronous AGP support must be capable of handling arbitrarily long delays in request fulfillment, depending on the capabilities of the chipset. This feature is particularly interesting for real-time broadcast video situations in which dropped frames are not an option. Of course, it only works with application support, and we don't yet have a good test of the impact of isochronous AGP operation.

Visibility is computed via a Hierarchical Z algorithm. Essentially, large tiles of data are examined at a time. If the entire tile is occluded, it can all be thrown out. 3Dlabs states that their visibility algorithm is able to discard up to 1024 multi-samples per clock. This would fit the math for a 16x16 pixel tile with 4x multi-samples per pixel. This is actually 256 pixels, which is the same size block as ATI's Hierarchical Z engine. And keep in mind, these are maximal numbers; if the tile is partially occluded, only part of the data can be thrown out.
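To make the tile math concrete, here is a minimal sketch of the idea in Python. This is our illustration of hierarchical Z in general, not 3Dlabs' actual implementation (whose internals aren't public): each 16x16 tile keeps one conservative (farthest) depth, and a whole tile of incoming geometry can be killed with a single comparison.

```python
import numpy as np

# One conservative max depth per 16x16 tile of a 1024x1024 depth buffer
# (smaller depth = closer to the viewer).
depth_buffer = np.random.rand(1024, 1024).astype(np.float32)
tile_max = depth_buffer.reshape(64, 16, 64, 16).max(axis=(1, 3))

def tile_fully_occluded(tile_row, tile_col, incoming_min_depth):
    # If the nearest incoming fragment is still behind the farthest stored
    # depth in the tile, every per-pixel z-test in the tile would fail, so
    # all 256 pixels (1024 samples at 4x AA) can be discarded at once.
    return incoming_min_depth >= tile_max[tile_row, tile_col]

print(tile_fully_occluded(0, 0, incoming_min_depth=0.999))
```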

What 3Dlabs calls the Pixel Array is the circuitry that takes care of AA, compositing, color conversion, filtering, and everything else that might need to happen to the final image before scan out. This is similar in function, for example, to the ROP in an NVIDIA part. The Wildcat Realizm GPU is capable of outputting multiple formats, from traditional integer formats to fp16 data. This means that it can handle things like 10-bit alpha blending, or a 10-bit LUT for monitors that support it. On the basic hardware level, 3Dlabs defines this block as a 16x 16-bit floating point SIMD array. This means that 4 fp16 RGBA pixels can be processed by the Pixel Array simultaneously. In fact, all the programmable processing parts of the Wildcat Realizm are referred to in terms of SIMD arrays. This left us with a little math to compute based on the following information and this layout shot:

Vertex Array: 16-way, high-accuracy 36-bit float SIMD array, DX9 VS 2.0 capable
Fragment Array: 48-way, 32-bit float component SIMD array, DX9 PS 3.0 capable



For the Vertex Array, we have a 16x 36-bit floating point SIMD array. Since there are two physical Vertex Shader Arrays in the layout of the chip as shown above, and we are talking about SIMD (single instruction, multiple data) arrays, it stands to reason that each array can handle at most 8x 36-bit components. It's not likely that this is one large 8-wide SIMD block. If 3Dlabs followed ATI and NVIDIA, they could organize this as one 6-wide unit and one 2-wide unit, executing two vec3 operations on one SIMD block and two scalar operations on the other. This would give them two "vertex pipelines" per array. Without knowing the granularity of the SIMD units, or how the driver manages allocating resources among the units, it's hard to say exactly how much work can get done per clock cycle. Also, as there are two physically separate vertex arrays, each half of the vertex processing block likely shares resources like registers and caches. It is important to note that the vertex engine here is 36 bits wide. The extra 4 bits, which are above and beyond what anyone else offers, actually deliver 32 bits of accuracy in the final vertex calculation. Performing operations at the same accuracy level as the stored data essentially builds a level of noise into the result, because intermediate results of calculations are truncated to the accuracy of the stored data. This is a desirable feature for maintaining high precision vertex accuracy, but we haven't been able to come up with a real world application that pushes other parts to a place where 32-bit vertex hardware breaks down and the 36-bit hardware is able to show a tangible advantage.
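Python has no 36-bit float, but the effect of wider intermediates is easy to demonstrate by letting fp32 stand in for the storage format and fp64 for the wider internal precision. A minimal sketch of the principle under those stand-in assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(10000).astype(np.float32)
b = rng.standard_normal(10000).astype(np.float32)

# Every intermediate truncated to storage precision (fp32): rounding
# error accumulates across the whole chain of operations.
acc_narrow = np.float32(0.0)
for x, y in zip(a, b):
    acc_narrow = np.float32(acc_narrow + np.float32(x) * np.float32(y))

# Intermediates carried at a wider precision (fp64 here), rounded to
# the storage format only once at the end.
acc_wide = np.float32(np.dot(a.astype(np.float64), b.astype(np.float64)))

print(acc_narrow, acc_wide)  # the narrow accumulator drifts from the wide one
```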

The big step for vertex hardware accuracy will need to be 64-bit. CAD/CAM applications keep a database of 64-bit values. These double precision values are very important for manufacturing, but currently, graphics hardware isn't robust enough to display anything but single precision floating point data, and already-high transistor counts would become unmanageable with current fab technology.

Other notable features of the vertex pipeline of the Wildcat Realizm products include support for 32 fixed function hardware lights, and VS 2.0 level functionality with a maximum of 1000 instructions. Not supporting VS 3.0 while implementing full PS 3.0 support is an interesting choice for 3Dlabs. Right now, fixed function support is more important to the CAD/CAM market and arguably more important to the workstation market overall. But geometry instancing could really help geometry limited applications when working with scenes full of multiple objects, and vertex textures might also be useful in the DCC market. Application support does push hardware vendors to include and exclude certain features, but we really would have liked to see full SM 3.0 support in the Wildcat Realizm product line.

The Fragment Array consists of three physical blocks that make up a 48-way 32-bit floating point SIMD array. This means that we see 16x 32-bit components being processed in each of the three physical blocks at any given time. What we can deduce from this is that each of the three physical blocks shares common register and cache resources and very likely operates on four pixels with strong locality of reference at a time. It's possible that 3Dlabs could organize the distribution of pixels over their pipeline in a similar manner to either ATI or NVIDIA, but we don't have enough information to determine what's going on at a higher level. We also can't say how granular 3Dlabs' SIMD arrays are, which means that we don't know just how much work they can get done per clock cycle. In the worst case, the Wildcat Realizm is equipped with 4x 4-wide SIMD units per physical fragment block; operating on one component at a time would then leave 3 components idle while waiting for the fourth. It's much more likely that they implemented a combination of smaller units and are able to divide the work among them, as both ATI and NVIDIA have done. We know that 3Dlabs' units are all vector units, which means that we are limited to combinations of 2-, 3-, and 4-wide vector blocks.

Unfortunately, without more information, we can't draw conclusions on DX9 co-issue or dual-issue per pipeline capabilities of the part. No matter what resources are under the hood, it's up to the compiler and driver to handle the allocation of everything that the GPU offers to the GLSL or HLSL code running on it. On a side note, it is a positive sign to see a third company confirm the major design decisions that we have seen both ATI and NVIDIA make in their hardware. With the 3Dlabs Wildcat Realizm 200 coming to market as a 4 vertex/12 pixel pipe architecture, it will certainly be exciting to see how things stack up in the final analysis.
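The pipe counts quoted above fall straight out of 3Dlabs' lane figures; spelled out as back-of-the-envelope arithmetic (our deduction from the published numbers, not an official specification):

```python
COMPONENTS = 4                  # x/y/z/w for vertices, R/G/B/A for pixels

vertex_lanes = 16               # "16-way" 36-bit vertex SIMD array
fragment_lanes = 48             # "48-way" 32-bit fragment SIMD array

vertex_pipes = vertex_lanes // COMPONENTS      # -> 4 vertex pipes
pixel_pipes = fragment_lanes // COMPONENTS     # -> 12 pixel pipes
print(vertex_pipes, pixel_pipes)
```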

The Fragment Array supports PS 3.0 and shader programs up to 256,000 instructions in length. It also supports 32 textures in one clock. This, again, appears to be a total limitation per clock; the compiler would likely distribute resources as needed. If we look at this in the same way that we look at the ATI or NVIDIA hardware, we see that we can access about 2.6 textures per pixel. This could also translate to 2/3 of the components loading their own texture if needed, and if the driver supported it. 3Dlabs also talks about the Wildcat Realizm's ability to handle "cascaded dependent textures" in shaders. Gracefully handling a large number of dependent textures is useful, but it will bring shader hardware to a crawl. It's unclear how many depth/stencil operations the Realizm is capable of in one pass, but there is a separate functional block for such processing shown on the image above.

One very important thing to note is that the Wildcat Realizm calculates all pixel data in fp32, but stores pixel data in an fp16 format (1 sign, 5 exponent, and 10 mantissa bits). This has the effect of increasing effective memory bandwidth over what it would be with fp32 storage, while decreasing accuracy. The percentage by which precision is decreased depends on the data being processed and the algorithms used. The fp32->fp16 and fp16->fp32 conversions are all done with zero performance impact between the GPU and memory. It's very difficult to test the accuracy of the fragment engine. We've heard from 3Dlabs that it comes out somewhere near 30-bit accuracy, and it very well could. We would still like to see an empirical test that could determine the absolute accuracy for a handful of common shaders before we sign off on anything. This is at least an interesting solution to the problem of balancing a 32-bit and 16-bit solution. Moving all that pixel data can really impact bandwidth in shader intensive applications, and we saw just how hard the impact can be with NVIDIA's original 32-bit NV30 architecture. We've heard that it is possible to turn off the 16-bit storage feature and have the Realizm GPU store full 32-bit precision data, but we have yet to see this option in the driver or as a tweak. Obviously, we would like to get our hands on a switch like this to evaluate both the impact on performance and image quality.
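The cost of the fp16 storage step is easy to quantify for any given value with numpy's half-precision type, which uses the same 16-bit float layout. A sketch of the round-trip (illustrative only; it models the storage format, not the Realizm's conversion hardware):

```python
import numpy as np

# Shade at fp32, store at fp16, read back as fp32 for the next pass.
shaded = np.array([0.1234567, 1.5, 3.1415927, 100.03125], dtype=np.float32)
stored = shaded.astype(np.float16)       # framebuffer write: half the bytes
readback = stored.astype(np.float32)     # conversion back on read

rel_error = np.abs(readback - shaded) / np.abs(shaded)
print(stored.nbytes / shaded.nbytes)     # 0.5 -> half the memory traffic
print(rel_error.max())                   # worst case around 0.05% (fp16 epsilon)
```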

Another important feature to mention about the Wildcat Realizm is its virtual memory support. The Realizm supports 16GB of virtual memory. This allows big memory applications to swap pages out of local graphics memory to system RAM if needed. On the desktop side, this isn't something for which we've seen a real demand or need, but workstation parts definitely benefit from it. Speed is absolutely useful, but more important than speed is actually being able to visualize a necessary data set; there are data sets and workloads that just can't fit in local framebuffers. The Wildcat Realizm 200 has 512 MB of local memory, but if needed, it could have paged out up to the size of our free idle system memory. The 3Dlabs virtualization system doesn't support paging to disk and isn't managed by Windows, but by the 3Dlabs driver. Despite the limitations of the implementation, the solution is elegant, necessary, and extraordinarily useful to those who need massive amounts of RAM available to their visualization system.
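A toy model makes the mechanism concrete. The replacement policy here is an assumption (3Dlabs hasn't published one; we simply use LRU), but the shape matches the description above: a fixed pool of local pages that spills to system RAM, never to disk.

```python
from collections import OrderedDict

class DriverPager:
    """Toy driver-managed GPU virtual memory (LRU policy assumed)."""

    def __init__(self, vram_page_capacity):
        self.vram = OrderedDict()   # page id -> data, least recently used first
        self.sysram = {}            # spill space; no disk backing
        self.capacity = vram_page_capacity

    def access(self, page, data=None):
        if page in self.vram:
            self.vram.move_to_end(page)         # refresh LRU position
            return self.vram[page]
        if page in self.sysram:
            data = self.sysram.pop(page)        # page back into local memory
        if len(self.vram) >= self.capacity:
            victim, victim_data = self.vram.popitem(last=False)
            self.sysram[victim] = victim_data   # evict LRU page to system RAM
        self.vram[page] = data
        return data

pager = DriverPager(vram_page_capacity=2)
pager.access("tex0", b"...")
pager.access("tex1", b"...")
pager.access("tex2", b"...")    # evicts tex0 to system RAM
```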

On the display side, the Wildcat Realizm 200 is designed to push the limits of 9MP panels, video walls, and multi-system displays. With dual 10-bit 400MHz RAMDACs, two dual-link DVI-I connections, and support for an optional Multiview kit with genlock and framelock capabilities, the Wildcat Realizm is built to drive vast numbers of pixels. The scope of this article is limited to single display 3D applications, but if there is demand, we may explore the capabilities of professional cards to drive extremely high resolutions and 2 or more monitors.

Even though it can be tough to sort through at times, this low level description of the hardware is nicer in some ways than what we get from ATI and NVIDIA, because we get a chance to see what the hardware is actually doing. The high level block diagram look that others provide can be very useful in understanding what a pipeline does, but it obfuscates the differences between respective implementations. We would love to have a combination of the low level physical description of hardware that 3Dlabs has given us and the high level descriptions that we get from ATI and NVIDIA. Of course, then we could go build our own GPUs and skip the middle man.




NVIDIA Quadro FX 4000 Technology

As with the ATI FireGL X3-256, NVIDIA's workstation core is based around its most recent consumer GPU. Unlike the top end offering from ATI, NVIDIA's highest end AGP workstation part is based on their highest end consumer level SKU. Thus, the Quadro FX 4000 has a pixel processing advantage over the offerings from ATI and 3Dlabs with its 16 pipeline design. This will give it a shading advantage, but in the high end workstation space, geometry throughput is still most important. Fragment and pixel level processing has less of an effect in the workstation market than in the consumer market, which is precisely the reason that last year's Quadro FX performed much better than its consumer level counterpart. As with the ATI FireGL X3-256, since we're testing the consumer level part as well, we'll take a look at the common architecture and then hit on the additional features that make the NV40GL true workstation silicon.



As we see the familiar pipeline laid out once again, we'll take a look at how NVIDIA defines each of these blocks, starting with the vertex pipeline and moving down. The VS 3.0 capable vertex pipelines are made up of MIMD (multiple instruction, multiple data) processing blocks. Up to two instructions can be issued per clock per unit, and NVIDIA claims that it is able to completely hide the latency of vertex texture fetches in the pipeline.



The side by side scalar and vector units allow multiple instructions to be performed on different parts of the vertex at a time, if necessary (this is called co-issue in DX9 terminology). The 6 vertex units of the NV40 give it more potential geometry power at a lower precision than the 3Dlabs part (on a component level, we're looking at a possible 24 32-bit components per clock). This does depend on the layout of 3Dlabs' SIMD arrays and their driver's ability to schedule code for them. There is no hardware imposed limit on the number of instructions that the vertex engine can handle, though current software limits shader length to 65k instructions.

Visibility is computed in much the same way as in the previous descriptions. The early/hierarchical z process finds blocks of pixels that are completely occluded and eliminates them from going through the pixel pipeline. Pixels that aren't clearly occluded travel in quads (blocks of four pixels in a square pattern on a surface) through the pixel pipelines. Each quad shares an L1 cache (which makes sense, as each quad should have a strong locality of reference). Each of the 16 pixel pipelines looks like this on the inside:



The two shader units inside each pixel pipeline are 4 wide and can handle dual-issue and co-issue of instructions. The easy way to look at this is that each pipeline can optimally handle executing two instructions on two pixels at a time (meaning that it can perform up to 8 32-bit operations per clock cycle). This is only when not loading a texture, as texture loading will supersede the operation of one of the shader units. The pixel units are able to support shaders with lengths up to 65k instructions. Since we are not told the exact nature of the hardware, it seems very likely that NVIDIA does some very complex resource management at the driver level and rotates texture loads and shader execution on a per quad basis. This would allow them to have less physical hardware than what software is able to "see" or make use of. To put it in perspective, if NVIDIA had all the physical processing units to brute force 8 32-bit floating point operations in each of the 16 pipelines per clock cycle, that would mean needing the power of 128x 32-bit floating point units divided among some number of SIMD blocks. This would be approximately 2.7 times the fragment hardware packed into the Wildcat Realizm 200 GPU. In the end, we suspect that NVIDIA shares more resources than what we know about; they just don't give us the detail down to the metal that we have with the 3Dlabs part. At the same time, knowing how the 3Dlabs driver manages some of its resources in more detail would help us understand its performance characteristics better as well.
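The 2.7x figure comes directly from the peak per-clock numbers quoted above; spelled out (peak-rate arithmetic only, not measured throughput):

```python
nv40_pipes = 16
peak_ops_per_pipe = 8    # two 4-wide shader units dual-issuing per clock

# fp32 units needed to brute-force that peak with dedicated hardware
nv40_fp32_units = nv40_pipes * peak_ops_per_pipe     # 128

realizm_fragment_lanes = 48    # 3Dlabs' 48-way fp32 fragment array
print(nv40_fp32_units / realizm_fragment_lanes)      # ~2.67x
```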

Moving on to the ROP pipelines, NVIDIA handles FSAA and z/stencil/color compression and rasterization here. During z only operations (such as in some shadowing and other depth only algorithms), the color portion of the ROP can handle z functionality. This means that the NV40GL is capable of 32 z/stencil operations per clock during depth only passes. This might not be as useful in the workstation segment as it is on the consumer side in games such as Doom 3.

The NVIDIA part also has the ability to support a 16-bit floating point framebuffer, like the Wildcat Realizm GPU, which gives it the same functionality in display capabilities. The Quadro FX 4000 supports two dual-link DVI-I connectors, though the board is not upgradeable to genlock and framelock support. There is a separate (more expensive) product called the Quadro FX 4000 SDI, which has one dual-link DVI-I connector and two SDI connectors for broadcast video, and which supports genlock. If there is demand, we may compare this product to the 3Dlabs solution (and other broadcast or large scale visualization solutions).

It's unclear whether or not this part has the video processor (recently dubbed PureVideo) of NV40 built into it as well. It's possible that this feature was left out to make room for some of the workstation specific elements of the NV40GL. What, exactly, are the enhancements that were added to NV40 that make the Quadro FX 4000 a better workstation part? Let's take a look.

Hardware antialiased lines and points are the first and oldest component supported by the Quadro line that hasn't been enabled in the GeForce series. The feature just isn't necessary for gamers or consumer level applications, as it is used specifically to smooth the drawing of lines and points in wireframe modes. Antialiasing individual primitives is much more accurate and cleaner than FSAA algorithms, and is very desirable in applications where wireframe mode is used the majority of the time (which includes most of the CAD/CAM/DCC world).

OpenGL logic operations are supported, which allows things like hardware XORs to combine elements in a scene. Logic operations are performed between the fragment shader and the framebuffer in the OpenGL pipeline and have programmatic control over how (and if) data makes it down the pipeline.
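The classic use is rubber-band selection: XOR is its own inverse, so drawing the same outline twice restores the pixels beneath it. A minimal sketch with PyOpenGL (our example; the underlying glEnable/glLogicOp calls are standard OpenGL, and it assumes a current GL context):

```python
from OpenGL.GL import (glEnable, glDisable, glLogicOp,
                       GL_COLOR_LOGIC_OP, GL_XOR)

def xor_draw(draw_outline):
    # Each fragment is XORed with the framebuffer pixel instead of
    # overwriting it, so calling this twice with the same outline
    # leaves the scene exactly as it was.
    glEnable(GL_COLOR_LOGIC_OP)
    glLogicOp(GL_XOR)
    draw_outline()
    glDisable(GL_COLOR_LOGIC_OP)
```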

The NV40GL supports 8 clip regions while NV40 only supports 1 clip region. The importance of having multiple clip regions is in accelerating 3D when overlapped by other windows. When a 3D window is clipped, the buffer can't be treated as one block in the frame buffer, but must be set up as multiple clip regions. On GeForce cards, when a 3D window needs to be broken up into multiple regions, the 3D can no longer be hardware accelerated. Though the name is similar, this is different than the also-supported hardware accelerated clip planes. In a 3D scene, a near clip plane defines the position beyond which geometry will be visible. Some applications allow the user to move or create clip planes to cut away parts of drawings and "look inside".

Memory management on the Quadro line is geared towards professional applications rather than towards games, though we aren't given much indication as to the differences in the algorithms used. The NV40GL is able to support things like quad-buffered stereo, which the NV40 is not capable of.

Two-sided lighting is supported in the fixed function pipeline on the Quadro FX 4000. Even though the GeForce 6 Series supports two-sided lighting through SM 3.0, professional applications do not normally implement lighting via shader programs yet. It's much easier and more straightforward to use the fixed function path to create lights, and hardware accelerated two-sided lighting is a nice feature to have for these applications.
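That ease is literal: in OpenGL's fixed function path, two-sided lighting is one state change. A minimal sketch with PyOpenGL (our example, not vendor code; assumes a current GL context):

```python
from OpenGL.GL import (glEnable, glLightModeli,
                       GL_LIGHTING, GL_LIGHT0, GL_LIGHT_MODEL_TWO_SIDE, GL_TRUE)

# With two-sided lighting enabled, back faces are lit using mirrored
# normals instead of rendering dark, which is what you want when a
# cut-away exposes the inside of a model.
glEnable(GL_LIGHTING)
glEnable(GL_LIGHT0)
glLightModeli(GL_LIGHT_MODEL_TWO_SIDE, GL_TRUE)
```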

Overlay planes are supported in hardware as well. There are a couple of different options for the type of overlay plane to allow, but the idea is to have a lower memory footprint (usually 8-bit) transparent layer rendered above the 3D scene in order to support things like pop-up windows or selection effects without clipping or drawing into the actual scene itself. This can significantly improve performance for applications and hardware that support its use.

Driver optimizations are also geared specifically towards each professional application that the user may want to run with the Quadro. Different overlay modes or other settings may be optimal for a different application. In addition, OpenGL, stability, and image quality are the most important aspects of driver development on the Quadro side.




ATI FireGL X3-256 Technology

The FireGL X3-256 is based on ATI's R420 architecture. While this isn't a surprise, it is interesting that the highest end AGP offering that ATI has on the table is based on the X800 Pro. On the PCI Express side, ATI is offering a higher performance part, but for now, the FireGL on AGP is a little more limited than on PCI Express. When we tackle the PCI Express workstation market, we'll bring out a clearer picture of how ATI's highest end workstation component stacks up against the rest of the competition. As the ATI part isn't positioned as an ultra high end workstation solution, we'll be focusing more on price/performance. Unfortunately for ATI, the street price of the 3Dlabs Wildcat Realizm 200 comes in at just about the same as the FireGL X3-256 while being targeted at a higher performance point. But we'll have to see how that pans out once we've taken a look at the numbers. For now, let's pop open the hood on the ATI FireGL X3-256.

We will start out with the vertex pipeline as we did with the NVIDIA part. The overall flow of data is very similar to the Quadro, except, of course, that the ATI part runs with 12 pixel pipelines rather than 16. The internals are the differentiating factor.



We can see that the ATI vertex engine supports the parallel operation of a 4-wide 32-bit vector unit and a 32-bit scalar unit. This allows the same type of operation that the NVIDIA GPU supports, but the FireGL lacks VS 3.0 capabilities and support for vertex textures. Interestingly, in the documents that list the features of the FireGL X3, we see that "Full DX9 vertex shader support with 4 vertex units" is mentioned in addition to its "6 geometry engines". This indicates that 2 of the geometry engines don't handle full DX9 functionality. This isn't of as much importance in a workstation part, as the fixed function path will be stressed more often, but it's worth noting that this core is based on the desktop part and we didn't pick up this information from any of our desktop briefings or data sheets.

The FireGL X3-256 employs the same HyperZ HD engine that the Radeon uses, which combines early/hierarchical z hardware with a z/stencil cache and z compression. The hierarchical z engine looks at tiles of pixels (in the case of the FireGL, 16x16 blocks), and if the entire block is occluded, up to 256 pixels can be eliminated in one clock. These pixels never need to touch the fragment/pixel processing hardware, which saves a lot of processing power. When we look at the pixel engine, we can see that ATI divides their pixels into "quad" pipes as well, but NVIDIA and ATI define a quad slightly differently. On ATI hardware, data out of setup is tiled into those 16x16 blocks for the hierarchical z pass, and it's these blocks over which each quad pipe divides its efforts.



Inside each of the pixel pipes, we have something that also looks similar to the NVIDIA architecture. It is possible for ATI to complete two vec3 operations and two scalar operations in combination with a texture operation every clock cycle. This is what the hardware ends up looking like:



Since the texture unit does not share hardware with either of the shader math units, ATI is theoretically able to handle more math per clock cycle in its pixel shaders than NVIDIA. The 3 + 1 arrangement is not as flexible as NVIDIA's, however, as NVIDIA is also capable of handling 2-wide vector + 2-wide vector co-issue.

With only PS 2.0 support, ATI's architecture is not as robust as either NVIDIA's or 3Dlabs'. The FireGL can only support between 512 and 1536 shader instructions depending on the conditions, and it uses fp24 for processing. The Radeon architecture has traditionally favored DirectX over OpenGL, so we will be very interested to see where these predominantly OpenGL benchmarks end up.

As far as rasterization is concerned, ATI does not support any floating point framebuffer display types. The highest accuracy framebuffer that the FireGL X3-256 supports is a 10-bit integer format, which is good enough for many applications today. As with both 3Dlabs' and NVIDIA's parts, the FireGL X3-256 includes dual 10-bit RAMDACs and 2 dual-link DVI-I connections allowing support of up to 9MP displays. Unlike the Wildcat Realizm and Quadro FX lines, there is no way to get any sort of genlock, framelock, or SDI output support for the FireGL line. This puts ATI behind when it comes to video editing, video walls, multi-system displays, and broadcast solutions.

The added features that ATI's FireGL X3-256 supports beyond the Radeon include:
  • Anti-aliased points and lines - Lines and points are smoothed as they're drawn in wireframe mode. This is much higher quality and faster than FSAA when used for wireframe graphics, and is of the utmost importance to designers who use workstations for wireframe manipulation (the majority of the 3D workstation market).
  • Two-sided lighting - In the fixed function pipeline, enabling two-sided lighting allows hardware lights to illuminate both sides of an object. This is useful for viewing cut-away objects. SM 3.0 supports two-sided lighting registers for programmable shaders, but these don't apply to the fixed function light sources.
  • OpenGL overlay planes - Overlays are useful for adding to a 3D accelerated viewport without making the buffer dirty. This can significantly speed up things like displaying pop-up windows or selection highlights in 3D applications.
  • 6 user defined clip planes - User defined clip planes allow the cutting away of surfaces in order to look inside objects in applications that support their creation.
  • Quad-buffered stereo 3D support - This enables smooth real-time stereoscopic image output by supporting a front-left, back-left, front-right, and back-right buffer for display.
Undoubtedly, the FireGL line also features a different memory management setup and driver development focuses more heavily on OpenGL and stability. This is quite a different market than the consumer side, but ATI has quite a solid offering with the strength of the FireGL X3-256. Of course, we would rather see a 16-pipeline part, but we'll have to wait until we evaluate PCI Express graphics workstations for that.




The Cards

We are focusing on 3 workstation cards today, and we will also be using 2 ultra high end desktop cards for reference. Here's the lineup of workstation parts that we will be looking at:

AGP Workstation Graphics Contenders

| | 3Dlabs Wildcat Realizm 200 | ATI FireGL X3-256 | NVIDIA Quadro FX 4000 |
|---|---|---|---|
| Street Price | ~$860 | ~$875 | ~$1600 |
| Memory Size/Type | 512MB GDDR3 | 256MB GDDR3 | 256MB GDDR3 |
| Memory Bus | 256-bit | 256-bit | 256-bit |
| Memory Clock | 500MHz | 450MHz | 500MHz |
| Core Clock | ? | 490MHz | 375MHz |
| Vertex Pipes | 4 | 6 (4 full DX9) | 6 |
| Vertex Processing | 36-bit | 32-bit | 32-bit |
| Pixel Pipes | 12 | 12 | 16 |
| Pixel Processing | 32-bit / 16-bit storage | 24-bit | 32-bit / 16-bit selectable |
| Shader Model Support | VS 2.0 / PS 3.0 | SM 2.0 | SM 3.0 |
| 2x Dual-Link DVI | Yes | Yes | Yes |
| Stereo 3D | Yes | Yes | Yes |
| Genlock/Framelock | Multiview upgrade | No | SDI version |

Clearly, the competition is well matched, with the exception of price. Of course, we don't know the core clock speed of the 3Dlabs part, but recalling the architectural descriptions, looking at the head-to-head numbers, and noting the prices of the parts, the tests should prove to be very interesting. Street prices came from looking at what Google and Pricewatch had to offer. Here are images of the workstation cards that we'll be testing:



3Dlabs' Wildcat Realizm 200 High End AGP Workstation card



ATI's FireGL X3-256 AGP Workstation Card



NVIDIA's Quadro FX 4000 High End AGP Workstation Card


The two consumer level cards against which we chose to compare our workstation parts were the most powerful parts that we could dig up in our lab from the ATI and NVIDIA camps.

It used to be that NVIDIA built its workstation components by simply overclocking its desktop part and enabling hardware point and line antialiasing. We can see with the most recent Quadro part that this philosophy has changed quite a bit. Unfortunately, we aren't able to force the GeForce 6 Series to hardware accelerate antialiased wireframe drawing, but the 6800 Ultra Golden Limited (is that name a descriptor or what?) that we received from Prolink does a very solid job of pushing data through. It has a 25MHz higher core clock than the 6800 Ultra, and it sports 1.15 GHz data rate GDDR3 memory.



Prolink's GeForce 6800 Ultra Golden Limited


From the ATI camp, we employed the ever faithful HIS. Always the best choice in ATI overclocking, HIS shipped us one of their IceQ II series parts. We had planned to do a roundup, but unfortunately, they were the only vendor to actually ship us a Radeon X800 XT Platinum Edition. Luckily, the once out-of-work X800 XT PE has recently been re-adopted by ATI in the form of the X850 XT. They may be "different silicon", but to the end user, they are the same part. And so, we're testing with the HIS Radeon X800 XT PE IceQ II both on its legacy as a failed launch product and as a glimpse at how the X850 XT will run workstation applications (provided ATI actually does get their cards out in January as they said they would). Either way, this HIS part is the highest performing ATI based solution that we could dig up to compete with our workstation components.



HIS' Radeon X800 XT PE IceQ II





The Test

The test system is an IWill DK8N board powered by a 520W OCZ Powerstream PSU. The Dual Opteron 250 system had 2 GB of RAM (1 GB for each processor), and although the board supports NUMA, the feature was not enabled for this test. The IWill motherboard is simply an amazing workstation platform. It can handle up to 16GB of RAM, is loaded with PCI-X slots, and is jam-packed with features. Since the DK8N is a hybrid AMD chipset and nForce 3 motherboard, IWill is able to bring workstation users the best of the DP world and the desktop world in one package.

The dual configuration helps to keep the majority of the load on the graphics card in our testing. It may be interesting to experiment with single, dual and quad processor workstation scaling in the future. For now, this box will work beautifully for our tests.

The drivers that we chose to use for our workstation graphics cards were all beta or pre-release drivers, which each vendor assures us pass internal QA as far as image quality is concerned. NVIDIA sees the most performance improvement when moving from their 6x.xx series driver to the 70.41 series driver. In fact, when SPECviewperf 8 was launched in September, 3Dlabs' Wildcat Realizm 200 cards led performance in 7 out of 8 tests. The performance trends are quite different in today's lineup, as NVIDIA's driver team has done quite well to gain performance in professional level applications on the 6 Series architecture with the 7x.xx series driver. Of course, this makes us very interested in revisiting this test with a GeForce card when we have a 70 series ForceWare driver available.

Performance Test Configuration
Processor(s): 2 x AMD Opteron 250
RAM: 4 x 512MB OCZ PC3200 EL ECC Registered (2 per CPU)
Hard Drive(s): Seagate 120GB 7200RPM IDE (8MB Buffer)
Motherboard & IDE Bus Master Drivers: AMD 8131 APIC Driver
NVIDIA nForce 5.10
Video Card(s): 3Dlabs Wildcat Realizm 200
ATI FireGL X3-256
NVIDIA Quadro FX 4000
HIS Radeon X800 XT Platinum Edition IceQ II
Prolink GeForce 6800 Ultra Golden Limited
Video Drivers: 3Dlabs 4.04.0608 Driver
ATI FireGL 8.08-041111a-019501E Performance Driver
NVIDIA Quadro 70.41 (Beta)
NVIDIA ForceWare 67.03 (Beta)
ATI Catalyst 4.12
Operating System(s): Windows XP Professional SP2 (without PAE kernel)
Motherboards: IWill DK8N v1.0 (AMD-81xx + NVIDIA nForce 3)
Power Supply: 520W OCZ Powerstream PSU

And to power our monster of a system, we needed a PSU that could deliver the juice. Once again, we turned to our OCZ Powerstream PSU. Even with 2 Opteron 250s, a GeForce 6800 Ultra, 2GB of RAM, and a couple of drives attached, the OCZ power supply had no problem keeping our machine fed. More importantly, the modular connectors allow us to hook up our PSU to a standard 20-pin ATX connector, the 24-pin ATX12V that 915/925/nForce4 boards use, or the 24-pin EPS12V that most workstation boards require.

We chose to run with a desktop resolution of 1280x1024x32 @85Hz. All the Windows XP eye candy was turned off and tuned for performance. Our virtual memory pagefile was set to 4092MB min and max, and system restore was turned off. After all applications were installed and all benchmarks were run once, the system was defragmented.




SPECViewperf 8.0.1 Performance

For our SPECviewperf tests, we will look at graphs of the overall weighted scores for each viewset. We have also listed the scores for the individual tests in each viewset. For each section, we will begin by listing the description of the viewset from the SPEC website, and then analyzing the data.

For further details on how SPECviewperf scores are compiled, please see the SPEC website.

All cards were set to their default professional graphics settings, whatever those happened to be, before running SPECviewperf.

3dsmax Viewset (3dsmax-03)

"The 3dsmax-03 viewset was created from traces of the graphics workload generated by 3ds max 3.1. To ensure a common comparison point, the OpenGL plug-in driver from Discreet was used during tracing.

The models for this viewset came from the SPECapc 3ds max 3.1 benchmark. Each model was measured with two different lighting models to reflect a range of potential 3ds max users. The high-complexity model uses five to seven positional lights as defined by the SPECapc benchmark and reflects how a high-end user would work with 3ds max. The medium-complexity lighting models use two positional lights, a more common lighting environment.

The viewset is based on a trace of the running application and includes all the state changes found during normal 3ds max operation. Immediate-mode OpenGL calls are used to transfer data to the graphics subsystem."
The Wildcat Realizm 200 leads in this benchmark, followed by the Quadro FX 4000. This is quite interesting, since by street price, the Realizm is the cheapest card that we have in the bunch. ATI's MSRP is lower than the 3Dlabs part, but value is all in purchasing power and performance.

The DCC side of the workstation market isn't as large as CAD/CAM, but 3DStudio is still an important application. Of course, this viewset tests straight OpenGL performance, and each of these workstation cards has a custom driver for 3DStudio Max, which we will test in the SPECapc portion of our benchmark suite.

SPECviewperf 8.0.1

SPECviewperf 8.0.1 3dsmax-03
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Quadro FX 4000 38.6 38.5 21.8 18.8 75 53.7 40.6 30.6 17.2 70.3 25.3 26.2 21 20.1
Wildcat Realizm 200 38.6 38 30.2 24.3 74.1 62.1 43.5 40.4 28.3 68.7 24.1 24.1 18.3 18.3
FireGL X3-256 32.2 27.3 23.4 18.5 62.8 51 28.9 31.9 18.6 66.6 20.5 20.6 16 16.1
GeForce 6800 U GC 17.5 15.9 14.8 14.7 42.4 29.2 27.1 19.1 14.9 40.8 14.2 14.1 11.5 11.2
Radeon X800 XTPE 12 8.09 12.5 10.7 35.9 23.6 11.6 15.8 10 32.7 11.4 6.37 9.3 7.86

CATIA Viewset (catia-01)

"The catia-01 viewset was created from traces of the graphics workload generated by the CATIATM V5R12 application from Dassault Systemes.

Three models are measured using various modes in CATIA. Phil Harris of LionHeart Solutions, developer of CATBench2003, supplied SPEC/GPC with the models used to measure the CATIA application. The models are courtesy of CATBench2003 and CATIA Community.

The car model contains more than two million points. SPECviewperf replicates the geometry represented by the smaller engine block and submarine models to increase complexity and decrease frame rates. After replication, these models contain 1.2 million vertices (engine block) and 1.8 million vertices (submarine).

State changes as made by the application are included throughout the rendering of the model, including matrix, material, light and line-stipple changes. All state changes are derived from a trace of the running application. The state changes put considerably more stress on graphics subsystems than the simple geometry dumps found in older SPECviewperf viewsets.

Mirroring the application, draw arrays are used for some tests and immediate mode used for others."
The state change information found in the CATIA viewset is a big part of the improvement in SPECviewperf 8. The nature of OpenGL as a state machine is the major advantage of the API in workstation applications; the viewing and manipulation of state are key elements in workstation graphics today. Taking a look at performance here, we see the Quadro FX 4000 taking the lead, followed by the Realizm 200.

SPECviewperf 8.0.1

SPECviewperf 8.0.1 catia-01
1 2 3 4 5 6 7 8 9 10 11
Quadro FX 4000 48.9 37.2 26.9 27.9 19 47.8 29.6 28.9 19 23.2 46.1
Wildcat Realizm 200 52.1 38.6 15.8 29.7 18.9 63.3 32.5 27.6 22 20.6 13.7
FireGL X3-256 39.7 25.8 19.1 22.8 16.2 34 25.2 18.8 15 19.1 34.8
GeForce 6800 U GC 20.2 22.5 11.5 12.6 8.9 31.1 17.2 4.49 11.2 12.7 27.2
Radeon X800 XTPE 20.9 10.3 10.9 11.7 7.38 26.7 13.1 12.5 8.86 7.41 21

EnSight (ensight-01)

"The ensight-01 viewset replaces the Data Explorer (dx) viewset. It represents engineering and scientific visualization workloads created from traces of CEI's EnSight application.

CEI contributed the models and suggested workloads. Various modes of the EnSight application are tested using both display-list and immediate-mode paths through the OpenGL API. The model data is replicated by SPECviewperf 8.0 to generate 3.2 million vertices per frame.

State changes as made by the application are included throughout the rendering of the model, including matrix, material, light and line-stipple changes. All state changes are derived from a trace of the running application. The state changes put considerably more stress on graphics subsystems than the simple geometry dumps found in older viewsets.
Mirroring the application, both immediate-mode and display-list modes are measured."
This time around, we see the FireGL X3-256 jump to the head of the pack. Even though the X3-256 has trailed in the previous tests, it has managed to clearly lead the EnSight performance tests.

SPECviewperf 8.0.1

Six out of the nine tests are led by the FireGL X3. Interestingly, in tests 5, 6, and 8, the Realizm 200 leads both the FireGL X3-256 and the Quadro FX 4000. Test 1 has a weight of 12, while the following 8 tests have a weight of 11 each (see the sketch after the table for how these weights roll up into a composite score).

SPECviewperf 8.0.1 ensight-01
1 2 3 4 5 6 7 8 9
Quadro FX 4000 42.7 35.4 35.6 31.1 13.7 13.6 31 13.8 31
Wildcat Realizm 200 24.5 20.9 26.7 23 16 15.9 23 16 23
FireGL X3-256 53.6 50 53.7 44.6 12.3 12.3 44.7 12.3 44.6
GeForce 6800 U GC 11.6 3.98 39.3 34.1 7.22 7.23 34 7.26 34
Radeon X800 XTPE 14.1 54.4 45.5 34.8 5.84 5.78 34.8 5.8 34.8
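To show how the composite can hide per-test detail, here's a sketch of a weighted geometric mean computed from the table above with the stated weights (12 for test 1, 11 for the rest). This mirrors the standard weighted geomean formula; see SPEC's documentation for the official scoring method.

```python
import math

def weighted_geomean(scores, weights):
    total = sum(weights)
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)) / total)

weights = [12] + [11] * 8    # ensight-01 test weights quoted above
firegl  = [53.6, 50, 53.7, 44.6, 12.3, 12.3, 44.7, 12.3, 44.6]
realizm = [24.5, 20.9, 26.7, 23, 16, 15.9, 23, 16, 23]

# The FireGL's big wins in the heavier tests outweigh its losses in 5, 6, and 8.
print(weighted_geomean(firegl, weights))    # ~31
print(weighted_geomean(realizm, weights))   # ~21
```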

Lightscape Viewset (light-07)

"The light-07 viewset was created from traces of the graphics workload generated by the Lightscape Visualization System from Discreet Logic. Lightscape combines proprietary radiosity algorithms with a physically-based lighting interface.

The most significant feature of Lightscape is its ability to accurately simulate global illumination effects by pre-calculating the diffuse energy distribution in an environment and storing the lighting distribution as part of the 3D model. The resulting lighting "mesh" can then be rapidly displayed."
The Quadro FX 4000 outperforms the Wildcat Realizm by about 3.4% in the Lightscape test. This time, the FireGL X3-256 is back where we would expect based on ATI's positioning of the product.

SPECviewperf 8.0.1

SPECviewperf 8.0.1 light-07
1 2 3 4 5
Quadro FX 4000 23.7 37.5 18.1 13.8 27.8
Wildcat Realizm 200 25.4 44.3 14.1 12.8 25.8
FireGL X3-256 24.8 44.5 12.5 11.6 24.2
GeForce 6800 U GC 13.2 24.2 9.06 6.75 16
Radeon X800 XTPE 12.1 22.6 7.96 5.89 14.2

Maya Viewset (maya-01)

"The maya-01 viewset was created from traces of the graphics workload generated by the Maya V5 application from Alias.

The models used in the tests were contributed by artists at NVIDIA. Various modes in the Maya application are measured.

State changes as made by the application are included throughout the rendering of the model, including matrix, material, light and line-stipple changes. All state changes are derived from a trace of the running application. The state changes put considerably more stress on graphics subsystems than the simple geometry dumps found in older viewsets.
As in the Maya V5 application, array element is used to transfer data through the OpenGL API."
The clear leader in this test is the Quadro FX 4000. The NVIDIA part leads every test under Maya by a large margin. The differences between each card are very large all the way down to the X800 XT PE.

SPECviewperf 8.0.1

SPECviewperf 8.0.1 maya-01
1 2 3 4 5 6 7 8 9
Quadro FX 4000 152 52.5 37.2 42.6 27.6 100 82.3 82.7 49.6
Wildcat Realizm 200 126 41.1 30.8 33.3 17.7 96.3 79.8 76.8 44.7
FireGL X3-256 110 33.3 23.2 29.2 20.6 82 51.1 63.7 37.2
GeForce 6800 U GC 54 24.5 16 12.8 10 52 39.5 37.4 23.7
Radeon X800 XTPE 25.2 14.9 9.02 8.03 7.2 36.1 26.7 25 16.3

Pro/ENGINEER (proe-03)

"The proe-03 viewset was created from traces of the graphics workload generated by the Pro/ENGINEER 2001TM application from PTC.

Two models and three rendering modes are measured during the test. PTC contributed the models to SPEC for use in measurement of the Pro/ENGINEER application. The first of the models, the PTC World Car, represents a large-model workload composed of 3.9 to 5.9 million vertices. This model is measured in shaded, hidden-line removal, and wireframe modes. The wireframe workloads are measured both in normal and antialiased mode. The second model is a copier. It is a medium-sized model made up of 485,000 to 1.6 million vertices. Shaded and hidden-line-removal modes were measured for this model.

This viewset includes state changes as made by the application throughout the rendering of the model, including matrix, material, light and line-stipple changes. The PTC World Car shaded frames include more than 100MB of state and vertex information per frame. All state changes are derived from a trace of the running application. The state changes put considerably more stress on graphics subsystems than the simple geometry dumps found in older viewsets.

Mirroring the application, draw arrays are used for the shaded tests and immediate mode is used for the wireframe. The gradient background used by the Pro/E application is also included to model the application workload better."
This is an absolutely huge test in terms of data and workload placed on the cards. The Quadro FX 4000 leads here, but the 3 workstation cards really are close and excel in different areas. Check out the table and see what we mean.

SPECviewperf 8.0.1

Tests 1, 2 and 3 are clearly led by the FireGL X3-256. Unfortunately for ATI, the rest of the tests fall very short. The Wildcat Realizm 200 is much closer to the Quadro FX 4000 than it appears in the overall result. It leads 3 out of 7 benchmarks and is close in two of its losses. Test number 6 really pushes the Quadro up in score. It is data like this that gets lost in geometric means sometimes.

SPECviewperf 8.0.1 proe-03
1 2 3 4 5 6 7
Quadro FX 4000 26.6 31 24.9 53.4 53.2 170 52.6
Wildcat Realizm 200 27.1 31.8 24.5 54.2 45.2 143 51.3
FireGL X3-256 38.3 45.9 31.7 38.4 36 135 42.9
GeForce 6800 U GC 11.2 13.1 15.3 33.5 8.55 88.5 32.7
Radeon X800 XTPE 7.33 8.63 9.05 22.7 22.5 44.2 23.8

SolidWorks Viewset (sw-01)

"The sw-01 viewset was created from traces of the graphics workload generated by the Solidworks 2004 application from Dassault Systemes.

The model and workloads used were contributed by Solidworks as part of the SPECapc for SolidWorks 2004 benchmark.

State changes as made by the application are included throughout the rendering of the model, including matrix, material, light and line-stipple changes. All state changes are derived from a trace of the running application. The state changes put considerably more stress on graphics subsystems than the simple geometry dumps found in older viewsets.

Mirroring the application, draw arrays are used for some tests and immediate mode used for others."
The Realizm 200 comes out on top again in this benchmark, followed by the Quadro FX 4000, with the ATI card bringing up the rear. Here, we see a situation similar to the proe test when we look at the table of data.

SPECviewperf 8.0.1

The Quadro FX 4000 leads in tests 2, 3, and 4. Test 7 is really what causes the major dichotomy in scoring between the two parts in the overall test. The Wildcat Realizm 200 is still the better part for the job under the SolidWorks viewset, but it's good to step back and take a look at the individual test data.

SPECviewperf 8.0.1 sw-01
1 2 3 4 5 6 7 8
Quadro FX 4000 34.4 12.4 14.2 17.2 40.5 28.9 49.3 21.3
Wildcat Realizm 200 43.4 9.76 12.6 15.5 48 32.7 103 22.6
FireGL X3-256 26.6 12.1 15.3 18.5 39.5 25.5 52.4 18.2
GeForce 6800 U GC 31.8 10.6 10.6 13.8 32.8 12.2 33.4 12.3
Radeon X800 XTPE 23.7 6.04 6.37 8.04 18.3 11.5 52.1 11.1

Unigraphics (ugs-04)

"The ugs-04 viewset was created from traces of the graphics workload generated by Unigraphics V17.

The engine model used was taken from the SPECapc for Unigraphics V17 application benchmark. Three rendering modes are measured: shaded, shaded with transparency, and wireframe. The wireframe workloads are measured both in normal and anti-alised mode. All tests are repeated twice, rotating once in the center of the screen and then moving about the frame to measure clipping performance.

The viewset is based on a trace of the running application and includes all the state changes found during normal Unigraphics operation. As with the application, OpenGL display lists are used to transfer data to the graphics subsystem. Thousands of display lists of varying sizes go into generating each frame of the model.

To increase model size and complexity, SPECviewperf 8.0 replicates the model two times more than the previous ugs-03 test."
The NVIDIA workstation solution is much better able to handle this workload. The huge models, shaded with transparency turned on, are easily rendered on the Quadro FX 4000.

SPECviewperf 8.0.1

SPECviewperf 8.0.1 ugs-04
1 2 3 4 5 6 7 8
Quadro FX 4000 30.1 34.7 27.8 31.2 51 62.2 47.5 49.8
Wildcat Realizm 200 24.3 26.8 22 23.2 38.8 31.4 30.2 26.7
FireGL X3-256 18.5 19.1 13.9 14.4 53.4 56.4 53.7 53.3
GeForce 6800 U GC 3.29 3.86 3.01 3.52 27.3 31.9 4.94 7.38
Radeon X800 XTPE 11.9 12 9.13 9.13 22.2 22.6 22.2 22.4

To sum up SPECviewperf 8.0.1 performance, we have the FireGL X3-256 leading one benchmark (ensight), the Wildcat Realizm 200 leading two benchmarks (3dsmax, sw), and the Quadro FX 4000 taking pole position in the other five tests.

Now, let's put the cards to the test inside applications and see if performance holds up.




SPECapc 3D Studio Max 6 SP1 Performance

We ran the SPECapc 3D Studio Max benchmark using both the built-in OpenGL driver for the workstation cards, as well as vendor supplied custom drivers. We made sure to follow SPEC compliant settings for each driver. Consumer level cards were run using the D3D driver set to DX9. The custom drivers improved performance and quality for each part that we tested, and we would recommend using the plugins over the default OpenGL driver if possible.

When we look at 3DStudio Max performance, the Quadro FX 4000 clearly comes out on top overall. With respect to wireframe graphics, the FireGL X3-256 has the upper hand, and the Wildcat Realizm 200 doesn't come close in anything but the Object Creation/Editing/Manipulation test. The default OpenGL driver running on the Wildcat Realizm 200 performs particularly poorly, coming in at or around the performance of the consumer level products. With both the custom and standard OpenGL drivers, the Realizm's transparency/opacity tests were slower than those of the other cards tested.

These benchmarks look quite different than the SPECviewperf 3dsmax test because we are focusing on custom driver performance under particular settings. For more information, see the SPEC website.

We do need to note one issue with the Wildcat Realizm custom driver under 3DStudio Max R6 when paired with the latest few drivers. Rotating and manipulating objects in certain ways results in the object flickering in the viewport. It doesn't look like double buffering is disabled, but the effect is reminiscent of single buffered graphics. It's possible that under certain conditions, the image is copied straight to the front buffer, but it seems like that would cause more performance and stability problems than it would solve. We can't really figure out what's going on, but we've asked Creative and 3Dlabs to look into the issue for us.

3DStudio Max 6 SPECapc




SPECapc Maya 5.01 Performance

The NVIDIA Quadro FX 4000 comes out on top once again in the Maya SPECapc test. The Realizm 200 is able to best the Quadro in the hand1.ma test, but all other tests go in favor of NVIDIA's parts.

Looking deeper into the Maya tests (which we have not listed here), we were able to see that the Quadro and Realizm perform just about dead evenly in smooth shaded operation under the Maya 5.0 SPECapc. Wireframe operation on both GPUs is also evenly matched in everything but the Insect.ma test (which NVIDIA led).

The performance factor that pushed everything over the top seems to be the way that NVIDIA is able to handle the selected and highlighted modes in Maya. The Quadro FX 4000 was able to beat its competitors every time in these tests.

Maya 5.1 SPECapc




AutoCAD 2004 C2001 Performance

The C2001 test is Copyright 1996 - 2001 by Art Liddle and CADALYST magazine. The benchmark loads different models into AutoCAD, manipulates them in wireframe and shaded modes, and then scores the graphics card based on internal performance metrics. We'd like to thank CADALYST for allowing us to use their benchmark.

The Wildcat Realizm 200 comes in low on the charts here. The FireGL and Quadro FX 4000 prove themselves to be better cards under AutoCAD 2004 as tested with the C2001 benchmark. Interestingly, the desktop cards perform rather well using AutoCAD 2004's default Heidi driver, with the desktop 6800 Ultra coming in second overall. The Wildcat Realizm 200 had trouble with the 3D wireframe mode. 2D performance characteristics seem almost reversed from 3D, but the differences are not as pronounced either.

AutoCAD 2004 Performance




Pro/ENGINEER Wildfire 2.0 OCUS Performance

While the OCUS benchmark is a very good test for medium-sized workloads with Pro/E Wildfire, something more like the SPECapc for Pro/E 2001 would push the large memory of the Realizm 200 a little harder. We may include the SPECapc test in future performance analyses as well. For more information on the OCUS benchmark, check out Olaf Corten's ProESite.

As the performance tests show, the Quadro FX 4000 takes less time to complete the benchmark than any other card in the test. The Realizm 200 is only 8 seconds behind, followed by the overclocked 6800 Ultra. The FireGL X3-256 falls behind in this benchmark, but the top two contenders are fairly evenly matched.

OCUS Benchmark Graphics Performance




Doom 3 Performance

Game engine performance is not particularly important for the sake of gaming on workstation cards. It is, however, important for game developers. We wanted to benchmark some game development applications such as RenderWare Studio. Unfortunately, we haven't been able to come up with a suitable benchmark for such an application yet.

As luck would have it, though, game developers usually end up using development tools that run the game engine for display. This makes game tests a useful proxy for the performance of game specific development tools. In other words, developers who adopt the Doom 3 engine will need to be running a system that can perform acceptably under the Doom 3 game engine if they want their development tools to run well.

And so, we bring you Doom 3 running on workstation graphics cards for the game development professional.

As expected, the consumer level cards outperform the workstation class cards in this test. The Quadro FX 4000 is able to push very acceptable frame rates under Doom 3 at 1600x1200. The ATI FireGL X3-256 gets by with about 30fps, but the Wildcat Realizm 200 performs abysmally at 5.4 fps. The picture quality is perfect, but the speed is horrendous.

We had hoped that the Realizm 200 would fare better in a game based around the OpenGL API. Doom 3 likely pushes the z/stencil capabilities of the Realizm 200 beyond what it can handle.

Doom 3 Performance




Half-Life 2 Performance

Continuing our look into game development performance, we take a look at Half-Life 2. This DX9 based game actually runs fine on all the cards that we tested. We expected there to be some visual issues with the Realizm 200 (based on what we will see in Shadermark performance on the next page), but everything looked exactly as it should.

Performance is still horrid for the 3Dlabs part. Anyone wanting to run a box that combines DCC and any in-engine game development tools should absolutely stick to either the ATI or NVIDIA workstation solutions.

Half-Life 2 Performance




Shader Analysis

To open this section, we would first like to state that we wish we could have found a suitable benchmark to test GLSL performance in the same way that Shadermark tests HLSL performance. The OpenGL fragment shading performance that we observed when running GLSL demos on the Wildcat Realizm part is much higher than its DirectX pixel shading performance under Shadermark. In fact, even in playing with ATI's own Rendermonkey, it was apparent that the 3Dlabs card handled GLSL shaders better than the FireGL X3. Since OpenGL is the language of the workstation, it makes sense that 3Dlabs would focus its efforts there first, while ATI's consumer oriented approach lends it the clear upper hand in DirectX HLSL benchmarks like Shadermark.
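
As an aside for readers unfamiliar with how GLSL reaches the hardware: the driver itself compiles shader source at run time, which is exactly where vendor effort shows up. The sketch below walks through the standard OpenGL 2.0 compile-and-link sequence with a trivial shader of our own invention (not one of the shaders discussed here); error handling is abbreviated.

    /* Minimal GLSL fragment shader setup through the OpenGL 2.0 entry
       points (on Windows, these are fetched at run time via the
       extension mechanism). The shader itself is a trivial example. */
    #include <stdio.h>
    #include <GL/gl.h>

    static const char *frag_src =
        "uniform vec4 base_color;\n"
        "void main() { gl_FragColor = base_color; }\n";

    GLuint build_fragment_program(void)
    {
        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &frag_src, NULL); /* hand source to the driver */
        glCompileShader(shader);   /* the driver compiles GLSL to native code */

        GLint ok;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof(log), NULL, log);
            fprintf(stderr, "GLSL compile failed: %s\n", log);
        }

        GLuint program = glCreateProgram();
        glAttachShader(program, shader);
        glLinkProgram(program);    /* resolve uniforms, finalize the program */
        return program;
    }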

But DirectX and HLSL still make for a very relevant test, and they are supported on all these cards. Of note is the fact that Shadermark would not run PS 3.0 code on the Wildcat Realizm. Shadermark chose to use the PS 2.0a path, which supports a wider range of features than the PS 2.0b path used for both of the ATI cards. Shadermark has been known to be very picky about the code paths that it runs, and it's possible that the issue is simply that this 3Dlabs part is not on the Shadermark map. But part of the point of HLSL is that the code should still run with no problems. We did get the option of creating an A2R10G10B10 hardware device on the Wildcat Realizm in Shadermark, where no other card presented such a feature. But let's take a look at what the numbers have to say.

Shadermark v2.1 Performance Chart (higher is better; a dash indicates the shader did not run on that card)

Shader      GeForce 6800U   Quadro FX 4000   Radeon X850XT   FireGL X3-256   Realizm 200
shader 2    893             596              996             731             41
shader 3    736             493              735             531             28
shader 4    737             493              732             531             28
shader 5    669             448              608             438             16
shader 6    680             467              735             530             28
shader 7    631             417              654             485             23
shader 8    383             255              406             301             11
shader 9    894             630              1263            977             55
shader 10   807             553              819             617             43
shader 11   680             467              694             509             27
shader 12   446             319              263             186             13
shader 13   383             276              361             252             13
shader 14   446             316              399             280             18
shader 15   328             244              285             206             21
shader 16   314             224              336             244             8
shader 17   425             309              429             315             8
shader 18   56              39               40              30              2
shader 19   180             134              139             99              6
shader 20   57              41               47              33              3
shader 21   90              63               -               -               -
shader 22   119             96               204             154             14
shader 23   133             106              -               -               15
shader 24   80              67               143             108             118
shader 25   97              69               118             86              6
shader 26   93              67               123             89              6

Not surprisingly, the consumer level parts are the top performers here. The Quadro FX 4000 and FireGL X3-256 don't do a bad job of keeping up with their desktop counterparts. However, the Wildcat Realizm 200 puts in a very poor showing. In addition, the Realizm didn't render many of the shaders correctly. Granted, the Microsoft reference rasterizer does not create perfectly correct images either, but it comes close in most cases. Shadermark generates MSE (mean squared error) data for screenshots taken and compared against reference images. Both ATI and NVIDIA land between 0.5 and 1 in most tests. Not a single shader rendered on the Wildcat Realizm 200 had an MSE below about 2.5, and most shaders show very clear image quality issues.
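
For reference, the MSE metric itself is simple. Here is a minimal sketch of how a mean squared error between a rendered frame and a reference image can be computed, assuming tightly packed 8-bit RGB buffers; Shadermark's exact procedure may differ:

    /* Per-channel mean squared error between a rendered frame and a
       reference image of the same dimensions. This shows the shape of
       the metric, not Shadermark's exact implementation. */
    #include <stddef.h>

    double image_mse(const unsigned char *render,
                     const unsigned char *reference,
                     size_t width, size_t height)
    {
        size_t n = width * height * 3;     /* three channels: R, G, B */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            double diff = (double)render[i] - (double)reference[i];
            sum += diff * diff;            /* large errors count quadratically */
        }
        return sum / (double)n;            /* 0.0 would be a pixel-exact match */
    }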

With the image quality of the Wildcat Realizm in Doom 3 and Half-Life 2 being dead on with the other cards, and performance under Half-Life 2 not as bad as we expected, we have to wonder how much of the trouble in Shadermark would translate into actual applications. And by applications, we mean any application that allows the creation and/or visualization of HLSL or GLSL shaders. DCC workstation users are becoming more and more involved in the process of creating and designing complex shader effects. In order to maintain a firm position in the future of DCC workflow, 3Dlabs will need to assure smooth, accurate support of HLSL, no matter what application is running the code.

We will continue to evaluate programs for benchmarking GLSL performance. From what we have observed, the NVIDIA and 3Dlabs parts have an advantage over the ATI parts in GLSL performance. Unfortunately, we don't have any quantitative tests to bring to the table at this time.




Image Quality

The first issue that we will address is trilinear and anisotropic filtering quality. All three architectures support at least 8:1 anisotropic sampling, with ATI and NVIDIA including 16:1 support. We used the D3D AF Tester to examine the trilinear and anisotropic quality of each card and found quite a few interesting facts. NVIDIA does the least amount of pure trilinear filtering, opting for a "brilinear" method that filters bilinearly within mip levels and trilinearly only near the transitions between them. ATI's trilinear filtering seems a bit noisy, especially when anisotropic filtering is enabled. 3Dlabs does excellent trilinear filtering, but its anisotropic filtering algorithm is only really applied to surfaces oriented near horizontal or vertical.
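
For context, anisotropic filtering is requested per texture through the long-standing GL_EXT_texture_filter_anisotropic extension, which all three architectures expose. A minimal sketch of a trilinear-plus-8:1-anisotropic setup (the function name is our own):

    /* Request trilinear filtering plus 8:1 anisotropic sampling on a
       texture. The enum comes from GL_EXT_texture_filter_anisotropic. */
    #include <GL/gl.h>

    #ifndef GL_TEXTURE_MAX_ANISOTROPY_EXT
    #define GL_TEXTURE_MAX_ANISOTROPY_EXT 0x84FE
    #endif

    void set_texture_filtering(GLuint texture)
    {
        glBindTexture(GL_TEXTURE_2D, texture);

        /* Trilinear: linear filtering within and between mip levels. */
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER,
                        GL_LINEAR_MIPMAP_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

        /* Up to 8 samples taken along the axis of anisotropy. */
        glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_ANISOTROPY_EXT, 8.0f);
    }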

Of course, pictures are worth a thousand words:



This is the 3Dlabs card with 8xAF applied.



This is the ATI card with 16xAF applied.



This is the NVIDIA card with 16xAF applied.


Anisotropic filtering is employed less in professional applications than in games, but trilinear filtering is still very important. Since the key factor in trilinear filtering is to hide transitions between mip-map levels, and the NVIDIA card accomplishes this, we don't feel that this is a very large blow against the Quadro line. Of course, we would like to have the option of enabling or disabling this in the Quadro driver as we do in the consumer level driver. In fact, the option seems almost more important here, and we wonder why it is missing.

On the application side, we were able to use the SPECapc benchmarks to compare image quality between the cards, including under the custom drivers. First, we want to take a look at line AA quality. Looking at one of the images captured from the 3dsmax APC test, we can easily compare the quality of line AA among all three cards. Looking at the diagonal lines framing the camera's view volume, we can see that ATI does a better job of smoothing lines in general than either of the other two GPUs. These same lines look very similar on the NVIDIA and 3Dlabs implementations. Upon closer examination, however, the Quadro FX 4000 presents an advantage: horizontal and vertical lines have slightly less weight than on the other two architectures, which helps keep complex wireframe images from getting muddy. Take a look at what we're talking about:



The Wildcat Realizm 200 with line antialiasing under 3dsmax 6.



The Quadro FX 4000 with line antialiasing under 3dsmax 6.



The FireGL X3-256 with line antialiasing under 3dsmax 6.


We only noticed one difference between the capabilities of the cards when looking at either the standard OpenGL or the custom drivers. It seems that the 3Dlabs card is unable to support stipple patterns for lines (either that, or it ignores the hint under 3dsmax). Here's a screenshot of the resulting image, again from the 3dsmax APC test (the sub-object edges test).



The Quadro FX 4000 line stipple mask under 3dsmax 6.



The FireGL X3-256 line stipple mask under 3dsmax 6.



The Wildcat Realizm 200 line stipple mask under 3dsmax 6.


The Quadro FX 4000 earns big points for its line stippling quality. Line stippling is not a very widely used feature of OpenGL, but the fact that the 3Dlabs card doesn't even make an attempt (and the FireGL X3 support is quite pathetic) is not what we want to see at all. This is especially true in light of the fact that both of our consumer level cards were able to put out images of the same quality as the ATI workstation class card under the D3D driver.
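
For reference, both antialiased lines and stipple patterns are plain fixed-function OpenGL state, which is part of why the omission surprises us. A minimal sketch follows; the 0x0F0F dash pattern is an arbitrary example, not the mask 3dsmax actually requests:

    /* Antialiased, stippled line drawing in fixed-function OpenGL. */
    #include <GL/gl.h>

    void draw_dashed_edge(float x0, float y0, float x1, float y1)
    {
        /* Smooth (antialiased) lines require blending to be active. */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
        glEnable(GL_LINE_SMOOTH);
        glHint(GL_LINE_SMOOTH_HINT, GL_NICEST);

        /* Each bit of the 16-bit pattern maps to one pixel (factor 1). */
        glEnable(GL_LINE_STIPPLE);
        glLineStipple(1, 0x0F0F);

        glBegin(GL_LINES);
        glVertex2f(x0, y0);
        glVertex2f(x1, y1);
        glEnd();
    }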

Moving on to shader quality, we would like to mention again that GLSL shader quality on the 3Dlabs part is second to none. Since we don't have an equivalent to Shadermark in the GLSL world, we'll only take a look at HLSL shader support.

For ATI, 3Dlabs, and NVIDIA, we were running in ps2_0b, ps2_0a, and ps3_0 mode respectively. Taking a look at shader 15 from Shadermark v2.1, you can see that ATI and NVIDIA render the image slightly differently, while a bit of quantization is evident in the 3Dlabs image. This type of error was apparent in multiple shaders (though there were plenty that looked clean).



Quadro FX 4000



FireGL X3-256



Wildcat Realizm 200


We really do hope that, through driver revisions and a further push into the Microsoft and DirectX arena, 3Dlabs can bring its HLSL support up to the level of its GLSL rendering quality.




Final Words

The sheer amount of data contained in the review is overwhelming, and if you've made it this far, congratulations.

Architecturally, ATI and NVIDIA both base their workstation level parts on consumer level boards, while the 3Dlabs workstation-only approach is tried and true in the marketplace. The similarities between the architectures serve to validate all of the parts as high quality workstation solutions.

Among the disappointments that we suffered during testing was the lack of a GLSL benchmark that could balance out the picture we saw in Shadermark. The consumer-based architectures of ATI and NVIDIA have a natural bias toward HLSL support, while 3Dlabs has had little need to put much effort into optimizing its HLSL path. The firm grasp that OpenGL has as a standard among workstation applications goes well beyond inertia: the clean, state-driven approach of OpenGL is predictable, well defined, and powerful. It is only natural for 3Dlabs to support GLSL first and foremost, while NVIDIA and ATI cater to Microsoft before anyone else. We are working to solve this problem and hope to bring a solution to our next workstation article.

We also ran into an issue while testing our Quadro FX 4000 on the DK8N board. Running SPECviewperf without setting the affinity of the process to a single processor resulted in a BSOD (stop 0xEA) error. We are working with NVIDIA to determine the source of this issue.

In tallying up the final results of our testing today, we have to take a look at the situation from a couple of different perspectives.

The largest market in workstation graphics is the CAD/CAM market, and most large scale engineering and design firms have very large budgets for workstation components. In those cases, top productivity is sought at all times, and the top performing part for the application in use will be purchased with little regard for cost. As most of our benchmarks show, the NVIDIA Quadro FX 4000 is able to push ahead of the competition; notable exceptions are the EnSight and SolidWorks SPECviewperf viewsets. Generally speaking, if an engineer needs the highest performing AGP workstation part on the market today, he or she will need the Quadro FX 4000, and cost will be no object.

The DCC workstation market is smaller than the CAD/CAM segment, and it sees more small to mid-sized design houses. Here, cost is more of a factor than at a company that, for instance, designs cars. When looking at a workstation part, productivity is going to be an important factor, but price/performance is going to matter even more. With the 3Dlabs Wildcat Realizm 200 coming in just behind the Quadro FX 4000 in most cases, its significantly lower cost makes it a much better value for those on a budget. The street price of the Quadro FX 4000 is at least $700 more than either the Realizm 200 or the FireGL X3-256. That's almost enough to pick up a second 3Dlabs or ATI solution.

The ATI FireGL X3-256 is really targeted at an upper mid-range workstation position, and the performance numbers hit that target very solidly. The ATI part is, after all, a 12 pixel pipe solution clocked at 490MHz, while the high end consumer part from ATI is a 16 pixel pipe part clocked at 500MHz. Bringing out an AGP based solution derived from the XT line with 1.6ns GDDR3 (rather than the 2.0ns memory the X3 has) would very likely push ATI up in performance against its competition. It might simply be that ATI doesn't want to step on its FireGL V7100 PCI Express part, which is just about what we would want to see in a high end workstation solution. When all is said and done, the FireGL X3-256 is a very nice upper mid-range workstation card that is even able to top the high end AGP workstation parts in a benchmark or two. Its antialiased line support is faster and smoother looking than the competition in most cases, but when a lot of lines are piled on top of one another, the result can look a little blurrier than on the other two cards.

The real downside of the FireGL X3-256 is that we were able to find Wildcat Realizm 200 cards for lower prices. The FireGL parts are currently selling for very nearly their MSRP, which may indicate that ATI is having some availability issues even on the workstation side. With the 3Dlabs solution priced at the same level as the ATI solution, there is almost no reason not to go with the higher performing Wildcat Realizm 200.

But if your line of work requires the use of HLSL shaders, or you are a game developer hoping to do double-duty with DCC applications and in-engine tools, the 3Dlabs Wildcat Realizm 200 is not for you. GLSL shaders are quite well supported on the Realizm line, but anything having to do with HLSL runs very slowly. Many of the Shadermark shaders looked fine, but the more complex ones seemed to break down. This can likely be fixed through driver updates if 3Dlabs addresses its HLSL issues in a timely and efficient manner. If price/performance is an issue, a workstation part is called for, and HLSL is needed (say, you're with a game design firm and you want to test and run your HLSL shaders in your DCC application), then we can give a thumbs up to the FireGL X3-256.

We were also disappointed to see that the Wildcat Realizm didn't produce the expected line stippling under 3dsmax 6 SP1. There are line stipple tests in the SPECviewperf 8.0.1 benchmark that appeared to run fine, so we are rather surprised to see this. A fix for the flickering viewports when using the custom driver is also something that we want to see.

The final surprise of the day was how poorly the consumer level cards performed in comparison to the rest of the lineup. Even though we took the time to select the highest clocked monstrosities that we could find, there was nothing we could do to push past the workstation parts in performance most of the time. There were some cases where individual tests ran faster, but not the types of tests most used in workstation settings. Generally, pushing vertices and lines, accelerating OpenGL state and logic operations, supporting overlay planes, having multiple clip regions, supporting hardware 2-sided lighting in the fixed function pipeline, and all the other extra goodies of workstation class hardware just make these applications run a lot faster.
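
To make a couple of those "extra goodies" concrete, here is a minimal sketch of two of them, hardware two-sided lighting and color logic ops. Each is a couple of OpenGL state calls that workstation drivers tend to accelerate fully, while consumer parts typically handle them far more slowly:

    /* Two fixed-function features commonly leaned on by CAD/DCC apps. */
    #include <GL/gl.h>

    void enable_workstation_features(void)
    {
        /* Light back faces with a flipped normal instead of leaving
           them dark -- common for CAD models with unclosed geometry. */
        glEnable(GL_LIGHTING);
        glLightModeli(GL_LIGHT_MODEL_TWO_SIDE, GL_TRUE);

        /* XOR drawing lets a rubber-band selection rectangle be erased
           by simply drawing it a second time. */
        glEnable(GL_COLOR_LOGIC_OP);
        glLogicOp(GL_XOR);
    }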

On the high end of performance in the AGP workstation market, we have the NVIDIA Quadro FX 4000. The leader in price/performance for AGP workstations at the end of 2004 is the 3Dlabs Wildcat Realizm 200. Hopefully, 2005 and our first PCI Express workstation graphics review will be as exciting as this one.
