52 Comments
derrickg - Friday, September 30, 2011 - link
Would love to see them benchmarked using such a powerful machine.
JohanAnandtech - Friday, September 30, 2011 - link
Suggestions on how to get this done?
derrickg - Friday, September 30, 2011 - link
Simple benchmarking: http://www.linuxhaxor.net/?p=1346
I am sure there are much more advanced ways of benchmarking chess engines, but I have long since dropped out of those circles. Chess engines usually scale very well from 1P and up.
JPQY - Saturday, October 1, 2011 - link
Hi Johan, here is my link showing how people can test chess calculations in a very simple way!
http://www.xtremesystems.org/forums/showthread.php...
If you are interested you can always contact me.
Kind regards,
Jean-Paul.
JohanAnandtech - Monday, October 3, 2011 - link
Thanks Jean-Paul, Derrick, I will check your suggestions. Great to see the community at work :-).
fredisdead - Monday, April 23, 2012 - link
http://www.theinquirer.net/inquirer/review/2141735...
Dear god, at last the truth. Interlagos is 30% faster.
Hey Anand, what's up with YOUR testing?
fredisdead - Monday, April 23, 2012 - link
Everybody, the Opteron is 30% faster: http://www.theinquirer.net/inquirer/review/2141735...
Follow the Intel ad bucks ... lol
anglesmith - Friday, September 30, 2011 - link
I was in a similar situation on a 48-core Opteron machine.
Without NUMA awareness my app was half as fast as a 4-core i7 920. I then did a test with the same number of threads but with only 2 sockets (24 cores), and the app became faster than with 48 cores :~
It turned out the issue was all about NUMA, which is not a big issue if you are using a 2-socket machine.
Once I coded the app to be NUMA aware, it became 6 times faster.
I know there are few apps that are both NUMA aware and scale to 50 or so cores, but ...
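For readers curious what "coding the app to be NUMA aware" can look like in practice, here is a minimal sketch (my own illustrative example, not anglesmith's code) of the common first-touch technique on Linux: each OpenMP thread initializes the pages it will later compute on, so those pages are allocated on that thread's local node.

```c
/* Minimal first-touch NUMA sketch (illustrative; array size and loop body
 * are assumptions). On Linux, a page is physically placed on the node of
 * the thread that first writes it, so initializing with the same static
 * schedule as the compute loop keeps most accesses node-local. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* ~512 MB of doubles */

int main(void)
{
    double *a = malloc(N * sizeof *a);   /* virtual allocation only, no pages yet */
    if (!a) return 1;

    #pragma omp parallel for schedule(static)   /* first touch: place pages locally */
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for schedule(static)   /* same schedule: mostly local accesses */
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 0.5 + 1.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```

The quick fix without touching the code is to run the unmodified binary under `numactl --interleave=all`, which at least spreads the pages evenly across nodes instead of piling them all on node 0.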
tynopik - Friday, September 30, 2011 - link
benhcmarklike it Phenom
JoeKan - Friday, September 30, 2011 - link
I'd love to see single-CPU workstations used as baseline comparisons. When using a server to render, I'd be wondering which would be more cost effective for rendering animations. Maybe use an animation sequence as a render performance test.
MrSpadge - Friday, September 30, 2011 - link
Agreed - the performance of a single i7 2600 can be hard to beat, depending on the application. My Matlab code uses all physical cores through the Intel Math Kernel Library, yet is ~30% slower on 2 x X5570 (which is about the difference in clock speed, incidentally).
MrS
JohanAnandtech - Friday, September 30, 2011 - link
http://www.anandtech.com/show/4486/server-renderin...
The Core i7-970 3.2 GHz is included. But indeed, it has been some time since we have used Backburner.
Is this the kind of bench you are looking for?
http://www.anandtech.com/show/2240/7
Backburner scales extremely well, so I suspect that especially the Quad MC Dell is a very good choice compared to a workstation.
JoeKan - Friday, September 30, 2011 - link
Yes - the Backburner test is it. Although I use different rendering software, that test would be appropriate, as the visualization rendering can properly represent real-life usage and stress the hardware at the same time.
The test linked uses frames 20-29. I'd like to see a longer frame sequence.
The reason I asked that a workstation be used as a base reference is that it gives us, the readers, a point of reference to compare against. I define a workstation as a single-CPU box anyone can build with off-the-shelf components, like an i7-2600K or an i7-970 - a performance CPU in the $300+ to $600 range. That allows one to compare performance on a per-$ basis.
It's not a true 'workstation' as it does not use a Xeon, but it gives the ability to compare 'performance' on a 'performance per buck' basis.
By using a $1000+ class CPU for comparison, the 'bang for the buck' comparison is distorted.
xxtypersxx - Friday, September 30, 2011 - link
I love reading about the high-end server hardware; it's like F1 compared to road cars.
As for benchmarks, may I suggest the Linux x64 Folding@home client? We know it scales past at least 128 cores without issue, and as many of us that fold are running server hardware anyway, it will attract a new audience to the reviews.
rehm - Friday, September 30, 2011 - link
Hello,
For CFD benchmarking you could also consider the code OpenFOAM. It scales very well and is gaining a lot of interest in industry and academia. Memory behaviour should be comparable to Fluent and it can be compiled with gcc and icc.
Regards
JohanAnandtech - Friday, September 30, 2011 - link
Very nice suggestion... but is there a sample solution/benchmark we can measure? It is a bit hard for a hardware reviewer to come up with very specialized real-world tests :-).
ozztheforester - Friday, September 30, 2011 - link
I am currently using a bunch of 2600Ks for rendering. In the past I used some dual-Xeon setups, but found them extremely inefficient in terms of cost/performance. Can you please let us know the cost and power consumption of this system?
I'm currently getting around 8.72 points in Cinebench 11.5 on a 2600K PC @ 4.5GHz, which consumes less than 200 watts at full load and cost a bit less than 800 USD.
I would also suggest using V-Ray for multi-threaded benchmarks.
sicofante - Friday, September 30, 2011 - link
Why didn't you set up a scene in Maya or Softimage and then render it with Mental Ray? THAT would be a professional test; Cinebench is not.
BTW, no matter how powerful, these Xeon E7 systems are a no-go for studios. They are plainly uneconomical. You can have a much more sensible setup by putting ordinary Xeons or overclocked Core i7s in many racks, i.e., a rendering farm.
(Note: I build rendering farms for studios. Since 3D rendering scales almost linearly with frequency, what matters in the end is Euros/GHz, that is, normalized GHz.)
Phynaz - Friday, September 30, 2011 - link
What studio renders on overclocked desktop CPUs?
confusis - Friday, September 30, 2011 - link
My studio does. We can't yet step up to a higher-end multi-socket rendering server (finances, start-up company), so we make do with Phenom II X4s. A desktop box is good value for money at our end of the company scale. Once we grow we'll be looking at Interlagos, however.
Phynaz - Friday, September 30, 2011 - link
If you are overclocking in a business environment, what other moronic decisions has your company made?
When does the going-out-of-business sale start?
MrSpadge - Friday, September 30, 2011 - link
Don't condemn him blindly. By overclocking they can get substantially more performance from a similar budget. That's more efficiency - if done right.
The question is "what happens in case of failure?" If it's just a crashing machine, the rendering can be repeated by another one and this machine can be tuned down a bit. If it's a visual artefact during rendering, the rendering can be repeated by any machine and this machine tuned down a bit. What else could go wrong in rendering? Obviously you wouldn't want to OC your web server or database..
BTW: there was an article here some time ago, showing Cyrix doing their testing on OC'ed i7s.
MrS
Kvarta - Tuesday, October 4, 2011 - link
Don't be so sure. Recently you can see standard desktop CPUs beating expensive Xeons in professional applications. Example: http://www.solidworks.com/sw/support/shareyourscor...
So you don't need to buy a very expensive Dell or other workstation; instead, go to the PC boutique around the corner :)
JohanAnandtech - Saturday, October 1, 2011 - link
This was not meant to be a professional rendering test. It was more an experiment to give the enthusiasts an idea of what these servers are capable of. If you have a suggestion on which animation we should use in our benchmarking scenarios, let me know. I have a solid background in the "web - database - virtualization" field (I have been active in the field for more than 10 years now, teaching and consulting), but rendering and HPC are something I only know from a benchmarking background :-).
WeaselITB - Friday, September 30, 2011 - link
I'll echo the other sentiments here. If a Xeon system renders something twice as fast as the Opteron system but takes five times the power draw to do it, it's a net win for the Opteron system. Performance/watt would be a useful metric in these comparisons, especially as systems like these will be going into data rooms where excess wattage = excess heat = excess money.
I would also be interested to know what a comparison would look like between a "big iron" system like this versus a "traditional" render farm composed of some Core i7 machines.
Awesome review, though! I'm especially happy with the fact that you didn't just say "Oh, the Opteron kinda sucks in this test. Oh well." but actually took a deeper look into what's going on with the benchmark and the workload. THAT's the type of analysis that makes me keep coming back to AnandTech. :-)
Thanks,
-Weasel
JarredWalton - Friday, September 30, 2011 - link
Pretty sure perf/Watt isn't going to be in Opteron's favor, but there's a lot of stuff you need to account for. Johan did some measurements of power use on these servers previously (http://www.anandtech.com/show/4285/6), but as pointed out the Intel setup has a lot more RAS features and such that could be adding to the power use. Even so, the "load" power measured (using vApus, which may use less than something like 3D rendering) is around 875W (HT off) to 920W for the Intel E7-4870 server compared to around 590W for the Opteron 6174 server.
In terms of perf/Watt, if those figures are relatively close for the benchmarks Johan has done here, then the best scores in Cinebench give 0.0355 CB/Watt for the E7-4870 vs. 0.0425 CB/Watt for the Opteron 6174 -- and again, note that the 64-thread limit (tested with 40) means CB11.5 isn't able to make maximum use of the Intel platform. For the second test, best-case we measure 0.0194 CFD/W for Intel compared to 0.0145 CFD/W for AMD. So AMD wins in 3D rendering by 20% and Intel wins in the Euler3D CFD test by 34%, at least given the current estimates.
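A quick check of those ratios, using only the per-watt figures quoted above:

$$\frac{0.0425\ \mathrm{CB/W}}{0.0355\ \mathrm{CB/W}} \approx 1.20, \qquad \frac{0.0194\ \mathrm{CFD/W}}{0.0145\ \mathrm{CFD/W}} \approx 1.34,$$

which is where the "AMD by 20% in rendering" and "Intel by 34% in CFD" numbers come from.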
My gut feeling is that if all other elements and features were identical, other than the necessary chipset and CPU differences (e.g. the PSU, amount of RAM, HDDs, fans, RAS features, etc.), the difference in power draw for the two platforms should be within 100W, not the up to 340W spread measured in the earlier article. (There's also a 310W difference at idle, which gives some indication of all the other things that appear to be running on the Intel setup, as normal idle power looking at just the CPUs should be very nearly equal.) So these figures I list here are specific to the Intel Quanta QSCC-4R and Dell PowerEdge R815 and may not hold for other AMD/Intel servers. In other words, take with a grain of salt.
RandomUsername3245 - Friday, September 30, 2011 - link
The Intel compiler is a very good compiler for Intel CPUs, but in the past it was well known for producing poor-quality binaries for non-Intel CPUs. I would still be wary of benchmarking Intel vs. AMD when running code compiled with Intel's compiler.
FWIW, I heard a while ago that Intel was "officially" going to stop artificially penalizing AMD CPUs that run Intel-compiled code.
James5mith - Friday, September 30, 2011 - link
Just a note: we are doing some in-house testing of high-end databases using solid-state storage connected via InfiniBand to multi-socket servers.
An example:
Dell R910
4x 8C/16T X7560 2.26GHz Xeon CPU (32C/64T total)
512GB RAM
2x 146GB 15K SAS hdd's in RAID1 (OS)
2x Mellanox QDR Infiniband 40Gbps adapters
Hooked up to some seriously fast external flash storage, we got around 6GB/s+. This allowed us to do massively multi-threaded workloads, like building an index on a 2TB database.
During these tests, we can max out all 64 Threads and put the entire box under 100% load. It was during these tests we found out that Dell has a flawed implementation of the Intel SpeedStep technology which keeps the fans from ramping up under load.
Without the fast storage, we could never have fully stress tested the box.
mczak - Friday, September 30, 2011 - link
I think part of why the Opteron has bad scaling without interleaving and the Xeon does not is not just due to the coherence engine.
Don't forget that while both have 4 sockets, the Opteron is an 8-node system. The article states that there are "4 memory controllers" and "3 out of 4 operations traverse the HT link", which isn't really true, as there are 8 memory controllers (and 7 out of 8 operations traverse HT, though some of them are internal HT links).
You can see that this makes a difference in the bad scaling from 6 to 12 threads (though not as bad as with even more threads...).
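The general rule behind those fractions: with memory pages interleaved uniformly across $N$ NUMA nodes, only $1/N$ of accesses land in local memory, so

$$P(\mathrm{remote}) = \frac{N-1}{N} = \begin{cases} 3/4, & N = 4 \ \text{(the article's assumption)} \\ 7/8, & N = 8 \ \text{(the dual-die Opteron reality)} \end{cases}$$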
extide - Friday, September 30, 2011 - link
Don't forget the Xeon E7 is 4 sockets with 4 memory channels each.
mino - Saturday, October 1, 2011 - link
Memory channel count has nothing to do with coherency traffic.
mino - Saturday, October 1, 2011 - link
Exactly. Actually, the optimized way would normally be to split the workload into 12-thread chunks on Opterons and 20-thread chunks on Xeons. That is also a reason why 4S machines are rarely seen in HPC.
lelliott73181 - Friday, September 30, 2011 - link
For those of us out there that are seriously into doing distributed computing projects, it'd be cool to see a bit of information on how these systems scale in terms of programs like BOINC, Folding@home, etc.
MrSpadge - Friday, September 30, 2011 - link
Scaling is pretty much perfect there, not very interesting. It may have been different back in the days when these big iron systems were starved for memory bandwidth.
MrS
fic2 - Friday, September 30, 2011 - link
Was hoping for some Bulldozer server benchmarks since the server chips are "released". ;o)
Didn't really think that I would see them though.
rahvin - Friday, September 30, 2011 - link
Have you considered that the Opteron problem could be because the software is compiled with the Intel compiler, which disables advanced features if it doesn't detect an Intel processor? This is a common problem: the ICC compiler sets flags such that if it doesn't find an Intel processor, it turns off SSE and all the processor extensions and runs the code in x86 compatibility mode (very slow). Any time I see results that are that drastically off, it reads to me like the software in question is using the Intel compiler.
Chibimyk - Friday, September 30, 2011 - link
Ifort 10 is from 2007 and is not aware of the architectures of any of these machines. It doesn't support the latest SSE instructions and likely doesn't know the levels of SSE supported by the CPUs. You have no idea which math libraries it is linked to. It won't be using the latest Intel MKL, which supports the newest chips. It isn't using the AMD-optimized ACML libraries either.
What you are comparing using these compiled binaries is the performance of both systems when running Intel-optimized code.
You also have no idea of the levels of optimization used when compiling. Some of the highest optimization speed increases with the Intel compilers drop ANSI accuracy, or at least used to. Whether this impacts results is application specific.
Generally speaking:
Intel chips are fastest with Intel compilers and Intel MKL.
AMD chips are fastest with the Portland Group compilers and AMD ACML.
Some code runs faster with the Goto BLAS libraries.
Ideally you want to compare benchmarks with each system under ideal conditions.
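To make the dispatching issue raised above concrete: a runtime dispatcher of the kind rahvin describes keys its code-path choice off the CPUID vendor string rather than off the advertised feature bits. Here is a rough sketch of the mechanism (my own illustration using GCC's <cpuid.h>, not ICC's actual source):

```c
/* Illustrative sketch (not ICC's source): a dispatcher that selects the
 * optimized path only when the CPUID vendor string is "GenuineIntel",
 * even though the SSE2 feature bit alone would be enough to decide. */
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    __get_cpuid(0, &eax, &ebx, &ecx, &edx);      /* leaf 0: vendor string */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    __get_cpuid(1, &eax, &ebx, &ecx, &edx);      /* leaf 1: feature flags */
    int has_sse2 = (edx >> 26) & 1;              /* EDX bit 26 = SSE2 */

    if (strcmp(vendor, "GenuineIntel") == 0 && has_sse2)
        puts("fast SSE2 path");
    else
        puts("generic x86 path");                /* what the commenter describes for AMD */

    return 0;
}
```

A dispatcher that checked only the feature bit (the has_sse2 test alone) would take the fast path on any capable CPU, which is the behavior the earlier "FWIW" comment says Intel agreed to move toward.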
eachus - Saturday, October 1, 2011 - link
Definitely true about AMD chips and the Portland Group. I get slightly better results with GCC than the Intel compiler, partly because I know how to get it to do what I want. ;-) But Portland is still better for Fortran.
Second, there is a way to solve the NUMA problem that all HPC programmers know. Any (relatively static) data should be replicated to all processors. Arrays that will be written to by multiple threads can be duplicated with a "fire and forget" strategy, assuming that only one processor is writing to a particular element (well, cache line)* in the array between checkpoints. In this particular case, you would use (all that extra) memory to have eight copies of the (frequently modified) data.
Next, if your compiler doesn't use non-temporal memory references for random access floating-point data, you are going to get clobbered just like in the benchmark. (I'm fairly sure that the Portland Group compilers use PrefetchNTA instructions by default. I tend to do my innermost loops by hand on the GCC back end, which is how I get such good results. You can too--but you really need to understand the compiler internals to write--and use--your own intrinsic routines.) What PrefetchNTA does is two things: first, it prefetches the data if it is not already in a local cache. This can be a big win. What kills you with Opteron NUMA fetches is not the Hypertransport bandwidth getting clogged, it is the latency. AMD CPUs hate memory latency. ;-)
The other thing that PrefetchNTA does is to tell the caches not to cache this data. This prevents cache pollution, especially in the L1 data cache. Oh, and don't forget to use PrefetchNTA before writing to part of a cache line. This is where you can really get hit. The processor has to keep the data to be stored around until the cache line is in a local cache. (Or in the magic zeroth level cache AMD keeps in the floating point register file.) Running out of space in the register file can stall the floating point unit when no more registers are available for renaming purposes.
Oh, and one of those "interesting" features of Bulldozer for compiler gurus is that it strongly prefers to have only one NT write stream at a time. (Reading from multiple data streams is apparently not an issue.) Just another reason we have to teach programmers to use cache-line-aligned records for data, rather than many different arrays with the same dimensions. ;-)
* This is another of those multi-processor gotchas that eats up address space--but there is plenty to go around now that everyone is using 64-bit (actually 48-bit) addresses. You really don't want code on two different CPU chips writing to the same cache line at about the same time, even if the memory hardware can (and will) go to extremes to make it work.
It used to be that AMD CPUs used 64-byte cache lines and Intel always used 256-byte lines. When the hardware engineers got together for, I think, the DDR memory standard, they found that AMD fetched the "partner" 64-byte line if there was no other request waiting, and Intel cut fetches at 128 bytes if there was a waiting memory request. So it turned out that the width of the cache line inside the CPUs was different, but in practice most main memory accesses were 128 bytes wide no matter whose CPU you had. ;-) Anyway, fluid-flow software tends to have 48 bytes or so per data point (six DP values: x, y, z and x', y', z'). Aligning to 64-byte boundaries is good, 128 bytes is better, and you may want to try 256 bytes on some Intel hardware...
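As a rough sketch of how those two suggestions might look in code (my own illustrative example with assumed field names, not from the article): a per-point record padded to a full 64-byte cache line, plus a non-temporal prefetch issued a few elements ahead of use.

```c
#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

/* Six doubles (x, y, z and x', y', z') = 48 bytes, padded and aligned to a
 * full 64-byte cache line so two sockets never share a line between points. */
typedef struct {
    double pos[3];
    double vel[3];
    char   pad[16];
} __attribute__((aligned(64))) point_t;

static double sum_speed_sq(const point_t *p, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        /* Non-temporal prefetch a few points ahead: fetch without polluting caches. */
        if (i + 8 < n)
            _mm_prefetch((const char *)&p[i + 8], _MM_HINT_NTA);
        s += p[i].vel[0] * p[i].vel[0]
           + p[i].vel[1] * p[i].vel[1]
           + p[i].vel[2] * p[i].vel[2];
    }
    return s;
}

int main(void)
{
    long n = 1L << 20;
    point_t *pts = aligned_alloc(64, n * sizeof *pts);
    for (long i = 0; i < n; i++) {
        pts[i].pos[0] = pts[i].pos[1] = pts[i].pos[2] = 0.0;
        pts[i].vel[0] = pts[i].vel[1] = pts[i].vel[2] = 1.0;
    }
    printf("sum of squared speeds: %f\n", sum_speed_sq(pts, n));
    free(pts);
    return 0;
}
```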
mino - Saturday, October 1, 2011 - link
You deserve the paycheck for this article!
Howgh.
UrQuan3 - Monday, October 3, 2011 - link
I'd like to add one to the request for a compiler benchmark. It might go well with the HPC study. The hardest part would, of course, be finding an unbiased way to conduct it. There are just so many compiler flags that add their own variables. Then you need source code.
If you do decide to give it a try, Visual Studio, GCC, Intel, and Portland would be a must. I don't know how AnandTech would do it, but I've been impressed before.
jaguarpp - Friday, September 30, 2011 - link
What if, instead of using a full program, you create a small test program that is compiled for each platform? Something like:
declare variables - ints, floats, arrays - to test different workloads;
put the variables in loops and do some operations - sum, div - on the integers, then the floats, and so on; measure the time it takes to exit from each block.
The hardest part will be how to make it threadable, and getting access to different compilers - maybe a friend?
Anyway, great article. I really enjoyed it, even though I'll never get close to that class of hardware.
Thanks very much for the reading.
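For what it's worth, a rough sketch of the kind of test program described above might look like this (loop bodies, sizes, and the OpenMP threading are my own assumptions); the same source would then be built with each compiler being compared:

```c
/* Rough sketch of a compiler micro-benchmark: time an integer block and a
 * floating-point block, multi-threaded with OpenMP. */
#include <stdio.h>
#include <omp.h>

#define N 200000000L

static void bench_int(void)
{
    double t = omp_get_wtime();
    long long acc = 0;
    #pragma omp parallel for reduction(+:acc) schedule(static)
    for (long i = 1; i <= N; i++)
        acc += i / 3 + i * 7;                      /* integer div/mul/add mix */
    printf("int block: %.3f s (acc=%lld)\n", omp_get_wtime() - t, acc);
}

static void bench_fp(void)
{
    double t = omp_get_wtime();
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc) schedule(static)
    for (long i = 1; i <= N; i++)
        acc += (double)i / 3.0 + (double)i * 7.0;  /* FP div/mul/add mix */
    printf("fp  block: %.3f s (acc=%g)\n", omp_get_wtime() - t, acc);
}

int main(void)
{
    bench_int();
    bench_fp();
    return 0;
}
```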
Michael REMY - Friday, September 30, 2011 - link
Very interesting analysis, but... why use a Cinebench score instead of a render time?
A time result is more meaningful to common and pro users than an integer score!
MrSpadge - Friday, September 30, 2011 - link
Because time is totally dependent on the complexity of your scene, output resolution, etc. And the score can be directly translated into time if you know the time for any of the configurations tested.
MrS
Casper42 - Friday, September 30, 2011 - link
Go back to Quanta and see if they have a newer BIOS with the Core Disable feature properly implemented. I know the big boys are now implementing the feature, and it allows you to disable as many cores as you want as long as it's done in pairs. So your 10-core proc can be turned into 2/4/6/8-core versions as well.
So for your first test where you had to turn HT off because 80 threads was too much, you could instead turn off 2 cores per proc and synthetically create a 4P/32C server and then leave HT on for the full 64 threads.
alpha754293 - Sunday, October 2, 2011 - link
"Hyper-Threading offers better resource utilization but that does not negate the negative performance effect of the overhead of running 80 threads. Once we pass 40 threads on the E7-4870, performance starts to level off and even drop."It isn't thread-locking that limits the performance. It isn't because it has to sync/coordinate 80-threads. It's because there's only 40 FPUs available to do the actual calculations on/with.
Unlike virtualization, where thread locking is a real possiblity because there really isn't much in the way of underlying computations (I would guess that if you profiled the FPU workload, it wouldn't show up much), whereas for CFD, solving the Navier-Stokes equations requires a HUGE computational effort.
it also depends on the means that the parallelization is done, whether it's multi-threading, OpenMP, or MPI. And even then, within different flavors of MPI, they can also yield different results; and to make things even MORE complicated, how the domain is decomposed also can make a HUGE impact on performance as well. (See the studies performed by LSTC with LS-DYNA).
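As a toy illustration of the domain-decomposition side of that (my own 1-D sketch, not from LSTC or the article): each MPI rank owns a slab of the domain plus ghost cells, and the halo exchange below is the communication whose size and pattern the decomposition determines.

```c
/* Toy 1-D domain decomposition: each rank owns LOCAL_N cells plus one ghost
 * cell per side; the halo exchange fills the ghosts from the neighbours. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[LOCAL_N + 2];                    /* u[0] and u[LOCAL_N+1] are ghosts */
    for (int i = 0; i <= LOCAL_N + 1; i++)
        u[i] = rank;                          /* dummy initial data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my boundary cells, receive the neighbours' into my ghost cells. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("halo exchange done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```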
alpha754293 - Sunday, October 2, 2011 - link
Try running Fluent (another CFD code) and LS-DYNA.
CAUTION: both are typically VERY time-intensive benchmarks, so you have to be very patient with them.
If you need help in setting up standardized test cases, let me know.
alpha754293 - Sunday, October 2, 2011 - link
I'm working on converting an older CFX model to Fluent for a full tractor-trailer aerodynamics run. The last time I ran that, it had about 13.5 million elements.
deva - Monday, October 3, 2011 - link
If you want something that currently scales well, Terra Vista would be a good bet (although it is expensive).
Have a look at the Multi Machine Build version.
http://www.presagis.com/products_services/products...
"...capability to generate databases of
100+ GeoCells distributed to 256 individual
compute processes with a single execution."
That's the bit that caught my eye and made me think it might be useful to use as a benchmarking tool.
Daniel.
mapesdhs - Tuesday, October 4, 2011 - link
Have you guys considered trying C-ray? It scales very well with the number of cores, benefits from as many threads as one can throw at it, and the more complex version of the example render scene stresses RAM a bit as well (the small model doesn't stress RAM at all, deliberately so). I started a page for C-ray (Google "c-ray benchmark", 1st link) but discovered recently it's been taken up by the HPC community and is now part of the Phoronix Test Suite (Google "c-ray benchmark pts", 1st link again). I didn't create C-ray, btw (creds to John Tsiombikas); I just took over John's results page.
Hmm, don't suppose you guys have the clout to borrow or otherwise get access to an SGI Altix UV? It would be fascinating to see how your tests scale with dozens of sockets instead of just four, e.g. the 960-core UV 100. Even a result from a 40-core UV 10 would be interesting. It's a shared-memory system, so latency isn't an issue.
Ian.
shodanshok - Wednesday, October 5, 2011 - link
Hi Johan,
Thank you for the very interesting article.
The Hyper-Threading ON vs. OFF results somewhat surprise me, as Windows Server 2008 should be able to prioritize hardware cores over logical ones. Was this the case, or did you see that logical processors were used before the hardware cores were fully utilized? If so, you probably encountered a corner case where extensive hardware sharing (and contention) between two threads produces lower aggregate performance.
Regards.
proteus7 - Tuesday, October 11, 2011 - link
STREAM triad on a 4S Xeon E7 should hit about 65GB/s, unless your memory or UEFI/BIOS options are misconfigured. Firmware settings can make a HUGE difference on these systems.
Did you:
Enable Hemisphere mode?
Disable HT?
If running Windows, assume it was Server 2008 R2 SP1?
If running Windows, realize that only certain applications, compiled with specific flags, will work with core counts over 64 (kgroup0)? Not an issue if HT was off.
Enable prefetch modes in firmware?
Ensure system firmware was set to max performance, and not power-saving modes?
If running Windows, set power options to the max performance profile? (The default power profile on servers drops performance substantially for short-burst benchmarks.)
TPC-E is also a great benchmark to run (needs some SSD storage/Fusion I/O). HPCC/Linpack are good for HPC testing.
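For reference, the kernel behind that 65GB/s figure is the "triad" loop from John McCalpin's STREAM benchmark. A stripped-down sketch (sizes and timing are my own illustrative choices, not the official benchmark code):

```c
/* Stripped-down sketch of the STREAM "triad" kernel. Triad moves 24 bytes
 * per iteration: read b[i], read c[i], write a[i]. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000L   /* ~160 MB per array, large enough to defeat caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    #pragma omp parallel for schedule(static)   /* also serves as first-touch placement */
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    double t = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];            /* the triad */
    t = omp_get_wtime() - t;

    printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / t / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```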
pventi - Monday, October 31, 2011 - link
As you can read in the icc manual, when running on non-Intel processors the non-temporal prefetches are not emitted in the final machine code. This alone means it could be up to 27% faster.
Another reason why it's slower is that the "standard" hardware configuration of the Opteron throttles the DRAM prefetchers when under load. Under Linux this behaviour can be changed from the shell and should add another 5~10% increase in performance.
So this benchmark should show a roughly 30% higher number for the Opteron.
www.metarstation.com
Best Regards
Pierdamiano