I had no idea you were so adept with mathematics. "Consider a point in space..." Reading this brought me back to Finite Element Analysis in college! I am very impressed. Being an ME I would have preferred some flow models using the Navier-Stokes equations, but hey, I like chemistry as well.
I never did any FEM so wouldn't know where to start. The next angle of testing would have been using a C++ AMP Fluid Dynamics Simulation and adjusting the code from the SDK example like with the n-Body testing. If there is enough interest, I could spend a few days organising it for the normal motherboard reviews :)
A few members of the Overclock.net HWBot team helped testing by running my benchmark while they were using DICE/LN2/Phase Change for overclocking contests (i.e. not 24/7 runs). The i7-3770K will go over 7 GHz if (a) you get a good chip, (b) cool it down enough, and (c) know what you are doing. If you're interested in competitive overclocking, head over to HWBot, Xtreme Systems or Overclock.net - there are plenty of people with info to help you get started.
The incredible performance of those overclocked Ivy Bridge systems here really hammers home the importance of raw IPC. You can spend a lot of time optimizing code, but IPC is free speed when it's available.
You might try modifying your algorithm to pin the data to a specific core (therefore cache) to keep the thrashing as low as possible. Google "processor affinity c++". I will admit this adds complexity to your straightforward algorithm. In C#, I would use a parallel loop with a range partition to do it as a starting point: http://msdn.microsoft.com/en-us/library/dd560853.a...
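In plain C++ with OpenMP on Windows, something along these lines is what I mean (just a sketch - the helper, mask handling and workload are made up, and it assumes at most 64 logical CPUs):

```cpp
#include <windows.h>
#include <omp.h>

// Sketch: pin each OpenMP worker to its own logical core so the data it
// touches tends to stay in that core's caches.
void pinned_parallel_work(float* data, int n)
{
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        // Affinity mask with only bit 'tid' set (valid for the first 64 logical CPUs).
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << tid);

        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            data[i] *= 2.0f;   // placeholder workload
    }
}
```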
Mr. Cutress, do you think, with all the virtualized CPUs available, researchers will still build their own systems as something concrete to put into a grant application, versus the power-by-the-hour of cloud computing?
We examined both scenarios. Our university had cluster time to buy, and there is always the Amazon cloud. In our calculation, getting a 16 thread machine from Dell paid for itself in under six months of continuous running, and would not require a large adjustment in the way people were currently coding (i.e. staying in Windows rather than moving to Linux), and could also be passed down the research group when newer hardware is released.
If you are using production level code and manipulating it each time to get results, and you can guarantee the results will be good each time, then power-by-the-hour could work. As we were constantly writing and testing new code for different scenarios, the build/buy your own workstation won out. Having your own system also helps in building GPU codes, if you want to buy a better GPU card it is easier to swap out rather than relying on a cloud computing upgrade.
One big consideration is who the researchers are. I work in x-ray spectroscopy (as a computational theorist). Experimentalists in this field use some of our codes without wanting to bother with having big computational resources. We have looked at trying to provide some of our codes through some cloud-based service so that it can be used on demand.
Otherwise I would agree with Ian's reply. When I'm improving code, debugging code, or trying to implement new theoretical approaches I absolutely want my own hardware to do it on.
How much difference do you think Xeon Phi will make in these very different types of computations? Will buying a Xeon Phi "pay itself out" as you said in the above comments? (Or is Xeon Phi Linux only?)
As far as we know, Xeon Phi will be released for Linux only to begin with. I have friends who have been able to play with them so far, and they are getting 700+ GFLOPS in double-precision DGEMM.
It always comes down to the algorithm with these codes. It seems that if you have single precision code that doesn't mind being in a 2P system, then the GPU route may be preferable. If not, then Phi is an option. I'm hoping to get my hands on one inside H1 this year. I just have to get my hands dirty with Linux as well.
In terms of the codes used here, if I were to guess, the Implicit Finite Difference would probably benefit a lot from Xeon Phi if it works the way I hope it does.
Rather stupid question, but have you tried using PGO builds ? Also, do you build the code with the default optimizations, or use the MSVC equivalent switch of -O2 ?
Using Visual Studio 2012, all the speed optimisations were enabled, including /GL, /O2, /Ot and /fp:fast. For each part I analysed the sections which took the most time using the Performance Analysis tools, and tried to avoid the long memory reads. Hence the Ex-FD uses an iterative loading which actually boosts speed by a good 20-30% compared to running without it.
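For anyone building from the command line instead of the IDE, a roughly equivalent set of switches (assuming a single-file OpenMP build; the file name is just an example) would be:

```
cl /O2 /Ot /GL /fp:fast /openmp /EHsc benchmark.cpp /link /LTCG
```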
I avoided /Ox in case it performs an optimisation for memory over speed in an attempt to balance optimisations. As speed is priority #1, it made more sense to me to optimise for that only. If VS2012 gave more options, I'd adjust accordingly.
I hadn't heard of VTune, but I did use the Performance Analysis tools in VS2012 to optimise certain parts of the code.
Business and mobo makers do not use 2P mobos to get high benches or performance bragging rights per se. These systems are built for bullet-proof reliability and uptime. It does no good for a mobo/system to be 3% faster if it crashes while running a month-long analysis. These 2P mobos are about 100% reliability, something rarely found in an enthusiast's mobo.
Enterprise mobos are rarely sold by enthusiast marketeers. Newegg has a few enterprise mobos listed primarily because they have started a Newegg Biz website to expand their revenue streams. They don't have much in the line of true enterprise hardware however. It's a token offering because manufacturers are not likely to support whoring of the enterprise market lest they lose all of their quality vendors who provide customer technical product support.
Actually, ASUS Z9PE-D8 WS allows for some overclocking capabilities.
CPU overclocking with 2P/4P Xeon E5 (2600/4600 sequence) is a no-go because Intel explicitly did not store proper ICC data so it is impossible to manipulate BCLK meaningfully (set the different ratios). Oh, and the multipliers are locked :)
However, Z9PE D8 WS allows memory overclocking - I managed to run 100% 24/7 stable with the Samsung ECC 1600 DDR3 "low voltage" RAM (16 GB sticks) - just switching memory voltage from 1.35v to 1.55v allows overclocking memory from 1600 MHz to 2133 MHz.
Why would anyone want to do that in a scientific or B2B environment? The only usage I can see is applications where memory I/O is the biggest bottleneck. Large-scale neural simulations are one such application, and getting 10 GB/s more memory I/O can help a lot - especially if stable.
Also, low-latency trading applications are known to benefit from overclocked hardware and it is, in fact, used in production environment.
Modern hardware does tend to have larger headroom between the manufacturer's operating point and the limits - if the benefit from an overclock outweighs the work invested in finding the point where the results become unstable - and, of course, the shorter life span of the hardware - then it can be used. And it is used, for example in some trading scenarios.
As a fellow chemist, I must say that you have to be some kind of nut to want to do computational/physical chemistry. If you need me, I'll be at the bench!
I didn't read the article in full but what I did read was top notch. I found your simulations and mathematics very interesting. I took a physics class which was focused on writing code to run mathematical simulations. Using the given Java library, I wrote my own code to calculate pi. When I returned from the gym the program had calculated 3.1. I then re-wrote the program from scratch, ditched the built-in libraries, and reran it: I had 20 decimals in 30 seconds - an epic improvement.
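Just to illustrate how much the algorithm matters (this is only a sketch, not the class code, and plain doubles top out around 15-16 digits anyway): the Leibniz series needs roughly a million terms for six digits, while a Machin-type formula gets all the digits a double can hold in about 25 terms.

```cpp
#include <cstdio>

// Illustrative only - two ways to approximate pi in double precision.
double arctan_series(double x)              // Taylor series, converges fast for small |x|
{
    double term = x, sum = x;
    for (int k = 1; k < 25; ++k) {
        term *= -x * x;
        sum  += term / (2 * k + 1);
    }
    return sum;
}

double pi_leibniz(long long terms)          // needs ~n terms for only ~log10(n) digits
{
    double sum = 0.0;
    for (long long k = 0; k < terms; ++k)
        sum += (k % 2 ? -4.0 : 4.0) / (2.0 * k + 1.0);
    return sum;
}

int main()
{
    // Machin's formula: pi/4 = 4*arctan(1/5) - arctan(1/239)
    const double machin = 4.0 * (4.0 * arctan_series(1.0 / 5.0) - arctan_series(1.0 / 239.0));
    std::printf("Leibniz, 1e6 terms: %.15f\n", pi_leibniz(1000000));
    std::printf("Machin,  ~25 terms: %.15f\n", machin);
}
```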
All in all I think your article could be very useful to me.
THIS is why I read Anandtech. I'll admit that I wasn't quite in the mood to read all the equations (I'll have to do that later), but really, these kinds of reviews make my day.
Hi Ian. Thanks for the nice article. I have one suggestion regarding the explicit finite difference code:
You could try to reorder the loops such that the memory access is more cache friendly. Right now 'pos' is incremented by NX (or even NX*NY in 3D) which will generate a lot of cache misses for large grids. If you switch the x and the y loop (in the 2D case) this can be avoided.
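As a sketch (NX, NY, the arrays cA/cB and the coefficient are assumed from your listing, and boundary handling is omitted), the reordered 2D update would look like this:

```cpp
// y outer, x inner: 'pos' walks memory contiguously, and the up/down neighbours
// (pos - NX, pos + NX) stream through the rows above and below linearly.
void step(const float* cA, float* cB, int NX, int NY, float k)
{
    for (int y = 1; y < NY - 1; ++y) {
        for (int x = 1; x < NX - 1; ++x) {
            const int pos = y * NX + x;
            cB[pos] = cA[pos] + k * (cA[pos - 1] + cA[pos + 1]
                                   + cA[pos - NX] + cA[pos + NX]
                                   - 4.0f * cA[pos]);
        }
    }
}
```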
Either way I order the loops, each point has to read one up, one down, one left and one right. My current code tries to keep three as consecutive reads and jump once, keeping the old jump in local memory. If I adjusted the loops, I could keep the one dimension in local memory, but I'd have to jump outside twice (both likely cache misses) to get the other data. I couldn't cache those two values as I never use them again in the loop iteration.
When I did this code on the GPU, one method was to load an XY block into memory and iterate in the Z-dimension, meaning that each thread per loop iteration only loaded one element, with a few of them loading another for the halo, but all cache aligned.
Yes, but when you access the array 'cA' at 'pos' the CPU will fetch the entire cache line (64 bytes in the case of your machine, i.e. 16 floats) of the corresponding memory address into the CPU cache. That means that subsequent accesses to say 'pos + 1', 'pos + 2' and so on will be served by the cache. Accessing an array in such a sequential manner is therefore fast.
However, when you access an array in a nasty way, e.g. 'NX + x' -> '2*NX + x', -> '3*NX + x', then each such access implies a trip to main memory if NX happens to be large.
That you need to move up / down and sideways in memory does not matter. When you write down the accesses of the code with the reordered loops you will notice that they just access three "lines" in memory in a cache friendly way.
Not reusing the old values of the last iteration should not affect performance in a measurable way. Even if the compiler fails to see this optimization, the accesses will be served by the L1 cache.
Btw, did you allocate the array having NUMA in mind, i.e. did you initialize your memory in an OpenMP loop with the same access pattern as used in the algorithms? I am a bit surprised by the bad performance of your dual Xeon system.
Memory was allocated via the new command as it is 1D. When using a 2D array the program was much slower. I was unaware you could allocate memory in an OpenMP way, which thinking about it could make the 2D array quicker. I also tried writing the code using the PPL and lambdas, but that was also slower than a simple OpenMP loop.
I'm coming at these algorithms from the point of view of a non-CompSci interested in hardware, and the others in the research group were chemists content to write single threaded code on multi-core machines. Transferring the OpenMP variations of that code from a 1P to a 2P, as the results show, gives variable results depending on the algorithm.
There are always ways to improve the efficiency of the code (and many ways to make it unreadable), but for a large part moving to the 2P system all depends on how your code performs. Please understand that my examples are within my limits of knowledge and representative of the research I did :) I know that SSE2/SSE4/AVX would probably help, but I have never looked into those. More often than not, these environments are all about research throughput, so rather than spend a few weeks to improve efficiency by 10% (or less), they'd rather spend that money getting a faster system which theoretically increases the throughput of the same code by 100%.
I'll have a look at switching the loops if I write an article similar to this in the future :)
Thank you for the detailed answer. I very much appreciate your article and hope to see more stuff like this on Anandtech.
What I meant regarding NUMA is the following. When you have a dual socket Xeon you have two memory controllers. The first time you 'touch' a memory location it is assigned to the memory controller of the CPU that runs the current thread. This assignment is in general permanent and all further memory reads/writes to that location will be served by that memory controller.
If you first-touch (e.g. initialize the array to zero) using one thread, then the whole array is assigned to one of the two memory controllers. When you then run the multi-threaded code on that array one memory controller is idle while the other is oversubscribed since it has to serve both CPUs.
In contrast, if you first-touch your array in an OpenMP loop and use the same access pattern as in the algorithm, you will benefit from both memory controllers later on. In this case your large array is correctly 'distributed' over both memory controllers.
This kind of memory layout optimization becomes extremely important when you deal with quad socket Opterons. You then have eight memory controllers. A NUMA aware code is therefore up to eight times as fast since it utilizes all memory controllers.
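In code, first-touch is nothing more than initialising the array in an OpenMP loop with the same schedule as the compute loop - a sketch, assuming a plain 1D float array as in your code:

```cpp
#include <omp.h>

// Sketch: allocate and first-touch in parallel so the pages end up on the
// memory controller of the thread that will later work on them.
float* numa_aware_alloc(int n)
{
    float* a = new float[n];
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        a[i] = 0.0f;
    return a;
}
```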
Hmm, the idea that the scheduler just randomly shuffles threads across NUMA nodes is definitely not true, at least for Windows Server 2008 R2 / Windows 7, and I am sure some versions of Linux handle it properly too (I am not a Linux expert).
The Windows Server 2008 R2 / Windows 7 scheduler will try to match memory allocations (even if they are not tagged for a specific NUMA node) with the NUMA node the process/thread resides on, and it will not move a thread to a foreign NUMA node unless that has been explicitly requested by the application (by setting the thread affinity).
Of course, without explicit NUMA node tagging when doing allocations, application code is the main culprit for not respecting the NUMA layout (e.g. creating a bunch of threads, allocating memory from one of them, and then pinning the threads to different CPUs - you will have lots of LLC requests served from remote DRAM because the memory was a-priori allocated on one node).
I describe how I extracted more than double the performance by careful, NUMA-aware memory allocation - please note that neither the Windows nor the Linux scheduler is able to cope with code which is not written to be NUMA aware and which uses a large number of threads that are supposed to run on all CPUs. Simply put, the application writer has to manage memory allocation and usage so that there are as few remote DRAM requests as possible.
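On Windows, the allocation side can also be made explicit - a sketch (error handling omitted, and the helper name is made up):

```cpp
#define _WIN32_WINNT 0x0600   // Vista / Server 2008 or later for VirtualAllocExNuma
#include <windows.h>

// Sketch: reserve+commit a buffer whose physical pages should come from a given
// NUMA node. Pages are still only backed on first access, so touch them from
// a thread running on that node.
void* alloc_on_node(SIZE_T bytes, DWORD node)
{
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
}
```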
About the Windows scheduler - I only worked with Windows XP, and now I don't have any reason to work with Windows anymore, so what you say probably is true. As for the Linux versions - well, long story short, CFS sucks and everyone knows it. This is particularly noticeable if you have fully virtualized VMs, which appear as one single process on the host system - the process is randomly swapped between CPU cores and even CPU dies... sad story. That's why people have to pin their tasks to CPUs manually.
Ah, XP - that explains it. True, XP did not care about NUMA at all.
Windows Server 2008 / Vista introduced NUMA-aware memory allocations, and changed their CPU scheduler so it does not move the thread across NUMA nodes. They will also try to allocate the memory from the thread's own NUMA node when legacy VirtualAlloc etc. APIs are used.
Windows Server 2008 R2 / Windows 7 introduced the concept of CPU groups, allowing more than 64 CPUs. This does require some adaptation of the application, as the old threading APIs only work with a 64-bit affinity bitmask, which only allowed addressing 64 CPUs. Now there is a new set of APIs that work with the GROUP_AFFINITY structure, allowing control of CPU groups too. However, this needs an explicit change from the legacy process/threading APIs to the new ones.
Furthermore, none of the above can replace some manual intervention* - while the Windows scheduler will indeed respect NUMA node boundaries and not try to mess around with moving threads across them, it still does not know what the underlying algorithm wants to do.
* There is no need to set the thread affinity to one specific CPU anymore - this prevents running the thread on any other CPU completely. Instead, there is an API called SetThreadIdealProcessor(Ex) which signals Windows scheduler that thread >should< run on that particular CPU - but, under certain circumstances the scheduler can move the thread somewhere else - if the CPU is completely taken away by some other thread/process. Scheduler will try to move the thread as close as possible - to the next core in the socket, for example - or to the next core in the group (group is always contained within a NUMA node).
You can, however, absolutely forbid Windows scheduler from passing the thread to another NUMA node under any circumstances by simply getting the said NUMA node affinity mask (GetNumaNodeProcessorMask(Ex)) and setting this affinity as a thread affinity. This + setting the "ideal" processor still gives Windows scheduler some headroom to move the thread to another core if it is found to be better in a given moment, but it will not even attempt to cross the NUMA boundary in any case whatsoever.
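Put together, the node pinning described above boils down to something like this sketch (Windows 7 / Server 2008 R2 APIs; error checks kept minimal and the helper name is made up):

```cpp
#define _WIN32_WINNT 0x0601   // Windows 7 / Server 2008 R2 APIs
#include <windows.h>

// Sketch: constrain the calling thread to one NUMA node and hint an ideal core
// inside that node, leaving the scheduler some freedom within the node.
bool bind_thread_to_node(USHORT node, BYTE idealCoreInGroup)
{
    GROUP_AFFINITY nodeAffinity = {};
    if (!GetNumaNodeProcessorMaskEx(node, &nodeAffinity))
        return false;

    // Hard limit: never leave this NUMA node...
    if (!SetThreadGroupAffinity(GetCurrentThread(), &nodeAffinity, nullptr))
        return false;

    // ...but only hint the preferred core.
    PROCESSOR_NUMBER ideal = {};
    ideal.Group  = nodeAffinity.Group;
    ideal.Number = idealCoreInGroup;
    return SetThreadIdealProcessorEx(GetCurrentThread(), &ideal, nullptr) != FALSE;
}
```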
While I haven't personally researched them, there are tons of other schedulers that have been written for Linux and I'm certain *at least* one of them is more fitting to this line of work. I've heard of alternatives like BFS and the Linux kernel is so widely used I'm sure there's a gem out there for this application.
Have you ever tried the Intel Math Kernel Library? It might speed up some of the equations. It also hands off work to the Intel MIC card if it thinks it will speed it up.
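For the dense linear algebra parts it's only a few lines - a sketch (matrix size, layout and values are made up for illustration):

```cpp
#include <mkl.h>
#include <vector>

// Sketch: C = A * B using MKL's threaded DGEMM instead of hand-rolled loops.
void multiply(int n)
{
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,              // M, N, K
                1.0, A.data(), n,     // alpha, A, lda
                     B.data(), n,     // B, ldb
                0.0, C.data(), n);    // beta, C, ldc
}
```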
The GA-7PESH1 motherboard is $855, and the CPU's are $2020 each, which adds up to $4895. On tasks which don't parallelize well, you can get similar performance from the i7-3770K, which costs an order of magnitude less. (Prices: i7-3770K $320, ASRock Z77 Extreme6 motherboard $152, total from motherboard and CPU $472.) On tasks which parallelize well enough that they can be run on a GPU, the system with the GA-7PESH1 will beat the i7-3770K, but will be crushed by a midrange GPU. So the price/performance of this system is pretty bad unless you throw just the right workload at it.
The motherboard price is from super-laptop-parts dot com, and the other prices are from a major online retailer that I won't name in order to get around the spam filter.
The K version may not, but the standard i7-3770 does in fact support VT-d, TXT and ECC memory from the get go. VT-d also has to be supported by the motherboard, which may be problematic on consumer motherboards. I have an i5-2400 myself, and VT-d is a pain to set up; to this day I still haven't found out whether it is that I am unable to set up Xen properly or just that my cheap motherboard won't support VT-d well enough to properly assign a video card to a virtual machine.
The 3770K lacks those features, but that doesn't invalidate my point.
Using ECC memory improves system availability, and likely decreases the probability of undetected errors resulting in incorrect computations. If these are important to you, then you should be thinking about full double or triple redundancy. Why not buy three 3770K based systems and run the same simulation on all of them? Most of the time you will get identical results on all three systems, but on rare occasions one of the systems will die during the run. No problem; you have the simulation results from the other two systems. On even rarer occasions, one of the systems will produce an incorrect result due to an undetected bit error. Again no problem; you take the results from the two simulations that agree.
With full redundancy it doesn't matter where in the system the error occurs because full redundancy addresses faults anywhere in the system. This makes it superior to ECC memory, which only addresses faults in the memory subsystem. So the only reason to go with ECC memory instead of full triple redundancy is if the ECC memory approach costs less. Based on the numbers I posted, you aren't going to get a lower cost based on hardware costs alone. Possibly you could get there by including administrative costs and the like.
I'm not saying that the system Ian tested wouldn't make sense under *any* circumstances. My point is that the system has a poor price performance ratio, so it only makes sense when a lot of things are working in its favor.
The second feature you mention is VT-D, which makes it more efficient to emulate device hardware in virtual machines. I don't have any benchmarks, but my guess is that the performance improvement from VT-D is fairly small. In any case, if you want VT-D you can buy the 3770 rather than the 3770K. You can't overclock the 3770, but my comments about the 3770K offering "similar performance" were based primarily on the performance of the 3770K at stock frequency. If you assume that everyone is going to take the time to find an optimal overclock for their CPU, then the E5-2690 (which cannot be overclocked) looks even worse.
I suppose it's off topic to debate the merits of "trusted execution technology" here, so I will simply note that if for whatever reason you want a processor that supports it, the solution is the same as for VT-D: get the 3770 instead of the 3770K.
A very well written article that sticks to its purpose: scientific computing. Really pleased to see articles like this on the site, even if I have a few minor quibbles.
On page 2 "To those unfamiliar with server boards, of note is the connector just to the right of center of the picture above." is either oddly worded to describe the front panel connector at the bottom the board (which is indeed right of center but not in the center of the picture) or describing a connector that isn't even documented in the manual. For clarification I'm looking at the connector just right of the top PCI-E 16x slot (above and to the left of the battery). Actually, what is that connector labeled as? I've seen it on other Xeon boards but have never seen it used.
The last paragraph on page 2, as written, omits the possibility of unbuffered ECC memory and implies the usage of unbuffered non-ECC memory. I haven't found confirmation that this board can accept unbuffered, non-ECC memory (as opposed to the possibility of an ECC requirement, as some server vendors enforce).
A couple of notes on the little processor talk on page 6. Dealing with cache thrashing between L3 and L2 is possible, but when dealing with a high number of threads general coherency becomes a bigger factor. The overhead begins to exceed the benefit of having the additional hardware to run them. If you're lucky enough to be dealing with an algorithm that doesn't need such coherency overhead, then chances are it is very well suited to GPU compute (provided memory capacity isn't a factor). A minor nicety would have been to see some more testing without Hyper-Threading on the i7-3770K, i7-3930K, and i7-3960X to better indicate scaling with/without Hyper-Threading. I suspect that those single socket processors would have been able to show some small gains with Hyper-Threading where the dual socket system did not.
An extension to the L2/L3 cache talk on page 6 is the move to dual sockets and NUMA. There is a performance penalty due to latency for having one thread access memory that is found on a remote socket. Memory mirroring between sockets can eliminate that remote penalty while increasing RAS but at the cost of halving effective memory capacity. The manual isn't clear if mirroring mode or the lockstep mode is across different sockets (it can be done across memory channels as well).
I also would have loved to hear some comparisons with the Gigabyte GA-X79S-UP5. While the name implies an X79 chipset, it uses the C606 chipset. It'll support ECC memory with socket 2011 Xeons and plenty of overclocking features (for the daring). Comparing the GA-7PESH1 to the GA-X79S-UP5 would have been able to answer whether the move to dual sockets would have been worth the extra cost.
Part of my criticism isn't about the article itself but rather the general state of massively multithreaded hardware and software. The hardware side is quickly running into software limitations that were never expected to be reached in the professional space. A decade ago, who thought that a scientist could purchase a 240 simultaneous thread processor that would fit on a mere expansion card? In some cases we don't even reach the Amdahl's Law limit before hitting an artificial barrier due to scheduling or coherency overhead.
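For reference, Amdahl's Law bounds the speedup of a code with parallel fraction p on N threads at S(N) = 1 / ((1 - p) + p/N); as an illustrative number, a 95% parallel code tops out around 12.5x on 32 threads (1 / (0.05 + 0.95/32) ≈ 12.5), and scheduling or coherency overhead can stop it well before that.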
I just noticed that the system was using Win 7 Professional, which has a limit of 64 concurrent threads per process. A quad socket LGA 2011 config would actually be at the very limit of what Windows 7 (or rather 2008 R2, since Professional only scales to two sockets) can handle. OpenMP can handle more than 64 concurrent threads, but on Windows it has to submit to this limitation.
As for the GA-X79S-UP5, clocking features only work with 1P Xeons, which are basically similar to the HEDT i7 (36xx) line. With those, the customer has the advantage of ECC RAM support and still some overclocking headroom.
Clocking 2P/4P Xeons E5 (sadly, these are the only 8-core parts so far) is next to impossible due to the lack of ICC configuration data allowing BCLK ratios to be changed. These Xeons can only be bumped by direct BCLK increase, which is dangerous above a few MHz. At most, 5-6 MHz is feasible, as tested on the ASUS Z9PE-D8 WS and EVGA SR-X boards.
Memory overclocking is another matter completely. I have excellent results with Samsung's 1.35v ("low voltage") ECC RAM. It is not just the cheapest 16 GB ECC option (~$160 for the 16 GB ECC stick last time I checked; I got mine for 140 EUR in Germany 7 months ago), but it is also the fastest while still keeping the low voltage. This RAM can be overclocked to 2133 MHz by a simple voltage bump to 1.55v, which is still within the Xeon's VSA limits.
Weird that Intel doesn't provide the ICC configuration data. The 'gear ratio' change is something I'd still expect to work on true X79 boards regardless of the processor (I can see Intel crippling this on the C600 series). Then again, I've heard of some weird situations with LGA 2011 Xeons in desktop boards. There are some scattered reports of unlocked chips, but as the internet goes there is lots of speculation and rumor and little real confirmation.
Those Samsung 16 GB ECC sticks are registered? I thought that the GA-X79S-UP5 didn't support registered DIMMs.
As for the ability to overclock those low voltage DIMMs, I'm not really surprised, as they've historically been impressive in that regard. I have some older 4 GB 1.35v DDR3-1333 rated sticks that can go to 1866 MHz at 1.5v. :) The timings had to be changed, but still impressive.
From the looks of the results, some of the benchmarks are HIGHLY sensitive to effective bandwidth per thread (i.e. GDDR5 feeding a GPU stream processor >> DDR3 feeding a Xeon HT core).
However - it must be noted that 8x DIMMS is insufficient to achieve full memory bandwidth on Xeon E5 2S!
I'd suggest throwing a pure memory bandwidth test into the mix to make sure you're actually getting the rated number (51.2GB/s)...
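Something like a STREAM-style triad would do - a sketch (sizes and schedule are illustrative, and the arrays are first-touched in parallel so both memory controllers get used):

```cpp
#include <omp.h>
#include <cstdio>

// Rough STREAM-triad style bandwidth estimate: reads two arrays, writes one.
int main()
{
    const int N = 1 << 27;                       // ~134M doubles per array (~1 GB each)
    double* a = new double[N];
    double* b = new double[N];
    double* c = new double[N];

    #pragma omp parallel for schedule(static)    // first-touch in parallel (NUMA)
    for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) a[i] = b[i] + 3.0 * c[i];
    const double t1 = omp_get_wtime();

    std::printf("Triad: %.1f GB/s\n", 3.0 * N * sizeof(double) / 1e9 / (t1 - t0));
    delete[] a; delete[] b; delete[] c;
}
```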
Great article, thanks! This is the sort of benchmark I've been wanting to see for quite some time now - simple, brute-force numerics where the code is visible and straightforward. Too many benchmarks are black boxes with processor- and compiler-specific tunes to make manufacturer "X" appear superior to "Y". That said, it would be most illustrative to perform a similar 'mark using vanilla gcc on both MS and *nix OS.
It is a long-known issue that Windows sometimes does not start after changing hardware, especially the GPU (though not always). There is an equally long-known trick for it. Just before the last power-off, replace the GPU's own driver with the basic Microsoft one - in the case of the GPU that is the "Standard VGA Adapter" (Device Manager - Update Driver - Browse my computer - Let me pick). In fact, one can replace all the device-specific drivers in the OS with the similar basic ones from Microsoft and then put that hard drive into virtually any system without any need for a fresh install. Mind you, after the first boot it takes some time for the OS to find and install the specific drivers.
In the early '70s I was doing very similar simulations using a PDP 11/40 minicomputer. (I can send citations to my publications if anyone is interested.) At Texas Tech and later at Caltech, I simulated systems involving heterogeneous electron transfer kinetics, various chemical reactions in solution, coulostatics, galvanostatics, voltammetry, chronocoulometry, AC voltammetry, migration, double layer effects, solution hydrodynamics (laminar only), etc. Much of this was done on a PDP 11/40, originally with 8K words (= 16K bytes) of core memory. Later the machine was upgraded to 24 K words (!), we got a floating point board, and a hard disk drive (5 M words, IIRC). My research director probably paid in excess of $50K for the hardware. One cute project was to put a simulation "inside" a nonlinear regression routine to solve for electrode kinetic parameters such as k and alpha. Each iteration of the nonlinear solver required a new simulation -- hand-coding the innermost loops using floating point assembly instructions was a big speedup!
I wonder how the old PDP would stack up against the 3770?
Do you guys think that once Haswell moves the VRM on package that someone might do a 2 socket mATX board?
Even if it means giving up 2 of the 4 PCIe slots and/or 2 DIMMs per socket it would be nice to have a high core count standard SFF board for those that need just that.
I found this review interesting, but I don't think this board is really targeted at the HPC market. It seems like it would be good as part of a 2U / 12 + 2 drive system, similar to the Dell C2100. It would make a good virtual host, SQL server, active web server etc. Having the 3 mSAS connectors would enable 4 drives each without the need for a SAS expander.
Servers are designed for 99.999% uptime, remote management, and hands-off operation. To achieve that you need redundant power, UPS, networking, storage etc. They also require high airflow, which is noisy and not something you want sitting under your desk. Based on that, it makes sense that the MB is intended for sale to system builders, not your general build-your-own enthusiast.
HW manufacturers are faced with a similar problem to airlines - consumers gravitate to the cheapest price, and so the only real money to be made is selling higher profit margin products to businesses. Servers are where Intel etc. make their profits.
For the computational problems the author is trying to solve, to me it would seem better to consider: a) At one point, I think Google was using commodity hardware with custom shelving etc. Assuming the algorithms can be parallelized across different hosts, you shouldn't need the reliability of traditional servers, so why not use a number of commodity systems together, choosing the components that have the best perf/$.
b) There are machines designed for HPC scenarios, such as HPC Systems E5816 that supports 8x Xeon E7-8000 (10 core) processors, or the E4002G8 - that will take 8 nVidia Tesla cards.
c) What about developing and testing the software on cheap workstations, and then, when you are sure it's ready, buying compute time from Amazon cloud services etc.?
It is quite delightful to see a review like this on AnandTech (especially when I am using software and computer configurations of a similar nature for my studies), as it is quite difficult for me to evaluate the performance gain of "real-life" software (i.e. science oriented in my case) on new hardware before buying.
From what I have seen in the code segments provided (especially for the n-body simulation part), there are a large number of floating-point divisions. Is there any possibility that the code is limited not only by the cache size (and thrashing), but also by the limited throughput of the floating-point divider? (i.e. the performance degradation when HT is enabled may also be caused by the two running threads competing for the only floating-point divider in the core)
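For what it's worth, the usual way to ease that pressure is to fold the per-pair divisions into a single reciprocal square root plus multiplies - a sketch of a typical n-body inner step (not the AMP sample itself):

```cpp
#include <cmath>

// Sketch: accumulate one body's pull on another with a single sqrt/divide;
// the cube of the inverse distance is built from multiplications.
inline void accumulate_force(float dx, float dy, float dz, float mass, float softening,
                             float& ax, float& ay, float& az)
{
    const float distSqr  = dx * dx + dy * dy + dz * dz + softening;
    const float invDist  = 1.0f / std::sqrt(distSqr);
    const float invDist3 = invDist * invDist * invDist;
    const float s = mass * invDist3;
    ax += dx * s;  ay += dy * s;  az += dz * s;
}
```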
It would be great if you posted zipped sources and exes for anyone to follow, learn from, play with, argue over and eventually improve.
I'd also have preferred to see Fortran sources and benchmarks where possible.
Intel/AMD should start promoting 2/4/8-socket monster mobos for enthusiasts and then the general public, since this is the beginning of a never-ending era of multiprocessing.
Also, where are the game benchmarks - for example GTA4, which benefits a lot from multiple cores as well as from GPUs?
The n-body simulations are part of the C++ AMP example page, free for everyone to use. The rest of the code is part of a benchmark package I'm creating, hence I only give the loops in the code. Unfortunately I know no Fortran for benchmarks.
Most mainstream users (i.e. gamers) still debate whether 4 or 6 cores are even necessary, so moving to 2P/4P/8P is a big leap in that regard. Enthusiasts can still get the large machines (a few folders use quad AMD setups) if they're willing to buy from ebay which may not always be wholly legal. You may see 2P/4P/8P becoming more mainstream when we start to hit process node limits.
I don't know if I speak for everyone, but I would really love to see some gaming benchmarks.
I realize that this system is not designed or optimized for gaming, but it would be interesting nonetheless to see what two processors does, or does not do for gaming. :)
I'm very curious as to the result. Have you tried benching one 2690 vs two 2690s? It's almost like the benchmarks are limited by CPU frequency instead of the number of threads/cores.
Ian - I really enjoyed reading your effort here. There is a large, and I think underserved, community of scientific users who need this kind of information. Digging through IEEE/ACM communications is often just too much. Doing the work here - or anywhere - is a real help.
I'm a retired economist (PhD Chicago, '81) and (in my case) thankfully haven't done physical, much less computational, chemistry since undergrad. Nevertheless, we have similar technical needs.
I've become a huge fan of open source software. In my "home lab," which my wife calls The Frat House, some grad students and I have been diligently working with the R Language (statistics), nascent risk and optimization tools, and a mash-up of database, data warehouse, and "business intelligence" tools, all open source. The goal someday -- beat SAS silly and obviate that $100-300K price tag!
The more demure and do-able daily work is just cleaning up and optimizing open source code, contributing that back as individual packages. The "hits" and email indicate a good adoption rate.
Ian, CUDA is of big interest to the people we're in communication with, and I have to admit some real fascination personally. As you have real-world experience, how about a series of articles? I hope ANANDTECH would support that work!!
Very best to you - hope to be "reading" you soon!!
Hulk - Saturday, January 5, 2013 - link
I had no idea you were so adept with mathematics. "Consider a point in space..." Reading this brought me back to Finite Element Analysis in college! I am very impressed. Being a ME I would have preferred some flow models using the Navier-Stokes equations, but hey I like chemistry as well.IanCutress - Saturday, January 5, 2013 - link
I never did any FEM so wouldn't know where to start. The next angle of testing would have been using a C++ AMP Fluid Dynamics Simulation and adjusting the code from the SDK example like with the n-Body testing. If there is enough interest, I could spend a few days organising it for the normal motherboard reviews :)Ian
mayankleoboy1 - Saturday, January 5, 2013 - link
How the frick did you get the i7-3770K to *5.4GHZ* ? :shock:How the frick did you get the i7-3770K to *5.0GHZ* ? :shock:
IanCutress - Saturday, January 5, 2013 - link
A few members of the Overclock.net HWBot team helped testing by running my benchmark while they were using DICE/LN2/Phase Change for overclocking contests (i.e. not 24/7 runs). The i7-3770K will go over 7 GHz if (a) you get a good chip, (b) cool it down enough, and (c) know what you are doing. If you're interested in competitive overclocking, head over to HWBot, Xtreme Systems or Overclock.net - there are plenty of people with info to help you get started.Ian
JlHADJOE - Tuesday, January 8, 2013 - link
The incredible performance of those overclocked Ivy bridge systems here really hammers home the importance of raw IPC. You can spend a lot of time optimizing code, but IPC is free speed when it's available.jd_tiger - Saturday, January 5, 2013 - link
http://www.youtube.com/watch?v=Ccoj5lhLmSQsmonsees - Saturday, January 5, 2013 - link
You might try modifying your algorithm to pin the data to a specific core (therefore cache) to keep the thrashing as low as possible. Google "processor affinity c++". I will admit this adds complexity to your straightforward algorithm. In C#, I would use a parallel loop with a range partition to do it as a starting point: http://msdn.microsoft.com/en-us/library/dd560853.a...nickgully - Saturday, January 5, 2013 - link
Mr. Cutress,Do you think with all the virtualized CPU available, researchers will still build their own system as it is something concrete to put into a grant application, versus the power-by-the-hour of cloud computing?
Thanks.
IanCutress - Saturday, January 5, 2013 - link
We examined both scenarios. Our university had cluster time to buy, and there is always the Amazon cloud. In our calculation, getting a 16 thread machine from Dell paid for itself in under six months of continuous running, and would not require a large adjustment in the way people were currently coding (i.e. staying in Windows rather than moving to Linux), and could also be passed down the research group when newer hardware is released.If you are using production level code and manipulating it each time to get results, and you can guarantee the results will be good each time, then power-by-the-hour could work. As we were constantly writing and testing new code for different scenarios, the build/buy your own workstation won out. Having your own system also helps in building GPU codes, if you want to buy a better GPU card it is easier to swap out rather than relying on a cloud computing upgrade.
Ian
jtv - Sunday, January 6, 2013 - link
One big consideration is who the researchers are. I work in x-ray spectroscopy (as a computational theorist). Experimentalists in this field use some of our codes without wanting to bother with having big computational resources. We have looked at trying to provide some of our codes through some cloud-based service so that it can be used on demand.Otherwise I would agree with Ian's reply. When I'm improving code, debugging code, or trying to implement new theoretical approaches I absolutely want my own hardware to do it on.
mayankleoboy1 - Saturday, January 5, 2013 - link
Ian :How much difference do you think Xeon Phi will make in these very different type of Computations?
Will buying a Xeon Phi "pay itself out" as you said in the above comments ? (or is xeon phi linux only ?)
IanCutress - Saturday, January 5, 2013 - link
As far as we know, Xeon Phi will be released for Linux only to begin with. I have friends who have been able to play with them so far, and getting 700 GFlops+ in DGEMM in double precision.It always comes down to the algorithm with these codes. It seems that if you have single precision code that doesn't mind being in a 2P system, then the GPU route may be preferable. If not, then Phi is an option. I'm hoping to get my hands on one inside H1 this year. I just have to get my hands dirty with Linux as well.
In terms of the codes used here, if I were to guess, the Implicit Finite Difference would probably benefit a lot from Xeon Phi if it works the way I hope it does.
Ian
mayankleoboy1 - Saturday, January 5, 2013 - link
Rather stupid question, but have you tried using PGO builds ?Also, do you build the code with the default optimizations, or use the MSVC equivalent switch of -O2 ?
IanCutress - Saturday, January 5, 2013 - link
Using Visual Studio 2012, all the speed optimisations were enabled including /GL, /O2, /Ot and /fp:fast. For each part I analysed the sections which took the most time using the Performance Analysis tools, and tried to avoid the long memory reads. Hence the Ex-FD uses an iterative loading which actually boosts speed by a good 20-30% than without it.Ian
Klimax - Sunday, January 6, 2013 - link
Interesting. Why not Ox (all optimisations on)BTW: Do you have access to VTune?
IanCutress - Wednesday, January 9, 2013 - link
In case /Ox performs an optimisation for memory over speed in an attempt to balance optimisations. As speed is priority #1, it made more sense to me to optimise for that only. If VS2012 gave more options, I'd adjust accordingly.Never heard of VTune, but I did use the Performance Analysis tools in VS2012 to optimise certain parts of the code.
Ian
Beenthere - Saturday, January 5, 2013 - link
Business and mobo makers do not use 2P mobos to get high benches or performance bragging rights per se. These systems are build for bullet-proof reliability and up time. It does no good for a mobo/system to be 3% faster if it crashes while running a month long analysis. These 2P mobos are about 100% reliability, something rarely found in a enthusiasts mobo.Enterprise mobos are rarely sold by enthusiast marketeers. Newegg has a few enterprise mobos listed primarily because they have started a Newegg Biz website to expand their revenue streams. They don't have much in the line of true enterprise hardware however. It's a token offering because manufacturers are not likely to support whoring of the enterprise market lest they lose all of their quality vendors who provide customer technical product support.
psyq321 - Sunday, January 6, 2013 - link
Actually, ASUS Z9PE-D8 WS allows for some overclocking capabilities.CPU overclocking with 2P/4P Xeon E5 (2600/4600 sequence) is a no-go because Intel explicitly did not store proper ICC data so it is impossible to manipulate BCLK meaningfully (set the different ratios). Oh, and the multipliers are locked :)
However, Z9PE D8 WS allows memory overclocking - I managed to run 100% 24/7 stable with the Samsung ECC 1600 DDR3 "low voltage" RAM (16 GB sticks) - just switching memory voltage from 1.35v to 1.55v allows overclocking memory from 1600 MHz to 2133 MHz.
Why would anyone want to do that in a scientific or b2b environment? The only usage I can see are applications where memory I/O is the biggest bottleneck. Large-scale neural simulations are one of such applications, and getting 10 GB/s more of memory I/O can help a lot - especially if stable.
Also, low-latency trading applications are known to benefit from overclocked hardware and it is, in fact, used in production environment.
Modern hardware does tend to have larger headrooms between the manufacturer's operating point and the limits - if the benefit from an overclock is more benefitial than work invested to find the point where the results become unstable - and, of course, shorter life span of the hardware - then, it can be used. And it is used, for example in some trading scenarios.
Drazick - Saturday, January 5, 2013 - link
Will You, Please, Update Your Google+ Page?It would be much easier to follow you there.
Ryan Smith - Saturday, January 5, 2013 - link
Our Google+ page is just a token page. If you wish to follow us then your best option is to follow our RSS feeds.toyotabedzrock - Saturday, January 5, 2013 - link
There is a large number of very smart people on Google+. You really should come join us.JlHADJOE - Tuesday, January 8, 2013 - link
Of course there are lots of smart people on G+! You're all google employees right? =PActivate: AMD - Saturday, January 5, 2013 - link
As a fellow chemist, I must say that you have to be some kind of nut to want to do computational/physical chemistry. If you need me, I'll be at the bench!Good article too!
engrpiman - Saturday, January 5, 2013 - link
I didn't read the article in full but what I did read was top notch. I found your simulations and mathematics very interesting. I took a Physics class which was focused in writing code to run mathematical simulations . Using the given java lib. I wrote my own code to calculate PI. When I returned from the gym the program had calculated 3.1 . I then re-wrote the program from scratch and ditched the built in libs. and reran. I had 20 decimals in 30 sec it was an epic improvement.All in all I think your article could be very useful to me.
Thanks for writing.
SodaAnt - Saturday, January 5, 2013 - link
THIS is why I read Anandtech. I'll admit that I wasn't quite in the mood to read all the equations (I'll have to do that later), but really, these kind of reviews make my day.Cardio - Saturday, January 5, 2013 - link
Wonderful review...as always. ThanksHakon - Saturday, January 5, 2013 - link
Hi Ian. Thanks for the nice article. I have one suggestion regarding the explicit finite difference code:You could try to reorder the loops such that the memory access is more cache friendly. Right now 'pos' is incremented by NX (or even NX*NY in 3D) which will generate a lot cache misses for large grids. If you switch the x and the y loop (in the 2D case) this can be avoided.
IanCutress - Saturday, January 5, 2013 - link
Either way I order the loops, each point has to read one up, one down, one left and one right. My current code tries to keep three as consecutive reads and jump once, keeping the old jump in local memory. If I adjusted the loops, I could keep the one dimension in local memory, but I'd have to jump outside twice (both likely cache misses) to get the other data. I couldn't cache those two values as I never use them again in the loop iteration.When I did this code on the GPU, one method was to load an XY block into memory and iterate in the Z-dimension, meaning that each thread per loop iteration only loaded one element, with a few of them loading another for the halo, but all cache aligned.
I hope that makes sense :)
Ian
Hakon - Saturday, January 5, 2013 - link
Yes, but when you access the array 'cA' at 'pos' the CPU will fetch the entire cache line (64 byte in case of your machine, i.e. 16 floats) of the corresponding memory address into the CPU cache. That means that subsequent accesses to say 'pos + 1', 'pos + 2' and so on will be served by the cache. Accessing an array in such a sequential manner is therefore fast.However, when you access an array in a nasty way, e.g. 'NX + x' -> '2*NX + x', -> '3*NX + x', then each such access implies a trip to main memory if NX happens to be large.
That you need to move up / down and sideways in memory does not matter. When you write down the accesses of the code with the reordered loops you will notice that they just access three "lines" in memory in a cache friendly way.
Not reusing the old values of the last iteration should not affect performance in a measurable way. Even if the compiler fails to see this optimization, the accesses will be served by the L1 cache.
Btw, did you allocate the array having NUMA in mind, i.e. did you initialize your memory in an OpenMP loop with the same access pattern as used in the algorithms? I am a bit surprised by the bad performance of your dual Xeon system.
IanCutress - Saturday, January 5, 2013 - link
Memory was allocated via the new command as it is 1D. When using a 2D array the program was much slower. I was unaware you could allocate memory in an OpenMP way, which thinking about it could make the 2D array quicker. I also tried writing the code using the PPL and lambdas, but that was also slower than a simple OpenMP loop.I'm coming at these algorithms from the point of view of a non-CompSci interested in hardware, and the others in the research group were chemists content to write single threaded code on multi-core machines. Transferring the OpenMP variations of that code from a 1P to a 2P, as the results show, give variable results depending on the algorithm.
There are always ways to improve the efficiency of the code (and many ways to make it unreadable), but for a large part moving to the 2P system all depends on how your code performs. Please understand that my examples being within my limits of knowledge and representative of the research I did :) I know that SSE2/SSE4/AVX would probably help, but I have never looked into those. More often than not, these environments are all about research throughput, so rather than spend a few week to improve efficiency by 10% (or less), they'd rather spend that money getting a faster system which theoretically increases the same code throughput 100%.
I'll have a look at switching the loops if I write an article similar to this in the future :)
Ian
Hakon - Saturday, January 5, 2013 - link
Thank you for the detailed answer. I very much appreciate your article and hope to see more stuff like this on Anandtech.What I meant regarding to NUMA is the following. When you have a dual socket Xeon you have two memory controllers. The first time you 'touch' a memory location it is assigned to the memory controller of the CPU that runs the current thread. This assignment is in general permanent and all further memory read/writes to that location will be served by that memory controller.
If you first-touch (e.g. initialize the array to zero) using one thread, then the whole array is assigned to one of the two memory controllers. When you then run the multi-threaded code on that array one memory controller is idle while the other is oversubscribed since it has to serve both CPUs.
In contrast, if you first-touch your array in an OpenMP loop and use the same access pattern as in the algorithm, you will benefit from both memory controllers later on. In this case your large array is correctly 'distributed' over both memory controllers.
This kind of memory layout optimization becomes extremely important when you deal with quad socket Opterons. You then have eight memory controllers. A NUMA aware code is therefore up to eight times as fast since it utilizes all memory controllers.
toyotabedzrock - Saturday, January 5, 2013 - link
You should go ask the people on the assembly boards for help with making your code faster.They are very friendly compared to a Linux kernel devs, I think they just enjoy the acknowledgement that they still exist and are useful.
snajpa - Saturday, January 5, 2013 - link
Blame the scheduler. Neither Windows nor linux can effectively handle larger NUMA systems. It randomly moves the process across the physical hardware.psyq321 - Sunday, January 6, 2013 - link
Hmm, this is definitely not true at least for Windows Server 2008 R2 / Windows 7, and I am sure it holds true for some versions of Linux (I am not a Linux expert).Windows Server 2008 R2 / Windows 7 scheduler will try to match the memory allocations (even if they are not tagged for a specific NUMA node) with the NUMA node the process/thread resides on, and they will not move a thread to a foreign NUMA node unless if that has been explicitly requested by the application (by setting the thread affinity)
Of course, without explicit NUMA node tagging when doing allocations, application code is the main culprit for not respecting the NUMA layout (e.g. creating bunch of threads, allocating memory from one of them - and then pinning the threads to different CPUs - you will have lots of LLC requests from remote DRAM because memory was a-priori allocated on one node).
For this - some sane coding helps a lot, here:
http://www.dimkovic.com/node/15
I describe how I extracted more than double performance by careful memory allocation (NUMA-aware) - please note that neither Windows nor Linux scheduler is able to cope with code which is not written to be NUMA aware and it is using large number of threads that are supposed to run on all CPUs.. Simply put, application writer will have to manage memory allocation and usage in the way so that there are as little remote DRAM requests as possible.
snajpa - Sunday, January 6, 2013 - link
About Windows scheduler - I only worked with Windows XP, now I don't have any reason to work with Win anymore, so what you say probably is really true. As for the linux versions - well, long story short, CFS sucks and everyone knows it - this is particularly noticeable if you have fully virtualized VMs which appear as one single process at the host system - the process is randomly swapped between CPU cores and even CPU dies.... sad story. That's why people have to pin their CPUs to their tasks manually.psyq321 - Sunday, January 6, 2013 - link
Ah, XP - that explains it. True, XP did not care about NUMA at all.Windows Server 2008 / Vista introduced NUMA-aware memory allocations, and changed their CPU scheduler so it does not move the thread across NUMA nodes. They will also try to allocate the memory from the thread's own NUMA node when legacy VirtualAlloc etc. APIs are used.
Windows Server 2008 R2 / Windows 7 introduced the concept of CPU groups - allowing more than 64 CPUs. This does require some adaptation of the application, as old threading APIs only work with 64-bit affinity bitmask which only allowed recognizing 64 CPUs. Now, there is a new set of APIs that work with GROUP_AFFINITY structure, allowing control of CPU groups, too. However, this needs explicit change of the legacy process/threading APIs to the new ones.
Furthermore, none of the above can replace some manual intervention*- while Windows scheduler will, indeed, respect NUMA node boundaries and not try to mess around with moving threads across them - it still does not know what the underlying algorithm wants to do.
* There is no need to set the thread affinity to one specific CPU anymore - this prevents running the thread on any other CPU completely. Instead, there is an API called SetThreadIdealProcessor(Ex) which signals Windows scheduler that thread >should< run on that particular CPU - but, under certain circumstances the scheduler can move the thread somewhere else - if the CPU is completely taken away by some other thread/process. Scheduler will try to move the thread as close as possible - to the next core in the socket, for example - or to the next core in the group (group is always contained within a NUMA node).
You can, however, absolutely forbid Windows scheduler from passing the thread to another NUMA node under any circumstances by simply getting the said NUMA node affinity mask (GetNumaNodeProcessorMask(Ex)) and setting this affinity as a thread affinity. This + setting the "ideal" processor still gives Windows scheduler some headroom to move the thread to another core if it is found to be better in a given moment, but it will not even attempt to cross the NUMA boundary in any case whatsoever.
lmcd - Monday, January 7, 2013 - link
While I haven't personally researched them, there are tons of other schedulers that have been written for Linux and I'm certain *at least* one of them is more fitting to this line of work. I've heard of alternatives like BFS and the Linux kernel is so widely used I'm sure there's a gem out there for this application.toyotabedzrock - Saturday, January 5, 2013 - link
Have you ever tried the Intel Math Kernel Library? It might speed up some of the equations. It also hands off work to the Intel MIC card if it thinks it will speed it up.http://software.intel.com/en-us/intel-mkl/
KAlmquist - Saturday, January 5, 2013 - link
The GA-7PESH1 motherboard is $855, and the CPU's are $2020 each, which adds up to $4895. On tasks which don't parallelize well, you can get similar performance from the i7-3770K, which costs an order of magnitude less. (Prices: i7-3770K $320, ASRock Z77 Extreme6 motherboard $152, total from motherboard and CPU $472.) On tasks which parallelize well enough that they can be run on a GPU, the system with the GA-7PESH1 will beat the i7-3770K, but will be crushed by a midrange GPU. So the price/performance of this system is pretty bad unless you throw just the right workload at it.The motherboard price from super-laptop-parts dot com, and the other prices are from a major online retailer that I won't name in order to get around the spam filter.
Death666Angel - Saturday, January 5, 2013 - link
So, your 3770K has ECC memory or VT-D, TXT etc.?nevertell - Sunday, January 6, 2013 - link
The K version may not, but the standard i7-3770 does in fact support VT-D, TXT and ECC memory from the get go. Vt-D has to be also supported by the motherboard, which may be problematic on consumer motherboards. I have a i5-2400 myself, and Vt-d is a pain to setup and to this day I still haven't found out whether is it that I am unable to set up Xen properly or just that my cheap motherboard worn't support VT-d, to properly assign a video card to a virtual machine.KAlmquist - Sunday, January 6, 2013 - link
The 3770K lacks those features, but that doesn't invalidate my point.
Using ECC memory improves system availability, and likely decreases the probability of undetected errors resulting in incorrect computations. If these are important to you, then you should be thinking about full double or triple redundancy. Why not buy three 3770K based systems and run the same simulation on all of them? Most of the time you will get identical results on all three systems, but on rare occasions one of the systems will die during the run. No problem; you have the simulation results from the other two systems. On even rarer occasions, one of the systems will produce an incorrect result due to an undetected bit error. Again, no problem; you take the results from the two simulations that agree.
With full redundancy it doesn't matter where in the system the error occurs because full redundancy addresses faults anywhere in the system. This makes it superior to ECC memory, which only addresses faults in the memory subsystem. So the only reason to go with ECC memory instead of full triple redundancy is if the ECC memory approach costs less. Based on the numbers I posted, you aren't going to get a lower cost based on hardware costs alone. Possibly you could get there by including administrative costs and the like.
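As a toy illustration of the 2-of-3 idea (the names and the tolerance are made up, not anyone's production code):

// Sketch: pick a result set that is confirmed by at least one other run.
// Returns 0, 1 or 2 for a confirmed run, or -1 if no two runs agree.
#include <cmath>
#include <vector>

static bool agree(const std::vector<double>& x, const std::vector<double>& y,
                  double tol = 1e-12)
{
    if (x.size() != y.size()) return false;
    for (size_t i = 0; i < x.size(); ++i)
        if (std::fabs(x[i] - y[i]) > tol) return false;
    return true;
}

int vote(const std::vector<double>& a, const std::vector<double>& b,
         const std::vector<double>& c)
{
    if (agree(a, b) || agree(a, c)) return 0;   // a matches another run
    if (agree(b, c)) return 1;                  // only b and c match
    return -1;                                  // all three disagree
}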
I'm not saying that the system Ian tested wouldn't make sense under *any* circumstances. My point is that the system has a poor price/performance ratio, so it only makes sense when a lot of things are working in its favor.
The second feature you mention is VT-d, which allows physical devices to be assigned directly to virtual machines rather than emulated. I don't have any benchmarks, but my guess is that the performance improvement from VT-d is fairly small. In any case, if you want VT-d you can buy the 3770 rather than the 3770K. You can't overclock the 3770, but my comments about the 3770K offering "similar performance" were based primarily on the performance of the 3770K at stock frequency. If you assume that everyone is going to take the time to find an optimal overclock for their CPU, then the E5-2690 (which cannot be overclocked) looks even worse.
I suppose it's off topic to debate the merits of "trusted execution technology" here, so I will simply note that if for whatever reason you want a processor that supports it, the solution is the same as for VT-D: get the 3770 instead of the 3770K.
Kevin G - Saturday, January 5, 2013 - link
A very well written article that sticks to its purpose: scientific computing. Really pleased to see articles like this on the site, even if I have a few minor quibbles.
On page 2, "To those unfamiliar with server boards, of note is the connector just to the right of center of the picture above." is either oddly worded to describe the front panel connector at the bottom of the board (which is indeed right of center, but not in the center of the picture) or describing a connector that isn't even documented in the manual. For clarification, I'm looking at the connector just right of the top PCI-E x16 slot (above and to the left of the battery). Actually, what is that connector labeled as? I've seen it on other Xeon boards but have never seen it used.
The last paragraph on page 2 omits the possibility of unbuffered ECC memory and implies the use of unbuffered non-ECC memory. I haven't found confirmation that this board can accept unbuffered, non-ECC memory (as opposed to an ECC requirement, which some server vendors enforce).
A couple of notes on the processor discussion on page 6. Dealing with cache thrashing between L3 and L2 is possible, but when dealing with a high number of threads, general coherency becomes a bigger factor. The overhead begins to exceed the benefit of having the additional hardware to run them. If you're lucky enough to be dealing with an algorithm that doesn't need such coherency overhead, then chances are it is well suited to GPU compute (provided memory capacity isn't a factor). A minor nicety would have been some more testing without Hyper-Threading on the i7-3770K, i7-3930K, and i7-3960X to better indicate scaling with and without Hyper-Threading. I suspect that those single socket processors would have been able to show some small gains with Hyper-Threading where the dual socket system did not.
An extension to the L2/L3 cache talk on page 6 is the move to dual sockets and NUMA. There is a latency penalty when one thread accesses memory attached to the remote socket. Memory mirroring between sockets can eliminate that remote penalty while increasing RAS, but at the cost of halving effective memory capacity. The manual isn't clear whether mirroring mode or lockstep mode works across different sockets (it can also be done across memory channels).
I also would have loved to see some comparisons with the Gigabyte GA-X79S-UP5. While the name implies an X79 chipset, it uses the C606 chipset. It'll support ECC memory with socket 2011 Xeons and has plenty of overclocking features (for the daring). Comparing the GA-7PESH1 to the GA-X79S-UP5 would have answered whether the move to dual sockets is worth the extra cost.
Hakon - Saturday, January 5, 2013 - link
Somehow this does read like an anonymous peer review :-)
Kevin G - Sunday, January 6, 2013 - link
A little bit. :)
Part of my criticism isn't about the article itself but rather the general state of massively multithreaded hardware and software. The hardware is quickly running into software limitations that were never expected to be reached in the professional space. A decade ago, who thought that a scientist could purchase a 240-simultaneous-thread processor that would fit on a mere expansion card? In some cases we don't even reach Amdahl's Law before hitting an artificial barrier due to scheduling or coherency overhead.
I just noticed that the system was using Win 7 Professional, which has a limit of 64 concurrent threads per process. A quad socket LGA 2011 config would actually be at the very limit of what Windows 7 (or rather 2008 R2, since Professional only scales to two sockets) can handle. OpenMP can handle more than 64 concurrent threads, but on Windows it has to submit to this limitation.
psyq321 - Sunday, January 6, 2013 - link
As for the GA-X79S-UP5: the clocking features only work for 1P Xeons, which are basically similar to the HEDT i7 (36xx) line. With those, the customer gets the advantage of ECC RAM support and still some overclocking headroom.
Clocking 2P/4P Xeon E5s (sadly, these are the only 8-core parts so far) is next to impossible due to the lack of ICC configuration data that would allow changing the BCLK ratios. These Xeons can only be bumped by a direct BCLK increase, which is dangerous above a few MHz. At most 5-6 MHz is feasible, as tested on the ASUS Z9PE-D8-WS and EVGA SR-X boards.
Memory overclocking is another matter completely. I have had excellent results with Samsung's 1.35v ("low voltage") ECC RAM. It is not just the cheapest 16 GB ECC option (~$160 for a 16 GB ECC stick the last time I checked; I got mine for 140 EUR in Germany 7 months ago), but it is also the fastest while still keeping the low voltage. This RAM can be overclocked to 2133 MHz with a simple voltage bump to 1.55v, which is still within the Xeon's VSA limits.
Kevin G - Sunday, January 6, 2013 - link
Weird that Intel doesn't provide the ICC configuration data. The 'gear ratio' change is something I'd still expect to work on true X79 boards regardless of the processor (I can see Intel crippling this on the C600 series). Then again, I've heard of some weird situations with LGA 2011 Xeons in desktop boards. There are scattered reports of unlocked chips, but as the internet goes, there is a lot of speculation and rumor and little real confirmation.
Those Samsung 16 GB ECC sticks are registered? I thought the GA-X79S-UP5 didn't support registered DIMMs.
As for the ability to overclock those low voltage DIMMs, I'm not really surprised, as they've historically been impressive in that regard. I have some older 4 GB 1.35v DDR3-1333 rated sticks that can go to 1866 MHz at 1.5v. :) The timings had to be changed, but still impressive.
PEPCK - Saturday, January 5, 2013 - link
Worth noting that the three mini-SAS connectors yield the 8 SAS and 4 SATA ports in the specification table.
krumme - Sunday, January 6, 2013 - link
For this article Ian gets the über-nerd Gold Award, only given once in a century.
lowenz - Sunday, January 6, 2013 - link
A brilliant article. More of these, please.
dj christian - Monday, January 14, 2013 - link
No please! This article should be a one-time thing, or once every 2 years at most.
nadana23 - Sunday, January 6, 2013 - link
From the looks of the results, some of the benchmarks are HIGHLY sensitive to effective bandwidth per thread (i.e. GDDR5 feeding a GPU stream processor >> DDR3 feeding a Xeon HT core).
However, it must be noted that 8 DIMMs is insufficient to achieve full memory bandwidth on a 2S Xeon E5 setup!
I'd suggest throwing a pure memory bandwidth test into the mix to make sure you're actually getting the rated number (51.2GB/s)...
http://ark.intel.com/products/64596/Intel-Xeon-Pro...
... as I strongly suspect your memory config is crippling results.
Dell's 12G config guidelines are as good a place as any to start on this :-
http://en.community.dell.com/cfs-file.ashx/__key/c...
Simply removing one E5-2690 and moving to a 1-package, 8-DIMM config may (counter-intuitively) bench(market) faster... for you.
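For the bandwidth check suggested above, a crude triad-style loop is enough - a rough sketch below (sizes and thread handling are illustrative only, not tuned, and first-touch memory placement on a 2S NUMA box will affect the number you see):

// Sketch: a crude STREAM-triad style bandwidth check.
// Arrays should be well beyond the L3 caches; sizes here are arbitrary.
#include <omp.h>
#include <vector>
#include <cstdio>

int main()
{
    const long long N = 1LL << 27;               // ~1 GiB per array of doubles
    std::vector<double> a(N, 0.0), b(N, 1.0), c(N, 2.0);
    const double scalar = 3.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long long i = 0; i < N; ++i)
        a[i] = b[i] + scalar * c[i];             // triad: 2 reads + 1 write
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * N * sizeof(double) / 1e9;
    std::printf("Triad: %.1f GB/s\n", gbytes / (t1 - t0));
    return a[0] == b[0] + scalar * c[0] ? 0 : 1; // keep the loop from being optimized away
}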
dapple - Sunday, January 6, 2013 - link
Great article, thanks! This is the sort of benchmark I've been wanting to see for quite some time now - simple, brute-force numerics where the code is visible and straightforward. Too many benchmarks are black boxes with processor- and compiler-specific tunes to make manufacturer "X" appear superior to "Y". That said, it would be most illustrative to perform a similar 'mark using vanilla gcc on both MS and *nix OSes.
daosis - Sunday, January 6, 2013 - link
It is a long known issue that Windows sometimes does not start after changing hardware, especially the GPU (though not always). There is an equally long known trick for it. Just before the last "power off", replace the GPU's own driver with the basic Microsoft one - in the case of a GPU that is "Standard VGA Adapter" (Device Manager - Update Driver - Browse my computer - Let me pick). In fact you can replace all the specific drivers in the OS with the similar basic ones from MS and then put that hard drive into virtually any system without any need for a fresh install. Mind you, after the first boot it takes some time for the OS to find and install the specific drivers.
jamesf991 - Sunday, January 6, 2013 - link
In the early '70s I was doing very similar simulations using a PDP 11/40 minicomputer. (I can send citations to my publications if anyone is interested.) At Texas Tech and later at Caltech, I simulated systems involving heterogeneous electron transfer kinetics, various chemical reactions in solution, coulostatics, galvanostatics, voltammetry, chronocoulometry, AC voltammetry, migration, double layer effects, solution hydrodynamics (laminar only), etc. Much of this was done on a PDP 11/40, originally with 8K words (= 16K bytes) of core memory. Later the machine was upgraded to 24K words (!), we got a floating point board, and a hard disk drive (5M words, IIRC). My research director probably paid in excess of $50K for the hardware. One cute project was to put a simulation "inside" a nonlinear regression routine to solve for electrode kinetic parameters such as k and alpha. Each iteration of the nonlinear solver required a new simulation -- hand-coding the innermost loops using floating point assembly instructions was a big speedup!
I wonder how the old PDP would stack up against the 3770?
flynace - Monday, January 7, 2013 - link
Do you guys think that once Haswell moves the VRM on package, someone might do a 2-socket mATX board?
Even if it means giving up 2 of the 4 PCIe slots and/or 2 DIMMs per socket, it would be nice to have a high core count standard SFF board for those that need just that.
samsp99 - Monday, January 7, 2013 - link
I found this review interesting, but I don't think this board is really targeted at the HPC market. It seems like it would be good as part of a 2U / 12 + 2 drive system, similar to the Dell C2100. It would make a good virtualization host, SQL server, active web server, etc. Having the 3 mSAS connectors would enable 4 drives each without the need for a SAS expander.
Servers are designed for 99.999% uptime, remote management, and hands-off operation. To achieve that you need redundant power, UPS, networking, storage, etc. They also require high airflow, which is noisy and not something you want sitting under your desk. Based on that, it makes sense that the MB is intended for sale to system builders, not your general build-your-own enthusiast.
HW manufacturers are faced with a similar problem to airlines - consumers gravitate to the cheapest price, so the only real money to be made is in selling higher-margin products to businesses. Servers are where Intel etc. make their profits.
For the computational problems the author is trying to solve, it would seem better to me to consider:
a) At one point, I think Google was using commodity hardware with custom shelving etc. Assuming the algorithms can be parallelized across different hosts, you shouldn't need the reliability of traditional servers, so why not use a number of commodity systems together, choosing the components with the best perf/$?
b) There are machines designed for HPC scenarios, such as the HPC Systems E5816, which supports 8x Xeon E7-8000 (10-core) processors, or the E4002G8, which will take 8 nVidia Tesla cards.
c) What about developing and testing the software on cheap workstations, and then, when you are sure it's ready, buying compute time from Amazon cloud services etc.?
babysam - Monday, January 7, 2013 - link
It is quite a delight to see your review on AnandTech (especially since I am using software and computer configurations of a similar nature for my studies), as it is quite difficult for me to evaluate the performance gain of "real-life" software (i.e. science oriented in my case) on new hardware before buying.
From what I have seen in the code segments provided (especially for the n-body simulation part), there are a large number of floating-point divisions. Is there any possibility that the code is limited not only by the cache size (and thrashing), but also by the limited throughput of the floating-point divider? (i.e. the performance degradation when HT is enabled may also be caused by the two running threads competing for the single floating-point divider in the core)
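If the divider really is the bottleneck, one cheap experiment is to hoist a reused divisor out of the inner loop and multiply by its reciprocal instead - a toy sketch of the idea below (this is not the article's code, and the results are not bit-identical to true division):

// Toy illustration: trade one divide per element for a single reciprocal.
// Multiplies pipeline far better than divides on current cores.
void scale_naive(double* x, int n, double d)
{
    for (int i = 0; i < n; ++i)
        x[i] = x[i] / d;          // one divide per element
}

void scale_recip(double* x, int n, double d)
{
    const double inv = 1.0 / d;   // single divide up front
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * inv;        // multiply instead of divide
}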
SanX - Tuesday, January 8, 2013 - link
It would be nice if you posted zipped sources and exes for anyone to follow, learn, play with, argue about and eventually improve. I'd also have preferred to see Fortran sources and benchmarks where possible.
Intel/AMD should start promoting 2/4/8-socket monster mobos for enthusiasts and then the general public, since this is the beginning of an era of multiprocessing that is here to stay.
Also, where are the game benchmarks - for example GTA4, which benefits a lot from multiple cores as well as from GPUs?
IanCutress - Wednesday, January 9, 2013 - link
The n-body simulations are part of the C++ AMP example page, free for everyone to use. The rest of the code is part of a benchmark package I'm creating, hence I only give the loops in the code. Unfortunately I know no Fortran, so no Fortran benchmarks.
Most mainstream users (i.e. gamers) still debate whether 4 or 6 cores are even necessary, so moving to 2P/4P/8P is a big leap in that regard. Enthusiasts can still get the large machines (a few folders use quad AMD setups) if they're willing to buy from eBay, which may not always be wholly legal. You may see 2P/4P/8P becoming more mainstream when we start to hit process node limits.
Ian
JonnyDough - Tuesday, January 8, 2013 - link
I don't know if I speak for everyone, but I would really love to see some gaming benchmarks.
I realize that this system is not designed or optimized for gaming, but it would be interesting nonetheless to see what two processors do, or do not do, for gaming. :)
npcomplete - Tuesday, January 8, 2013 - link
...it just gets to the meat of computing!
Thanks for this article. It woke up the scientist in me.
esung - Wednesday, January 9, 2013 - link
I'm very curious about the result. Have you tried to bench one 2690 against two 2690s? It almost looks like the benchmarks are limited by CPU frequency rather than by the number of threads/cores.
CodeToad - Saturday, January 19, 2013 - link
Ian - I really enjoyed reading your effort here. There is a large, and I think underserved, community of scientific users who need this kind of information. Digging through IEEE/ACM communications is often just too much. Doing the work here - or anywhere - is a real help.
I'm a retired economist (PhD Chicago, '81) and (in my case) thankfully haven't done physical, much less computational, chemistry since undergrad. Nevertheless, we have similar technical needs.
I've become a huge fan of open source software. In my "home lab," which my wife calls The Frat House, some grad students and I have been diligently working with the R Language (statistics), nascent risk and optimization tools, and a mash-up of database, data warehouse, and "business intelligence" tools, all open source. The goal someday -- beat SAS silly and obviate that $100-300K price tag!
The more demure and do-able daily work is just cleaning up and optimizing open source code, contributing that back as individual packages. The "hits" and email indicate a good adoption rate.
Ian, CUDA is of big interest to the people we're in communication with, and I have to admit some real fascination personally. As you have real-world experience, how about a series of articles? I hope AnandTech would support that work!!
Very best to you - hope to be "reading" you soon!!