53 Comments
nathanddrews - Wednesday, July 3, 2013 - link
You tease! Hook up a monitor, throw 4x Titans in there and put up some gaming benchmarks! ;-)
jibberegg - Wednesday, July 3, 2013 - link
^ This. 4K article again! :)
Robert Pankiw - Wednesday, July 3, 2013 - link
He only had remote access to the system; he mentions that he wanted to do gaming benches, but didn't have access.
nathanddrews - Wednesday, July 3, 2013 - link
Hence my request that he connect a monitor and toss in some Titans...
extide - Wednesday, July 3, 2013 - link
Wow, are you just ignorant or did you not read the article at all?
ShieTar - Thursday, July 4, 2013 - link
Now, now, no reason to get so aggressive. Surely, all he did was abbreviate his suggestion a little, and he really meant to say:
Break into the SuperMicro offices, toss some single-slot Titans into the only four existing PCIe slots in this rather compact workstation, hook up a monitor, keyboard and mouse, find out what games will even install and run on Windows Server, and then explain to us in a lot of detail why games work even worse on multiple physical CPUs than some of those scientific benchmarks do. Start with something like "Well, turns out the Supermicro X9 doesn't actually support SLI."
It's a $20,000 system: $1k for the board, 4 x $3k for the CPUs, another $3k for the RAM, and then some for the SAS drives. The number of gamers considering buying this is exactly zero. Testing what a multi-CPU system does to gaming is fun, but that's why ASUS and EVGA have given us a number of dual-socket boards with SLI support. Installing and benchmarking a game on a quad-CPU, single-GPU board with a C602 chipset would be a horrible waste of Ian in my book.
mmstick - Wednesday, July 3, 2013 - link
>> Windows Server
Well, there is your problem: you aren't using Linux, the only logical choice for servers, especially for multiprocessor systems and clusters.
coder543 - Wednesday, July 3, 2013 - link
Exactly.
mitcoes - Thursday, July 4, 2013 - link
Of course. And why not test an actual GNU/Linux distribution against MS Windows Server 2012? More than 90%, maybe even 95%, of servers run Linux, and this model should be tested with its target OSes. CentOS, a Red Hat clone, would be the best choice, or the not-cheap RHEL and SUSE Enterprise, of which you can surely get a free copy for testing. That would better serve the IT guys deciding whether or not to pick this machine.
After these results, benchmarking with MS Windows Server 2012 suggests nobody is going to buy it for Windows Server, but perhaps it can be a good choice for a Linux server running some virtualized machines. Even Xen or QEMU running Windows + AutoCAD with a VGA passthrough configuration would be a great test.
ShieTar - Thursday, July 4, 2013 - link
So, the only manufacturer of quad-CPU boards has absolutely no clue about multi-CPU systems, and is consistently running the wrong OS on its own test installations? Windows Server is a profitable product for MS, it has an existing market share (see, for example, http://w3techs.com/technologies/overview/operating... ) and it does not exactly cripple multi-CPU performance for software which supports it. Just look at the PovRay benchmark in this very article, or read some well-written material provided by MS on the topic: http://goo.gl/A6f23 .
Informing people about Linux as an option, and clarifying its capabilities and benefits, is something I can get behind, but being an obnoxious Linux fanboy won't convince anybody of anything.
FunBunny2 - Friday, July 5, 2013 - link
Well, Windoze is a lame OS, no matter what the fanboys say. OTOH, SQL Server is a very good database, up to its limits. But that means using Windoze. If one goes the *nix way, then Oracle/DB2/Postgres are the databases to choose among.
Multiprocessor systems are more appropriate as heavyweight (for some definition of heavy) database machines. They can exploit CPU/RAM/SSD more than any other application.
Friendly0Fire - Wednesday, July 3, 2013 - link
And you probably didn't read the bit where he said he was given *remote* access to the server? He can't go about formatting everything like it's his own toy box.
coder543 - Wednesday, July 3, 2013 - link
He tried out multiple versions of Windows Server. He seems to be using a very serious version of Remote Desktop... either that, or he does in fact have access to someone at SuperMicro who can format things for him. But, the most important thing to say about this whole affair: even SuperMicro, the builders of this desktop, could not get all 64 cores working on Windows.
patrickjchase - Thursday, July 4, 2013 - link
He almost certainly used SuperMicro's IPMI 2.0 KVM-over-IP solution, which provides a remote desktop (including local optical storage and USB proxies) at the HW level. Doing BIOS setup and an OS install from remote DVD media (i.e. the media is physically at Ian's location instead of at the server) is a piece of cake.
Kevin G - Wednesday, July 3, 2013 - link
I've reformatted and installed an OS before with only remote access. It really depends on what kind of remote access the user is limited to.
Heavensrevenge - Wednesday, July 31, 2013 - link
You don't need to consider Windows Server in order to run that type of hardware; just get Win 8 on there, since it can support up to 640 logical CPUs as far as I'm aware. So... yeah, I wish Linux gaming was benchmarkable, but it really still isn't in terms of graphics performance; only CPU benchmarks would be meaningful, but definitely not GPU testing in a Linux distro.
And anyway, Phoronix does a wonderful job on the Linux side of benchmark land.
See http://blogs.msdn.com/b/b8/archive/2011/10/27/usin... for some stupidly extreme CPU Task Manager action.
coder543 - Wednesday, July 3, 2013 - link
"The main issue moving to 4P was having an operating system that actually detected all the threads possible and then communicated that to software using the Windows APIs. In both Windows Server 2008 R2 Standard and 2012 Standard, the system would detect all 64 threads in task manager, but only report 32 threads to software."I found your problem: Windows. When we look at the top500 list, you know what we don't see in cases of HPC? Windows. http://www.top500.org/statistics/list/ (and look at Operating Systems)
Would it kill Anandtech to use Linux once in a while?
coder543 - Wednesday, July 3, 2013 - link
Also, by the numbers.
Linux in HPC: 83.4%
Windows in HPC: 0.6% (0.4% from Windows HPC 2008, 0.2% from Windows Azure)
Remember, in the top500 list, 0.2% is 1 supercomputer. You do the math.
coder543 - Wednesday, July 3, 2013 - link
And on 2nd glance, it looks like I didn't think to add all of the specific distros into the total.
Linux in HPC has at least 93.4% of the total top500 market share.
lmcd - Wednesday, July 3, 2013 - link
And the rest is probably some combination of BSD or proprietary builds of BSD with "secret sauce."
lmcd - Wednesday, July 3, 2013 - link
Sadly, it seems like it might. Which is ironic given the technical nature of this site -- seems like exploring BSD versus Linux or some complex breakdown would be within the scope of the site.
Given the coverage of Android, again, it seems relevant to explore the typical architecture of a Linux distribution, the future architecture, the current Android setup, and the current Chrome OS setup.
I'm hoping for an article, which kinda sucks as I watch pipelines pop by for Microsoft's late C++ efforts, among other failures.
lmcd - Wednesday, July 3, 2013 - link
*at least one article -- didn't make that clear
Kevin G - Wednesday, July 3, 2013 - link
This is actually an old Windows API issue. While a piece of software can scale to a near-infinite number of threads per process (only limited by address space), the Windows scheduler will only run a maximum of 32 per process concurrently. Even MS SQL Server only supports a maximum of 32 threads per DB on a single system (MS SQL Server will spawn another process per DB to scale higher as necessary).
Though with 32 real cores, it may pay off to simply disable HyperThreading for better scaling.
mike8675309 - Monday, July 8, 2013 - link
To clarify, this seems to be an issue with 32-bit software running on 64-bit hardware and making Windows API calls while running under WOW64. A good example is noted in the remarks of the API documentation for the GetLogicalProcessorInformationEx function, which describes issues with passing a 64-bit KAFFINITY structure to a 32-bit client and the side effects that can cause.
http://msdn.microsoft.com/en-us/library/windows/de...
As noted by the author in the article, creating software that benefits from NUMA rather than being hamstrung by it requires another layer of knowledge on top of single-CPU software development. I'm sure Microsoft has figured out NUMA with MS SQL Server, considering the prevalence of multi-CPU solutions for that product essentially since multi-CPU hardware for Windows became common. Note TPC result id 112032702 for NEC running Windows Server R2 Enterprise and SQL Server 2012 Enterprise on 8 processors, 80 cores, and 160 threads.
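For readers who want to see what that API returns on their own machines, a minimal sketch (my illustration, not from the article) that enumerates physical cores with GetLogicalProcessorInformationEx. It assumes Windows 7 / Server 2008 R2 or later and a 64-bit build, which sidesteps the WOW64 KAFFINITY truncation described above:

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* First call intentionally fails with ERROR_INSUFFICIENT_BUFFER
       and tells us how big the buffer must be. */
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, NULL, &len);

    BYTE *buf = (BYTE *)malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
        return 1;

    /* Records are variable-sized; walk them via the Size field. */
    int cores = 0;
    for (DWORD off = 0; off < len;) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + off);
        if (info->Relationship == RelationProcessorCore)
            ++cores; /* one record per physical core; masks are per-group */
        off += info->Size;
    }
    printf("Physical cores: %d\n", cores);
    free(buf);
    return 0;
}
```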
psyq321 - Friday, July 5, 2013 - link
Windows has had no problems with more than 64 logical CPUs since kernel version 6.1.
The problem is that the >application< itself has to use the updated Win32 APIs which allow an extended processor mask to be set.
If the application is using old Win32 APIs (pre NT Kernel 6.1) then it will only "see" up to 64 logical CPUs.
psyq321 - Friday, July 5, 2013 - link
By the way, the number 32 as the limit of the number of CPUs seen by the app comes from 32-bit processes.
With Windows:
- 32-bit process has 32-bit processor mask for each thread (DWORD)
- 64-bit process has (pre Win NT 6.1 API) a 64-bit mask for each thread (DWORD_PTR)
If an application needs to access more than 64 CPUs, it has to use the new Win32 APIs that were introduced in Windows 7 / Server 2008 R2.
See here: http://msdn.microsoft.com/en-us/library/windows/ha...
The keyword is "processor groups", and APIs that deal with group affinity.
So, I would suggest to the reviewer to get acquainted with this if he intends to keep using Windows Server 2012 (or later) as the test vehicle.
In the Xeon E5 case based on Sandy Bridge-EP this should still not be a problem, as long as the reviewer uses 64-bit processes, because the Xeon E5-4600 does not support more than 64 logical CPUs.
However, Ivy Bridge EP already can have more than 64 logical processors with the E5 4600 v2 line. Having more than 64 logical CPUs was already possible with Xeon E7 platform based on Boxboro generation, and it will get even more scalable with Ivy Bridge EX.
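To make the processor-group point concrete, here is a minimal sketch (my own, not from the article) that spawns one worker per group and binds it there with SetThreadGroupAffinity, the post-6.1 mechanism that lets a single process reach past 64 logical CPUs. It assumes Windows 7 / Server 2008 R2 or later and a 64-bit build:

```c
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI Worker(LPVOID param)
{
    GROUP_AFFINITY ga = { 0 };
    ga.Group = (WORD)(ULONG_PTR)param;
    DWORD n = GetActiveProcessorCount(ga.Group);
    /* Allow every logical processor in this group (guard the 64-bit shift). */
    ga.Mask = (n >= 64) ? ~(KAFFINITY)0 : (((KAFFINITY)1 << n) - 1);
    SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL);
    /* ... NUMA-local work for this group's cores would run here ... */
    return 0;
}

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("groups=%u, logical CPUs=%lu\n",
           groups, GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));

    HANDLE h[64];
    if (groups > 64) groups = 64;   /* keep the sketch's join simple */
    for (WORD g = 0; g < groups; ++g)
        h[g] = CreateThread(NULL, 0, Worker, (LPVOID)(ULONG_PTR)g, 0, NULL);
    WaitForMultipleObjects(groups, h, TRUE, INFINITE);
    return 0;
}
```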
Jaybus - Monday, July 8, 2013 - link
That processors are grouped is more important than the number of processors. For NUMA architectures, all logical processors belonging to a physical CPU (with or without hyperthreading) will belong to the same group. The SetProcessAffinityMask() Windows function can be used to prevent the scheduler from assigning the process's threads a logical processor that doesn't belong to the same group. This way all threads in that process always run on cores that have the same fast memory access.
The process affinity mask essentially allows using a subset of the NUMA hardware as if it were an SMP system. If you have, say, 4 processor groups, then you have to manually divide the data up into 4 sections handled by 4 processes so that each group of threads operates on its own section with SMP memory access. MPI is then used to tie the 4 processes together, just like using a cluster. The difference is that the message passing on the NUMA system is faster than on a cluster of separate physical servers, but basically it maps the NUMA system as a cluster of independent SMP systems.
Data-dependent algorithms will greatly benefit from using the process affinity mask. Since a system like this doesn't make sense for data-independent algorithms (where GPU hardware would be faster and cheaper), only software designed for NUMA systems should be compared.
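A hedged sketch of that "NUMA as a cluster of SMP nodes" pattern, assuming MS-MPI (or any MPI), exactly 4 ranks launched via mpiexec -n 4, and 16 logical CPUs per socket; both numbers are illustrative assumptions, not a statement about the review system's configuration:

```c
#include <windows.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Confine every thread of this rank to its own socket's 16 logical
       CPUs, using SetProcessAffinityMask as described above. */
    DWORD_PTR mask = (DWORD_PTR)0xFFFF << (rank * 16);
    SetProcessAffinityMask(GetCurrentProcess(), mask);

    /* ... compute on this rank's quarter of the data, then exchange
       boundary values with ring neighbours exactly as on a real cluster. */
    double halo[1024] = { 0 };
    MPI_Sendrecv_replace(halo, 1024, MPI_DOUBLE,
                         (rank + 1) % 4, 0,    /* send to next rank       */
                         (rank + 3) % 4, 0,    /* receive from previous   */
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```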
aicom - Wednesday, July 3, 2013 - link
This is exactly why we all aren't running 16+ cores in our desktops. It doesn't make sense for the majority of today's workloads.
lmcd - Wednesday, July 3, 2013 - link
This is more a statement of why unified memory and cache are important to performance computing. I'd like to note that the 6-core 3930X beat the 4770K on all but the few single-threaded benchmarks, and the Xeon 8-core (I think it's 8-core?) beat the 3930X.
There are plenty of applications that scale up with core count. They just don't scale up with multiple sockets and slow interconnects between those cores.
BMNify - Monday, March 17, 2014 - link
bbb but, 25.6 GB/s QPI is supposed to be good <Not in 2014>, we don't need no lowest-power stinkin NoC (Network On Chip) at 1 Terabit/s, 2 Terabit/s like those ARM interconnects today
"Intel describes the data throughput (in GB/s) by counting only the 64-bit data payload in each 80-bit "flit". However, Intel then doubles the result because the unidirectional send and receive link pair can be simultaneously active. Thus, Intel describes a 20-lane QPI link pair (send and receive) with a 3.2 GHz clock as having a data rate of 25.6 GB/s. A clock rate of 2.4 GHz yields a data rate of 19.2 GB/s. More generally, by this definition a two-link 20-lane QPI transfers eight bytes per clock cycle, four in each direction.
The rate is computed as follows:
3.2 GHz
× 2 bits/Hz (double data rate)
× 16(20) (data bits/QPI link width)
× 2 (unidirectional send and receive operating simultaneously)
÷ 8 (bits/byte)
= 25.6 GB/s"
floobit - Wednesday, July 3, 2013 - link
I notice that Anandtech tries to appeal to both industrial and enthusiast circles, and I appreciate how hard that is. It seems like this article is targeted at the industrial/HPC segment, however, and I think that a standard benchmark for HPC should include some codes frequently used in HPC. Everyone knows that Gaussian will leave a horse's head on your pillow if you try to benchmark their software, but you could easily run a massive DFT with the parallelized GAMESS, and I've seen previous articles benchmark Monte Carlo codes. Both chemists and wall street types would be interested in that. CFD programs are very popular with engineers; OpenFOAM is a popular option.
MrSpadge - Wednesday, July 3, 2013 - link
Monte Carlo is pretty much the definition of perfect scaling, as there are no dependencies between individual runs / setups / whatever. And you need many of them for statistics anyway.
floobit - Friday, July 5, 2013 - link
Yes, Monte Carlo codes are theoretically infinitely parallelizable, though as mentioned previously, often specific implementations do not meet that ideal. Large CFD jobs are also well-parallelizable for some portions of the calculation. 3DS's Abaqus can auto-partition large models and process each in a separate thread, for instance.
THF - Wednesday, July 3, 2013 - link
This rig costs the best part of £16,000 + VAT (for example here: http://www.rackservers.com/Configurator.aspx?S=153... )
I'd really appreciate a comparison with the 4x Opteron 6380, which costs about half the price...
maloman - Wednesday, July 3, 2013 - link
As a scientist myself, I would be very interested to see how this scales with standard matrix operations using MATLAB's Parallel Computing Toolbox. I have noticed that on our grids (Xen domains running Torque for queuing) the only real speed advantage has been in Tesla GPU compute. The CPUs can take care of the overhead of a grid, but essentially it comes down to programming, as stated in the article. Custom code is the only way, and in most scientific applications the only availability. Thus, testing high-level languages with inherent multiproc support (parfor etc.) would be suuuper interesting to see. Thank you for the great read.
Jaybus - Monday, July 15, 2013 - link
Yes, it is problem specific. Data-independent operations, such as matrix multiplications, linear transforms, etc., are far better suited to GPU compute. But consider a problem solved by an iterative calculation where the result of one iteration depends on the result(s) of previous iterations. GPU compute is inherently unsuited for such data-dependent problems. Many real-world problems have a data dependence, but the dependence is on a separate calculation that is itself data-independent.
Even within the HPC world, the hardware choice depends on the problems the system is to be used for. But to aim for as general-purpose a system as can be had, it makes sense to use something like this 4-processor board along with several Tesla cards in its PCIe slots.
So the bottom line is that an HPC benchmark suite should contain a mix of problems. A simple matrix multiply will always be unfairly weighted towards GPU compute and will not be representative of a system's general HPC capabilities.
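A toy contrast of the two cases (illustrative only): the first loop is data-independent, so it parallelizes trivially, much like the Monte Carlo runs discussed above; the second is a data-dependent recurrence where iteration k needs x[k-1], which is exactly what maps poorly to a GPU:

```c
/* Build with -fopenmp (GCC/Clang) or /openmp (MSVC). */
#include <math.h>

void independent(double *y, const double *a, int n)
{
    #pragma omp parallel for   /* every element is free-standing: perfect scaling */
    for (int i = 0; i < n; ++i)
        y[i] = sin(a[i]) * cos(a[i]);
}

void dependent(double *x, int n)
{
    for (int k = 1; k < n; ++k)   /* serial chain: x[k] needs x[k-1] */
        x[k] = 0.5 * x[k - 1] + 1.0;
}
```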
BMNify - Monday, March 17, 2014 - link
As a scientist, if you can't program an optimal assembly routine from your individual C routines (as per the optimal x264 coding style of using assembly with a C fallback and check), then at least look at using the far more optimal http://julialang.org/ in place of MATLAB to increase the data throughput of all your algorithms, and upstream all your speed/quality improvements to that open code base.
I like the F@H shoutout. There are certainly more than "a few" users running 4P setups. I'd put it at about 50 to 200 users, based on the statistics for how many users are producing at the 500k PPD level most commonly attained with these setups. Many of those users have multiple 4P boards as well.
It is not a trivial process to take full advantage of these systems with F@H. The user community has worked to select the ideal Linux kernels and schedulers for this software, as well as created custom utilities to improve hardware efficiency. TheKraken is a software wrapper that locks threads from the F@H client to specific CPUs to prevent excessive memory transfer between CPUs. Another user-created tool, OCNG, is a custom BIOS and software utility that allows Supermicro 4P G34 boards to overclock CPUs and adjust memory timings.
To get the full performance of 4P systems, F@H users needed to go much further than loading up Windows and running the provided executable designed for single-CPU systems.
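For illustration, this is roughly the kind of thing a pinning wrapper like TheKraken does; a sketch of the general technique on Linux, not the tool's actual code. Fixing each worker thread to one core stops the scheduler from migrating it across sockets and dragging its working set over the interconnect:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the given thread to exactly one CPU. */
static void pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);   /* exactly one allowed CPU */
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Usage from inside a worker: pin_to_core(pthread_self(), my_core_id); */
```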
patrickjchase - Thursday, July 4, 2013 - link
From looking at Ian's solver results, I think that there are actually (at least) two problems, and perhaps a third:
1. As he acknowledges, he isn't doing any sort of NUMA optimization.
2. His overall rates and the obvious sensitivity to DDR speed/latency indicate that he probably didn't do much cache-blocking (at its most basic level this involves permuting the order in which elements are processed in order to optimize data access patterns for cache). If that's the case then he would end up going out to DDR more than he should, which would make his code highly sensitive to the latency impacts of NUMA.
3. He may also have some cache-line sharing problems, particularly in the 3D case (i.e. cache lines that are accessed concurrently by multiple threads, such that the coherency protocol "bounces" them around the system). That's the most likely explanation for the absolutely tragic performance of the 4P system in that benchmark.
The importance of cache blocking/optimization can't be overstated. I've seen several cases where proper cache blocking eliminated the need for NUMA optimization. An extra QPI hop adds ~50% to latency in E5-based systems, and that can be tolerated with negligible performance loss if the application prefetches far enough ahead and has good cache behavior.
Ian, would you be willing to share the source code for one or more of your solvers?
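For readers unfamiliar with the term, here is what basic cache blocking looks like on a 2D finite-difference sweep; a generic sketch of the technique, not Ian's actual solver (whose source isn't published). The sweep is processed in B x B tiles so each tile stays resident in cache across the stencil's reuse instead of streaming whole rows from DRAM every pass:

```c
#include <stddef.h>

#define N 4096
#define B 64   /* tile edge; tune so a few B*B blocks of doubles fit in L2 */

void sweep_blocked(double (*restrict dst)[N], const double (*restrict src)[N])
{
    for (size_t ii = 1; ii + 1 < N; ii += B)
        for (size_t jj = 1; jj + 1 < N; jj += B)
            /* finish one tile completely before moving to the next */
            for (size_t i = ii; i < ii + B && i + 1 < N; ++i)
                for (size_t j = jj; j < jj + B && j + 1 < N; ++j)
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                        src[i][j - 1] + src[i][j + 1]);
}
```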
patrickjchase - Thursday, July 4, 2013 - link
One additional question for Ian: you state that your finite-difference solvers use "2^n nodes in each direction".
Does this mean that the data offsets along the major axis (or axes in the 3D case) are also integer multiples of a large power of 2? For example, if you have a grid implemented as a 2D array named 'foo', what is the offset in bytes from foo[0][0] to foo[1][0]?
If those offsets have a large power-of-2 factor, then that would lead to pathological cache behavior and would explain the results you're getting. Experienced developers know to pad such arrays along the minor axis or axes. For example, if I wanted to use a 1024 x 1024 array, I might allocate it as 1024 x 1056 instead. The purpose of the extra 32 elements along each row is to ensure that consecutive rows don't contend for the same cache line.
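A tiny sketch of that padding trick, using the illustrative numbers from the comment above: the nominal 1024-wide grid is stored with a 1056-element row stride, so row-to-row offsets are 8448 bytes rather than a large power of two, and vertically adjacent nodes stop competing for the same cache sets:

```c
enum { ROWS = 1024, COLS = 1024, STRIDE = 1056 };  /* 1056 = 1024 + 32 pad */

/* Only columns 0..COLS-1 hold data; the 32 trailing doubles per row are
   never touched and exist purely to break the power-of-two row offset. */
static double grid[ROWS][STRIDE];
```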
0ldman79 - Thursday, July 4, 2013 - link
Guys... he has access through Terminal Services.
Exactly how is he going to test video cards and install Linux, hmm?
Nice article, though I confess I will have to read it again after a good night's sleep.
dealcorn - Thursday, July 4, 2013 - link
Supermicro IPMI solutions are generally well regarded and support remote OS installs. I assumed prior familiarity drove OS selection, for better or worse.
loki1725 - Thursday, July 4, 2013 - link
Really interesting article. I've written several implementations of finite-difference solvers, and used both COTS and open-source solvers for parallel machines. I'm really surprised by the results, but I really agree with the conclusion: if you don't write your software appropriately, you won't take advantage of the hardware at your disposal.
I know it's outside the scope of this article, but I would be really interested to see a comparison of this 4-processor machine to a 'cluster' of two dual-socket machines. Ideally it would be awesome to see 2 Sci Linux clusters, one with 4 dual-socket Xeon systems and 1 with 2 quad-socket Xeon systems. Put the same amount of RAM per core in both rigs and run computational benchmarks. When it comes down to purchasing hardware for a large cluster, looking for the price and performance break point is important. I would imagine that having more threads per machine would be faster than having to run your data over InfiniBand (or something like it).
mapesdhs - Thursday, July 4, 2013 - link
Ian, do you have any idea how your code or these tests might run on an SGI UV 20 or 2000, given they have a hardware MPI system and other features to aid with NUMA systems? The UV 20 is a quad-socket blade with up to 1.5TB RAM, while the 2000 scales to 256 sockets and up to 64TB RAM. They both use the Xeon E5-4600 series.
Maybe you could ask SGI if you could do a remote access test on one of their UVs?
Ian.
wishgranter - Saturday, July 6, 2013 - link
Hi all, lately we did some tests on our photogrammetric software, and we stumbled on performance issues with Win2012 Datacenter edition on our dual-Xeon setups ( http://www.agisoft.ru/forum/index.php?topic=1330.0 ). In short, in Win2012 something is not OK with the software's performance; if we run the same test on Win7 or XP, the same hardware is much faster, up to 70% (Hyper-Threading stuff). Could we put together a more in-depth benchmark/problem-solving article? This could help a lot of people with real-world app usage.
lyeoh - Saturday, July 6, 2013 - link
It's normally silly to use such systems for embarrassingly parallel problems. With those problems you should use multiple far cheaper computers and get more performance for the $$$.
These sorts of systems are for "scale vertically" problems.
alpha754293 - Monday, July 8, 2013 - link
You know that there are commercial codes written in MPI available for you to test with as well. And there are a few free ones too.
Although you are right: the transition from 2P to 4P is not as simple and straightforward as the transition from 1P to 2P.
jamesgor13579 - Tuesday, July 9, 2013 - link
In the real world, any heavily threaded computing workload wouldn't be running on Windows. There is a reason that large supercomputers use Linux: it's much better at handling large NUMA systems.
kgbogdan - Thursday, July 11, 2013 - link
In the future can you please try Linux? I think Linux can do a far better job than Windows. The MS Windows Server environment is not that suitable for such benchmarks. And usually for a 4P server you use the Enterprise edition, not Standard. Sorry, this is just advice, not mandatory, but please try Linux.
deadrats - Thursday, July 11, 2013 - link
A big raspberry has to go to Ian Cutress himself for coming out of the idiot closet with such reckless abandon, and to Anand for hiring his worthless ass to write for his site.
I want to know what kind of crack one needs to be smoking to get their hands on a 4-way hyperthreaded octo-core setup and decide to run a benchmark that uses 720p MPEG-2 as its source and 4 Mb/s 720p x264 with the "very fast" setting (I think that's the one they use) as the target?!?
If you really wanted to stress all the logical cores, a custom benchmark should have been used with all of the x264 settings maxed out and a much higher resolution, maybe even a 4K source, so that we can see some separation between multi-CPU and single-CPU setups.
Seriously, who in their right mind would build this kind of system and then encode 4 Mb/s 720p AVC?
get your head out of your Klavin and learn how to review a damn system.
Rooftop Voter - Saturday, July 13, 2013 - link
Here I come with the stupid question: what does the CP stand for after some of the AMD benchmarks??
patrioteagle07 - Monday, July 15, 2013 - link
Linux... use Linux. This is pointless with your usual suite.
And you mentioned F@H, yet you failed to run it...
Cinebench also scales pretty well...