nutgirdle - Tuesday, September 11, 2012 - link
At my workplace we have a fairly well-developed MPI/OpenMP environment. We've dabbled with a Tesla card, but we would like to avoid rewriting everything in OpenCL. Even then, we don't know how long nVidia will support OpenCL.
Excited to see if/when this will actually be released and, since ours is a single-precision application, whether it can hold a candle to the ridiculous speed the K10 cards are exhibiting.
IanCutress - Wednesday, September 12, 2012 - link
I migrated my Brownian motion SP code from OpenMP to CUDA quite easily and got a 375x speedup over a single Nehalem core using a GTX 480. Tbh, the code was only 1000 lines at most and was easier to port than expected.
boeush - Tuesday, September 11, 2012 - link
"So it seems that the Xeon Phi cards consume about 300W."Each? Or 4 of them put together? Because if that's per-card, I'm not very impressed considering,
"For comparison, a quadcore Haswell at 4 GHz will deliver about one fourth of that in 2013."
For 300W, you can put together on the order of 10 Haswell quad-cores! That'd give you about 2.5x the max theoretical performance for the same wattage as the Xeon Phi (and, I'd imagine, for a fraction of the cost as well...)
JohanAnandtech - Tuesday, September 11, 2012 - link
Very valid points. However, I don't have any measurements or real benchmarks yet. The 300W is, to my understanding, the upper limit. The last time I tested, Linpack could make a CPU consume 30-35% more than a typical integer application, both running at 100% CPU load.
cmikeh2 - Tuesday, September 11, 2012 - link
While you probably could put together 10 Haswell quad-cores at around 300 W, I doubt they would be running at 4 GHz.
ArCamiNo - Tuesday, September 11, 2012 - link
Do you have more info about the 2 GHz frequency? It seems very high for that kind of chip. Maybe the 1 TFLOPS in double precision can be achieved with an FMA instruction (counted as 2 floating point operations):
1 GHz * (512 bits / 64 bits = 8 doubles per vector) * 64 cores * 2 ops per cycle = 1.024 TFLOPS
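For what it's worth, a minimal sketch in C of that peak arithmetic, covering the clock and FMA variants floated in this thread (all the inputs are this thread's guesses, not confirmed specs):

    #include <stdio.h>

    /* Theoretical peak DP FLOPS = clock * cores * SIMD lanes * ops per lane per cycle.
       The 512-bit vectors, 64 cores, and FMA factor are the speculative figures
       from this thread, not confirmed Xeon Phi specs. */
    static double peak_tflops(double ghz, int cores, int vector_bits, int ops_per_cycle) {
        int dp_lanes = vector_bits / 64;                         /* 512-bit vector -> 8 doubles */
        return ghz * cores * dp_lanes * ops_per_cycle / 1000.0;  /* GFLOPS -> TFLOPS */
    }

    int main(void) {
        printf("1 GHz, FMA:    %.3f TFLOPS\n", peak_tflops(1.0, 64, 512, 2)); /* 1.024 */
        printf("2 GHz, no FMA: %.3f TFLOPS\n", peak_tflops(2.0, 64, 512, 1)); /* 1.024 */
        printf("2 GHz, FMA:    %.3f TFLOPS\n", peak_tflops(2.0, 64, 512, 2)); /* 2.048 */
        return 0;
    }

Either a 1 GHz chip with FMA or a 2 GHz chip without it lands on the quoted 1 TFLOPS.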
codedivine - Tuesday, September 11, 2012 - link
Agree with you.
1008anan - Tuesday, September 11, 2012 - link
I also thought it was 16 double precision flops per core per cycle, or 4 double precision flops per thread per clock, at 1 GHz. I am surprised to learn that there are only 8 double precision flops per core per clock, or 2 double precision flops per thread per clock.
Running a 64 CPU core SoC at 2 gigahertz is astounding.
djgandy - Wednesday, September 12, 2012 - link
I'd agree. Peak rates are usually quoted using FMA.
At 2 GHz I'd expect 2 TFLOPS too, or 4 TFLOPS in 32-bit, which would be consistent with Larrabee numbers.
1008anan - Tuesday, September 11, 2012 - link
Johan De Gelas, thank you very much for your article.
Aren't some of the 64 cores disabled? Previously it was reported that there would be between 51 and 64 CPU cores per Xeon Phi SoC.
Are you sure there are only 8 double precision flops per core per clock, i.e. 2 double precision flops per thread per clock? If so, the theoretical max assuming 64 cores at 2 GHz is:
[2 GHz] * [64 cores * 8 flops/core = 512 flops/clock] = 1.024 double precision teraflops. Actual performance is always below theoretical performance, so you would need quite a bit more than 2 GHz to hit a sustained 1 teraflop double precision.
2 GHz is confirmed? Pretty amazing for a 64-core SoC.
How many single precision flops per core per clock?
Please confirm that a 64-core Xeon Phi SoC only has a TDP of 75 watts. Can that be right?
djgandy - Wednesday, September 12, 2012 - link
Intel's SIMD units are designed in such a way that throughput scales inversely with precision, so FP64 runs at half the rate of FP32, which is half of FP16.
Theoretical performance is what gets quoted for any card; the performance is there, you just have to use it :)
Your numbers are also not quite correct: you left out the 2 ops/clock from FMA, which makes the peak 2 TFLOPS.
1008anan - Wednesday, September 12, 2012 - link
This is what I thought, djgandy.
Can someone confirm that there are 16 double precision flops per core per cycle? Or 4 double precision flops per thread per cycle?
This would mean that at 2 GHz:
2 GHz * (64 cores) * (4 threads/core) * (4 double precision flops per thread/clock) = 2.048 double precision teraflops theoretical maximum speed.
Actual speed would be quite a bit less than 2 teraflops double precision. If we assume 70% efficiency [completely pulled out of nothing], we would get about 1.4 teraflops double precision.
Is there confirmation that this is true (aside from the efficiency estimate since I doubt Intel has released that information yet)? Is there also confirmation that each Xeon Phi SoC only has a TDP of 75 watts? If so that is astounding.
This would mean that the whole system delivers:
1.4 double precision teraflops / 75 watts ≈ 19 double precision gigaflops/watt.
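A tiny sketch of that efficiency arithmetic (the 70% factor and the 75 W TDP are guesses from this thread, not published figures):

    #include <stdio.h>

    int main(void) {
        double peak_tflops = 2.048;  /* speculative 2 GHz * 64 cores * 8 lanes * 2 (FMA) */
        double efficiency  = 0.70;   /* sustained/peak ratio, pulled out of nothing above */
        double tdp_watts   = 75.0;   /* unconfirmed per-SoC TDP */
        double sustained_tflops = peak_tflops * efficiency;  /* ~1.43 TFLOPS */
        printf("sustained: %.2f TFLOPS\n", sustained_tflops);
        printf("perf/watt: %.1f GFLOPS/W\n", sustained_tflops * 1000.0 / tdp_watts); /* ~19 */
        return 0;
    }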
Can this be right?
ArCamiNo - Thursday, September 13, 2012 - link
The threads don't add any peak flops performance; they're there only to get closer to that peak. 4 threads per core means there are 4 complete sets of registers in each core. For example, if the currently executing thread doesn't use all the units of the core, another thread can use them. Two threads can't use the same resource at the same time, but since the resources per core (number of ALUs, FPUs, decode and dispatch units, etc.) stay the same, they get used more efficiently.
So for me it's a 1 GHz (maybe a little more) chip with 64 cores. Each core could run a fused multiply-add instruction (like on the future AVX2 instruction set of Haswell). That means 2 operations per cycle on 512 bits (so 8 double precision floats), i.e. 1 TFLOPS double precision peak performance (1 GHz * 2 * 64 * 8).
So maybe the frequency is a little more than 1 GHz to achieve the 1 TFLOPS double precision on LINPACK like they said. But with this kind of architecture and the 4 threads per core, the real performance won't be far from the theoretical peak, unlike on a GPU, where it's about 60%.
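To make the FMA factor of 2 concrete, here is the kind of multiply-add loop that Linpack-style kernels boil down to; whether the compiler actually emits fused multiply-adds for it depends on the target and flags, so treat this as an illustration, not Xeon Phi-specific code:

    #include <stddef.h>

    /* y[i] = a * x[i] + y[i]: one multiply plus one add per element.
       On FMA hardware this pair becomes a single fused instruction, so each
       vector lane can retire 2 flops per cycle - the factor of 2 in the
       peak-FLOPS formulas above. */
    void daxpy(size_t n, double a, const double *x, double *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

A vectorizing compiler maps the loop body onto 512-bit vectors (8 doubles at a time), and the 4 threads per core then exist mainly to keep that vector unit fed.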
ArCamiNo - Thursday, September 13, 2012 - link
I just checked on my Ivy Bridge processor, and I can reach the theoretical performance peak with the Intel Linpack Benchmark (http://software.intel.com/en-us/articles/intel-mat... I get 82 GFLOPS in double precision. The theoretical peak is 8 double precision flops per core per cycle; at 2.6 GHz (3720QM) on 4 cores that's 2.6 * 4 * 8 = 83.2 GFLOPS.
So I'm now pretty sure it will be the same with Xeon Phi, and that the frequency will be 1 GHz.
I don't think the power consumption will be only 75W per card. If you subtract the power for the RAM, that would mean around 1 watt per core, which is the power consumption of an ARM core. I think it's more like 3-4 watts per core.
JohanAnandtech - Monday, September 17, 2012 - link
Seems like we are not far from the mark.
1 core has several threads, but that is just to keep the flow going. For FLOPs, you should focus on the vector unit, not the pipeline or threads. So each vector unit can do 8 DP, not 16. The core is around 2 GHz.
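(If so, the peak works out to roughly 2 GHz * 64 cores * 8 DP flops/clock = 1.024 double precision TFLOPS, which lines up with the 1 TFLOPS figure discussed above.)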
We reported 300W per card, and Charlie is reporting about 200W at idle. So 300W when maxed out seems very reasonable to me.
http://semiaccurate.com/2012/09/14/hard-numbers-fo...
Right?