tipoo - Wednesday, November 14, 2012 - link
I wonder if we'll ever have more numerous, smaller cores like these working in conjunction with larger traditional cores. A bit like the PPE and SPEs in the Cell processor, with the more general core offloading what it can to the smaller ones.

A5 - Wednesday, November 14, 2012 - link
That's called heterogeneous computing. It's definitely where things are going in the future, and you can argue that it's already here with Trinity.

nevertell - Wednesday, November 14, 2012 - link
The great thing about the Cell was that both the PPE and the SPEs had access to the same memory. Trinity doesn't, and while that may be because there isn't an OS that would take advantage of it, hardware is only as capable as the software written to exploit that exact hardware.

There is no need for major parallelism in the consumer space, since nobody is willing to rewrite their programs to run on something faster, while the general public is already served well enough by a Core i3 or i5.
name99 - Friday, November 16, 2012 - link
"The great thing about the Cell was that both the PPE and the SPEs had access to the same memory."Hmm. This is not a useful statement.
Cell had a ludicrous addressing model that was clearly irrelevant to the real world. It's misleading to say that the cores had access to "the same memory". The way it actually worked was that each SPE had a small local store with its own local address space (256 KB, so an 18-bit local address space, if I remember right), and almost every instruction operated in that local address space. There were a few special-purpose operations that moved data between that local address space and the global address space. Think of it as programming with 8086 segments, except you have only one data segment (no ES, no SS), you can't easily swap DS to access another segment, and the segment size is substantially smaller than 64K.
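For anyone who never touched the thing, here's a minimal sketch of what that data movement looked like from the SPE side, going from memory of the Cell SDK's spu_mfcio.h interface (the buffer size, tag number and function name are illustrative):

```c
#include <spu_mfcio.h>

#define TAG 3

/* staging buffer in the SPE's 256 KB local store */
static volatile char buf[16384] __attribute__((aligned(128)));

void fetch_from_main_memory(unsigned long long ea)
{
    /* explicit DMA from the global (effective) address space into local store */
    mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);

    /* block until the tagged transfer completes */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    /* only now can ordinary loads/stores see the data */
}
```

Every access to main memory had to be staged like this; ordinary loads and stores only ever saw the local store.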
Much as I dislike many things about Intel, more than anyone else they seem to get that hardware that can't be programmed is not especially useful. And so we see them utilizing ideas that are not exactly new (this design, or transactional memory) but shipping them in a form that's a whole lot more useful than what went before.
This will get the haters on all sides riled up, but the fact is --- this is very similar to what Apple does in their space.
dcollins - Wednesday, November 14, 2012 - link
That's exactly how this supercomputer, and all supercomputers offering accelerated compute, work. Xeon or Opteron CPUs handle complex branching tasks like networking and work distribution, while the accelerators handle the parallelizable problem-solving work.

Merging them onto a single die is simply a matter of having enough die space to fit everything while making sure the economics of a single chip beat those of separate products.
tipoo - Wednesday, November 14, 2012 - link
*in consumer computing I mean.

Gigaplex - Wednesday, November 14, 2012 - link
Both AMD Fusion and Intel Ivy Bridge support this right now. The software just needs to catch up.

tipoo - Wednesday, November 14, 2012 - link
Sort of, I suppose, but I think something like this would be easier to use for most compute tasks for the reasons the article states: these are still closer to general processor cores than GPU cores are.

frostyfiredude - Wednesday, November 14, 2012 - link
Something like ARM's big.LITTLE in a sense seems like a good idea to me. I'm not sure how feasible it is, but having one or two small Atom-like cores paired with larger and more complex Core processing cores, all sharing the same L3, sounds like a decent idea for mobile CPUs to cut idle power use. My guess is the two types of cores would need to share the same instruction set, so the differences would be things like out-of-order vs. in-order execution, execution width, and designs targeting low vs. high clock speeds. The Atom SoCs can hit power use around that of ARM SoCs, so if Intel can get that kind of super low power use at low loads and ULV i7 performance out of the same chip when stressed, that'd be super killer.

CharonPDX - Thursday, November 15, 2012 - link
One rumor I had heard when Larrabee got cancelled and turned into Knights Ferry was that this technology might be released as a coprocessor that used the same socket as the "main" Xeon.

The idea was that you could mix and match them in one system. If you wanted maximum conventional performance, you put in 8 conventional Xeons. If you wanted maximum stream performance, you'd put in one "boot" conventional Xeon and 7 of these. (At the time, there were also rumors that Itanium was going to be same-socket-and-platform, which now looks like it will come true.)
Kevin G - Saturday, November 17, 2012 - link
That rumor has a grain of truth to it. A slide deck about Larrabee from Intel indicated a socketable version fitting into a quad-socket Xeon motherboard. This was while Intel still had consumer plans for Larrabee, which have since radically changed.

Source:
http://arstechnica.com/gadgets/2007/06/clearing-up...
alpha754293 - Wednesday, November 14, 2012 - link
I don't even know what generally (and publicly) accessible programs are available that you would be able to use to do this sort of HPC testing.

OpenMP code is sort of "easier" to come by. A program that has both an OpenMP and a CUDA version where it's a straight port - I can't even think of one.
The only one that might be a possibility would be Ansys 13/14, because they do have some limited static structural/mechanical FEA capabilities that can run on the GPU, but I don't know how you'd be able to force it onto the Xeon Phis.
Hmmm....
TeXWiller - Wednesday, November 14, 2012 - link
The next version of OpenMP should have accelerator support via the OpenACC scheme. I'd bet that most engineering applications will be able to support most accelerators like Phi, Tesla and APUs in a transparent manner simply through the math libraries, perhaps not in the most optimal way but at least in a sufficiently worthwhile one.
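For readers who haven't seen the directive style being discussed, a rough sketch in OpenACC syntax (the accelerator directives later added to OpenMP look broadly similar); the function and array names are just placeholders:

```c
/* The serial loop stays intact; the pragma asks the compiler to offload
   it to whatever accelerator the toolchain targets (Phi, Tesla, APU...). */
void scale(int n, float a, float *restrict x)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= a;
}
```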
rad0 - Wednesday, November 14, 2012 - link

One thing I've yet to understand about the Xeon Phi is: do you get to run anything you want on it, or not?

Could you run Oracle's JVM (or any other JVM) on it? I know HPC isn't all that interested in Java, but a cheap 60-thread Java machine would be very interesting to play with.
Can you just ssh into the embedded linux and run anything you want?
coder543 - Wednesday, November 14, 2012 - link
Why Java? A dozen negative adjectives pop into my mind at the mere mention of the word outside of a coffee shop.

madmilk - Wednesday, November 14, 2012 - link
You can probably run Java on it, but it will not run well. Most Java code is application code - very branchy, something the Phi's memory architecture cannot handle well. The JVM certainly will not vectorize code either, so you have all those vector units being wasted.

This is really much closer to a GPU in terms of the kind of optimizations that must be done for performance, even if the underlying instruction set is x86.
Jaybus - Thursday, November 15, 2012 - link
No, it is much closer to a CPU than a GPU. This is an area where it differs VASTLY from a GPU. In fact, the cores are CPUs.

llninja1 - Thursday, November 15, 2012 - link
According to Tom's Hardware, you can log in to the Xeon Phi card and get a command line prompt:

http://www.tomshardware.com/reviews/xeon-phi-larra...
So that implies you can do whatever you want with some finagling. Whether your 60-thread JVM idea would work well or not on this architecture remains to be seen.
extide - Wednesday, November 14, 2012 - link
Do some Folding@Home benchmarks on a Phi if at all possible!

Thanks!
tipoo - Wednesday, November 14, 2012 - link
Like the people in charge of F@H would develop and release a new folding core so that it could run on one of these, on the off chance some enthusiast has one of these multi-thousand-dollar cards and a computer system that can run it?

Not going to happen. This isn't a general CPU core that any existing software can run on, nor is it aimed at home users.
SodaAnt - Wednesday, November 14, 2012 - link
It does support the x86 instruction set though, so it shouldn't be too hard to port.

MrSpadge - Wednesday, November 14, 2012 - link
But you have to use the custom vector format to stampede anything.

Kevin G - Saturday, November 17, 2012 - link
In theory it should run the current Linux version of F@H without modification. The catch is that the current version is going to be horribly suboptimal, as it doesn't natively support the 512-bit-wide vector format used by the Xeon Phi. This would leave only the x87 FPU for calculations. That would allow the 60 scalar FPUs to be used but limit performance to a mere 60 GFLOPS across all the cores. There may also be some weird scheduling oddities with Linux and/or F@H due to the chip's ability to expose 240 logical processors to the host OS (the result would be better performance from running multiple instances in parallel instead of one large instance using 240 threads).

An OpenCL version of F@H might be coaxed into working, and that would utilize the 512-bit vector units. Intel would have to have OpenCL drivers available for this to even have a chance of working. This would allow the full ~1 TFLOPS performance to be utilized.
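A back-of-the-envelope sketch of where those two figures come from, assuming roughly 60 cores at ~1.1 GHz (the shipping parts vary slightly in core count and clock):

```c
#include <stdio.h>

int main(void)
{
    const double cores = 60, ghz = 1.1;

    /* scalar x87 path: roughly 1 FLOP per core per cycle */
    printf("scalar: ~%.0f GFLOPS\n", cores * ghz * 1.0);

    /* 512-bit vector path: 8 double-precision lanes x 2 FLOPs (FMA) per cycle */
    printf("vector: ~%.2f TFLOPS DP\n", cores * ghz * 8.0 * 2.0 / 1000.0);

    return 0;
}
```

That works out to roughly 66 GFLOPS scalar versus about 1.06 TFLOPS double precision with the vector units in play.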
SydneyBlue120d - Wednesday, November 14, 2012 - link
Why did Intel choose a custom SIMD format? Why not AVX?

Jaybus - Thursday, November 15, 2012 - link
Because they needed heavier-duty vector units. Each Phi core has 32 512-bit registers, whereas Core i7 has 16 256-bit registers. They just didn't implement the backward compatibility, probably to reduce complexity. It is certainly possible to do, and we may indeed see AVX, SSE, etc. added in a future revision.

Kevin G - Saturday, November 17, 2012 - link
The 512-bit vector instructions change how exceptions and register masking are handled in comparison to AVX. Outside of that, the vector instructions are formatted similarly to AVX instructions and the output complies with IEEE floating point standards. So while there is a distinct break in ISA capabilities, it does appear possible to bridge the two together in future designs. Still, it is odd that Intel has forked its ISA.
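To make the masking difference concrete, here's a sketch using the intrinsic names that later became standard AVX-512; the Phi's own (IMCI) intrinsics follow the same per-lane-mask idea but are not name- or binary-compatible, so treat this purely as an illustration:

```c
#include <immintrin.h>

/* y[i] = a*x[i] + y[i], but only in the 16 single-precision lanes whose
   mask bit is set; masked-off lanes keep their old y value. Per-lane
   write masking like this is the headline difference from plain AVX. */
void masked_saxpy16(float a, const float x[16], float y[16], unsigned short k)
{
    __mmask16 m  = (__mmask16)k;
    __m512    va = _mm512_set1_ps(a);
    __m512    vx = _mm512_loadu_ps(x);
    __m512    vy = _mm512_loadu_ps(y);
    __m512    r  = _mm512_mask_add_ps(vy, m, vy, _mm512_mul_ps(va, vx));
    _mm512_storeu_ps(y, r);
}
```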
coder543 - Wednesday, November 14, 2012 - link

I just want to know how much it will cost. Why is Intel keeping this such a ridiculous secret? Knowing Intel, these will easily be $2,000+ apiece, if not much higher, but I still want to *know.*
LogOver - Wednesday, November 14, 2012 - link
Did you read the article at all? Check the second page again.

Comdrpopnfresh - Wednesday, November 14, 2012 - link
How could PCIe 3.0 result in more overhead?

nutgirdle - Wednesday, November 14, 2012 - link
I concur. A major disadvantage of co-processor computing is the time it takes to move data on and off the card. The PCIe 2.0 bus is already a bottleneck in our workflow involving a Tesla card. This was a very short-sighted omission.

Kevin G - Saturday, November 17, 2012 - link
It is dependent upon the bus encoding. PCI-E 1.0/2.0 use an 8b/10b encoding scheme to handle traffic, while PCI-E 3.0 uses 128b/130b encoding. PCI-E 3.0 only increases the signaling rate of the bus by 60% (5 GT/s to 8 GT/s), with the rest of the bandwidth increase stemming from the more efficient encoding scheme. Xeon Phi seems to have kept the PCI-E 1.0/2.0 encoding but supports the higher clock rate of PCI-E 3.0. This appears to be nonstandard, but the LGA 2011 Xeons appear to support it for additional bandwidth.

Any overhead would likely come from adding full PCI-E 3.0 support in addition to PCI-E 1.0/2.0.
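The per-lane arithmetic behind that, using the standard PCI-E signaling rates (which mode the Phi actually runs in is taken from the article, so treat the third line as an assumption):

```c
#include <stdio.h>

int main(void)
{
    /* per-lane bandwidth = signaling rate * encoding efficiency / 8 bits per byte */
    double gen2   = 5.0e9 * (  8.0 /  10.0) / 8;  /* ~500 MB/s: PCI-E 2.0         */
    double gen3   = 8.0e9 * (128.0 / 130.0) / 8;  /* ~985 MB/s: PCI-E 3.0         */
    double hybrid = 8.0e9 * (  8.0 /  10.0) / 8;  /* ~800 MB/s: 3.0 clock, 8b/10b */

    printf("PCI-E 2.0:          %.0f MB/s per lane\n", gen2   / 1e6);
    printf("PCI-E 3.0:          %.0f MB/s per lane\n", gen3   / 1e6);
    printf("3.0 clock + 8b/10b: %.0f MB/s per lane\n", hybrid / 1e6);
    return 0;
}
```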
mayankleoboy1 - Thursday, November 15, 2012 - link
Assuming you can buy a single Xeon Phi card, can it work in desktop motherboards and with desktop processors? Can it work with AMD processors? Can it work in tandem with Nvidia and ATI GPUs?
Joschka77 - Thursday, November 15, 2012 - link
I think the answers would be: yes, yes, no! ;-)

Jaybus - Thursday, November 15, 2012 - link
No, it is yes, yes, and yes. The Stampede also uses 128 NVIDIA Tesla K20 GPUs, as stated in the article.

Kevin G - Saturday, November 17, 2012 - link
That's within the cluster, not necessarily in the same host system. I strongly suspect that the visualization nodes featuring K20 GPUs are isolated from the Xeon Phi nodes.

maximumGPU - Thursday, November 15, 2012 - link
Can't deny that OpenMP code that automatically runs faster on the Phi would represent a great solution for those looking for the speedup without the cost and time of modifying code for GPUs. There certainly is a market to cater to with these cards.
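That's the appeal in a nutshell: a plain OpenMP loop like the sketch below can, in principle, be rebuilt for the card with a compiler switch (the Intel compiler's native MIC build, if memory serves) rather than rewritten in CUDA or OpenCL:

```c
#include <omp.h>

/* unmodified OpenMP: the same source that runs on the host Xeons */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```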
creed3020 - Thursday, November 15, 2012 - link

Johan,

The numbers you describe for the configuration of the units don't add up.

Eight of the compute sleds plus two PSU sleds cannot fit into a 4U unit. Judging by the photo, it appears that in this vertical configuration of nodes it goes something like this: {Compute, Compute, PSU, PSU, Compute, Compute} per 4U. It appears this leads to a total of 5 chassis per cluster within the rack.

This is then compounded with two distinct clusters per rack, each with its own InfiniBand switch + regular Ethernet switch. This makes for a total of 10 C8000 chassis per rack.

This makes sense when considering a 48U rack. 22U per cluster x 2 clusters per rack = 44U, with a few spare slots at the top and middle. I can see two at the top and one in the middle.
llninja1 - Thursday, November 15, 2012 - link
I think the author got some Dell facts mixed up. Looking at Dell.com, you can fit 8 compute sleds in, but those compute sleds are half-width and don't contain the necessary double-wide PCIe slots to accommodate a Xeon Phi card. So in a single 4U unit, you are correct that it can only hold 4 compute sleds and two power units, as depicted in the Stampede picture.

creed3020 - Friday, November 16, 2012 - link
Thanks for the explanation. I didn't go over to the Dell site, but it would explain that a slimmer sled is possible if you don't have to stick one of these huge two-slot Xeon Phi cards in.

It makes me wonder how big a difference there is in total PFLOPS/rack when configured with the half-height sleds vs. the full-height sleds with Phi.
mfilipow - Thursday, November 15, 2012 - link
"Eight of those server sleds find a home inside the C8000 4U Chassis, together with two power sleds." did you write - but on the photo I only see 4x computing sleds plus 2x power. Where are the other 4?!creed3020 - Friday, November 16, 2012 - link
There are only 4 per row in the chassis because these units in Stampede feature the Xeon Phi card, which requires a bigger sled. The author mixed up the chassis's maximum specs with the way the units are actually configured for this supercomputer.

GullLars - Thursday, November 15, 2012 - link
So, it seems these are great at general-purpose supercomputing.

How do they stack up against the latest FPGAs if those are set up carefully by the people who will be running a specialized problem on them?

And would these be able to work effectively with offloading of some key functions that could run 20-100x faster (or more power-efficiently) on a carefully configured FPGA?
Some people in the comments mentioned heterogeneous computing. A step on the way is modular accelerated code. I'm interested to see if we get more specialized hardware for acceleration in the coming years, not just graphics (with transcoding) and encryption/decryption as is common in CPUs now. Or if we get an FPGA component (integrated or PCIe) that can be reserved and set up by programs to realize huge speedups or power savings.
Jameshobbs - Monday, November 19, 2012 - link
Why have there not been a lot of reports regarding the PCI Express interface? This was the first source I was able to find that even mentions the speed of the PCIe bus for the Xeon Phi.

One of the most challenging things about programming for accelerators is handling the PCI Express bus and trying to balance data transfer with computational complexity. Everyone (NVIDIA, Intel, AMD) seems to be doing a lot of arm waving regarding this issue, and there are many GPU papers that tend to omit the transfer times from their results. To me this is dishonest and amounts to cheating.
One thing that continues to shock me as well is that people keep complaining about how difficult it is to debug a GPU program and then reference old, out-of-date sources such as http://lenam701.blogspot.be/2012/02/nvidia-cuda-my... which was mentioned above. The things that the author of that blog post complained about have been resolved in the latest versions of CUDA (from 4.2 onward... maybe even in 4.0).
Programmers can now use printf, and it is possible to hook a debugger into a GPU application to do more in-depth debugging. The main thing that bothers me about GPU programming is that you must explicitly check whether a kernel has completed successfully or not. Other than that I find it relatively easy to debug a GPU application.
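For what it's worth, a minimal sketch of both of those points (in-kernel printf plus the explicit completion check) using the standard CUDA runtime API; the kernel itself is just a placeholder:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubler(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = 2.0f * x[i];
        if (i == 0) printf("thread 0 sees x[0] = %f\n", x[0]);  /* in-kernel printf */
    }
}

int main()
{
    const int n = 1024;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    doubler<<<(n + 255) / 256, 256>>>(x, y, n);

    /* launch errors show up immediately... */
    cudaError_t err = cudaGetLastError();
    /* ...but runtime errors only once the kernel has actually finished */
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```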
MySchizoBuddy - Wednesday, November 21, 2012 - link
The next version of the AMD APU will allow both the GPU and CPU access to the same memory locations.

sheepdestroyer - Wednesday, December 5, 2012 - link
I would really like to see a benchmark of this CPU on LLVMpipe:

http://www.mesa3d.org/llvmpipe.html
The original Larrabee would have had a DirectX translation layer, and this project could be seen as an OpenGL version of it.
Just loading a distro with Gnome 3 running on LLVMpipe or benchmarking some ioq3 and iodoom3 games would be VERY interesting.
tuklap - Sunday, March 3, 2013 - link
Can this accelerate my normal PC applications, like rendering in AutoCAD/Revit, media conversion, STAAD, ETABS, etc.?

Or do I have to create my own applications?