46 Comments
TeXWiller - Monday, June 22, 2020 - link
I wouldn't call these accelerator cards. The PCIe lanes are there to connect to local IO and management, for example, not to a host CPU.
SarahKerrigan - Monday, June 22, 2020 - link
Indeed. There's no host - the entire software stack runs on the A64FX nodes themselves. This is a 100% CPU-only, accelerator-free system.
jeremyshaw - Monday, June 22, 2020 - link
I think the writer may have confused these with the NEC Aurora cards.
jeremyshaw - Monday, June 22, 2020 - link
Not overall, but in that moment. I'm assuming it's now edited away (to perfection)!
Ian Cutress - Monday, June 22, 2020 - link
Anything that isn't instantly called a CPU I auto-default to an accelerator card. My fault; it's been updated.
mode_13h - Monday, June 22, 2020 - link
It sounds like each chip has 4 cores that act kind of like a host.
eastcoast_pete - Monday, June 22, 2020 - link
Thanks Ian! Isn't one of the key differences between this new supercomputer and most others in the top 5 that it doesn't rely on GPU-like accelerators for its speed? That makes setups like Fugaku more broadly usable, at least as far as I know or was told.
mode_13h - Monday, June 22, 2020 - link
If SVE doesn't have inter-lane operations, then I don't see it being materially different from a GPU. To get good performance out of it, you're going to have to program it much like one.
eastcoast_pete - Monday, June 22, 2020 - link
Don't disagree, except that this setup can actually run programs that aren't specifically written for a GPU (or wide SVE), where accelerator-based systems often simply cannot. And, at least according to people who know this much better than I do, the need to have a program (and a problem) suited to limited routines (GPU-type or SVE) can be a real monkey wrench when you want a solution to a problem really quickly. Also, how long does it take to adapt a computational approach to an accelerator, even if it can be done? I can see how the time savings from using a supercomputer could frequently be eaten up by the delay in getting the program ready.
jeremyshaw - Monday, June 22, 2020 - link
The main advantage of this sort of layout is that the FPUs should have access to the CPU registers, and to the cache at the same speeds as the rest of the execution resources.
Also, hopefully less context-switching penalty (or failed-branch penalty) than a traditional GPU.
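To put a rough number on why avoiding the offload model matters, here is a back-of-envelope sketch. The bandwidth figures are assumptions (roughly PCIe 3.0 x16 for a discrete accelerator, and the commonly cited A64FX aggregate HBM2 bandwidth), not measurements:

```python
# Back-of-envelope: time to move 16 GB over an offload link vs. one
# streaming pass from on-package HBM2. All figures are rough assumptions.
PCIE3_X16_GBPS = 16.0   # ~16 GB/s each direction for PCIe 3.0 x16
HBM2_GBPS = 1024.0      # commonly cited A64FX aggregate HBM2 bandwidth

data_gb = 16.0          # working set shipped to a discrete accelerator

copy_s = data_gb / PCIE3_X16_GBPS   # just copying the data across PCIe
stream_s = data_gb / HBM2_GBPS      # one streaming pass from local memory

print(f"copy over PCIe: {copy_s:.3f} s")            # 1.000 s
print(f"HBM2 pass:      {stream_s:.4f} s")          # 0.0156 s
print(f"ratio:          {copy_s / stream_s:.0f}x")  # 64x
```

The point is not the exact numbers but the two orders of magnitude between them: if the data already lives next to the cores, the copy step disappears entirely.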
thetrashcanisfull - Monday, June 22, 2020 - link
But SVE *does* have inter-lane (shuffle/permute) instructions; how Fujitsu has implemented them (pipelined or in microcode) remains an open question, but SVE does support them.
name99 - Monday, June 22, 2020 - link
Of course SVE has inter-lane operations! Where do people pick up this nonsense?
Obviously it has the usual reductions, but it also has a variety of standardized shuffles that generalize length-agnostically (e.g. interleaves) AND the TBL (table lookup) instruction, which is the one (and only, as far as I know) non-length-agnostic instruction -- but it is there if you need generic permutes and are willing to code to that particular machine and nothing else.
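To make the distinction concrete, here is a toy scalar model (illustrative only, not Fujitsu's implementation) of the two kinds of inter-lane operation being discussed: ZIP1, an interleave whose definition works at any vector length, and TBL, a generic permute whose index pattern ties the code to one machine's vector length:

```python
# Scalar models of two SVE inter-lane operations, written with the vector
# length as a parameter, as in real SVE. Illustrative sketch only.

def zip1(a, b):
    """ZIP1: interleave the low halves of a and b -- length-agnostic."""
    vl = len(a)
    out = []
    for i in range(vl // 2):
        out += [a[i], b[i]]
    return out

def tbl(table, idx):
    """TBL: generic permute; out-of-range indices yield 0. Choosing the
    index vector is what ties code to a particular vector length."""
    vl = len(table)
    return [table[j] if j < vl else 0 for j in idx]

# With vl = 8 (a 512-bit vector of 64-bit lanes, as on A64FX):
a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
print(zip1(a, b))                        # [0, 10, 1, 11, 2, 12, 3, 13]
print(tbl(a, [7, 7, 0, 9, 1, 2, 3, 4]))  # [7, 7, 0, 0, 1, 2, 3, 4]
```

The real instructions (and their ACLE intrinsics) operate on hardware vectors, of course; the model just shows the lane semantics.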
mode_13h - Tuesday, June 23, 2020 - link
I said "if", based on the assumption that they were trying to do pure SIMD with it in order to compete with GPUs. It was literally the first word. No reason you can't see it there.
I think GPUs don't support inter-lane operations because they could involve rather a lot of silicon for large vectors. Not simple interleaving, but things like horizontal arithmetic and arbitrary shuffles.
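A small sketch of why horizontal arithmetic gets expensive at GPU widths: a horizontal sum takes log2(n) shuffle-and-add rounds, and each round has to move data across ever wider lane distances in silicon:

```python
# Tree-reduce a vector, counting the shuffle+add rounds required.
import math

def hsum_steps(lanes):
    """Horizontal sum via pairwise reduction; returns (total, rounds)."""
    v = list(lanes)
    steps = 0
    n = len(v)
    while n > 1:
        half = n // 2
        # one cross-lane shuffle pairs v[i] with v[i + half], then one add
        v = [v[i] + v[i + half] for i in range(half)]
        n = half
        steps += 1
    return v[0], steps

total, steps = hsum_steps(range(64))  # a 64-lane vector (power-of-2 width)
print(total, steps)                   # 2016 6
assert steps == math.log2(64)
```

Six dependent rounds for 64 lanes is cheap in instruction count but costly in wiring, which is exactly the trade-off being described.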
Dolda2000 - Monday, June 22, 2020 - link
You could certainly argue that the SIMD units are GPU-like, but yeah, it's a homogeneous architecture. It's really, really nice that they're even attempting that in the current year, and I really hope it works out well for them. I'm also curious what the practical differences really are vs. Xeon Phi (which on paper is quite similar, with 2x512-bit FMAs per core) and why they expect this to work out better. Better interconnect?
nft76 - Monday, June 22, 2020 - link
The nodes are thin by modern standards (only 3.4 TFLOPS and 32 GB memory), so the interconnect really has to be good. A large simulation will be spread over a huge number of nodes and there will be a lot of MPI communication.
mode_13h - Monday, June 22, 2020 - link
With regard to Xeon Phi: exactly. It seems to me they're basically counting on the benefits of a superior ISA, but it's still not going to beat GPUs at their own game.
thetrashcanisfull - Monday, June 22, 2020 - link
I think the interconnect, as well as the broader system topology, is a big part of it. I also think a more apt comparison might be to IBM's (now ancient) Blue Gene/Q processor: a many-core processor with integrated many-port networking, scaling to large node counts with an unswitched (peer-to-peer) network topology, and (relatively) low per-node power usage.
jeremyshaw - Monday, June 22, 2020 - link
Very cool. I find it interesting that Marvell isn't going the SVE route for the 4x128 FPU in the ThunderX3. Is there any specific reason for Fujitsu pursuing this? Or was this a flexible ISA extension given to (help?) move Fujitsu away from the dying (but open) SPARC ISA?
SarahKerrigan - Monday, June 22, 2020 - link
SVE gives Fujitsu access to features they had in their proprietary HPC-ACE ISA before; going back to NEON would have been a massive regression. I also expect that ThunderX4 is going to be SVE-capable; Marvell has said SVE support is likely coming in the future.
jeremyshaw - Monday, June 22, 2020 - link
I suppose that was the crux of my question: chicken or egg?
SarahKerrigan - Monday, June 22, 2020 - link
Fujitsu, who has its own compilers and profiling/optimization tools, starts; the rest will follow as the ecosystem develops. There are indications from EPI announcements that next-gen Neoverse is going to be SVE-capable too, for instance. Huawei's server CPU roadmap also includes an SVE-capable microarchitecture in the future.
mode_13h - Monday, June 22, 2020 - link
Probably because they realized that trying to beat GPUs at their own game is a fool's errand. See my other comment (below) about Fugaku's worse power efficiency compared with Summit.
SarahKerrigan - Monday, June 22, 2020 - link
Worse at Linpack, sure. Have you taken a look at the difference in HPCG? It was discussed a bit during the Top500 presentation this morning.
mode_13h - Monday, June 22, 2020 - link
Bravo for managing worse TFLOPS/W than a machine built on 3-year-old technology (18.13 vs. 19.89).
/s
But, of course this would be the case. General purpose CPUs are inherently less efficient than GPUs.
SarahKerrigan - Monday, June 22, 2020 - link
Nonetheless, A64FX systems are by far the most efficient CPU-only systems on the list. That's not half bad.
mode_13h - Monday, June 22, 2020 - link
Sure, AArch64 is a lot more efficient than x86-64, I'll grant them that.
Also, SVE >> AVX-512. So, that's another point in their favor.
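One concrete thing SVE has over fixed-width SIMD is its vector-length-agnostic loop structure: a predicate (roughly what the WHILELT instruction produces) masks the tail, so the same binary runs correctly at any hardware vector length with no separate cleanup loop. A toy Python model of that pattern, illustrative only:

```python
# Toy model of an SVE-style predicated, vector-length-agnostic loop.

def whilelt(i, n, vl):
    """Predicate: lane j is active iff i + j < n (models WHILELT)."""
    return [i + j < n for j in range(vl)]

def vla_add(dst, a, b, vl):
    """dst = a + b, processed vl lanes at a time; tail handled by predicate."""
    n = len(a)
    i = 0
    while i < n:
        p = whilelt(i, n, vl)
        for j, active in enumerate(p):
            if active:                       # masked lanes do nothing
                dst[i + j] = a[i + j] + b[i + j]
        i += vl
    return dst

a = list(range(10))
b = [1] * 10
out = vla_add([0] * 10, a, b, vl=8)  # works unchanged for vl = 4, 8, 16, ...
print(out)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

AVX-512 does have mask registers, but the length-agnostic contract is the part that lets one code path span 128-bit through 2048-bit implementations.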
close - Monday, June 22, 2020 - link
As a whole system it's still 3 times the performance for 3 times the power. Pretty much identical power/TFLOP. The efficiency of these cores seems to be identical to that of the combined POWER and Nvidia cores.
mode_13h - Tuesday, June 23, 2020 - link
Again, you're comparing 3-year-old tech with cutting-edge. So, "pretty much identical power/TFLOP" is not a good thing.
And by the numbers I cited, Summit burns just 91.1% as many W per TFLOPS. That's significant.
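Taking the two efficiency figures quoted upthread at face value, the ~91% claim checks out:

```python
# Efficiency figures as quoted upthread (18.13 vs. 19.89, units as given).
fugaku, summit = 18.13, 19.89

# W per TFLOPS is the reciprocal of TFLOPS per W, so the ratio of
# Summit's W/TFLOPS to Fugaku's is simply fugaku / summit:
ratio = fugaku / summit
print(round(ratio, 4))   # 0.9115, i.e. roughly the 91.1% quoted
```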
eastcoast_pete - Monday, June 22, 2020 - link
Not just not half bad on efficiency, but also a lot more versatile. Now, I can't program for any of these, but I was told by people who are using supercomputers (or fractions of runtime, to be precise) that there are plenty of situations where it's actually highly desirable to have "just" a whole bunch of really powerful CPU instances to program for. I also believe that was one of the stated goals of Riken when they commissioned Fugaku. GPU and NPU accelerators can be extremely effective, but they are more limited in what they can do. My own, simple-minded explanation is that this is why we still have CPUs in our PCs; the dGPU is much faster at its tasks, but the CPU can do pretty much anything you can program for. Otherwise, why bother with a CPU?
Zizy - Monday, June 22, 2020 - link
Which other CPU-only system uses purpose-made parts these days? There were some IBM projects in the past, but this is the only recent processor designed for HPC, so it is naturally the best.
surt - Monday, June 22, 2020 - link
And 2.5x performance on only 2.8x power!
nft76 - Monday, June 22, 2020 - link
Calculating TFLOPS/W with just Linpack performance gives a rather one-sided view. In HPCG, Fugaku is more than twice as efficient as Summit (assuming power consumption similar to HPL; they don't seem to give power usage for HPCG).
mode_13h - Monday, June 22, 2020 - link
Still, you're comparing with old tech. The real test will be to see how it compares with Nvidia's A100, which seems to be much more adept at dealing with things like sparsity.
nft76 - Monday, June 22, 2020 - link
Nvidia's own A100 installation is included in the list and it's not much better than computers using V100 in HPCG (in fact, the ratio of HPCG to peak performance seems to be a bit worse).
That said, Fugaku's HPCG-to-peak ratio is high also compared to other CPU-only systems, so maybe it's more that the custom interconnect is really good for the problem.
anonomouse - Monday, June 22, 2020 - link
As another comparison point, Fugaku also tops (by quite a lot) the list for Graph500, for which Oak Ridge's Summit only submitted a CPU-only run. I don't see any GPU runs at all, with the closest match being some KNL-based systems and the Chinese Sunway machine, which has its own custom KNL equivalents. The new A100 GPUs are supposedly a lot better at graphs, but it's pretty telling on architecture flexibility that there are no GPU submissions to date for this.
And based on the number of nodes and cores used for Fugaku in the submission, it only used a bit over half of the full supercomputer.
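For context, Graph500 ranks machines on breadth-first search over an enormous random graph. A minimal level-synchronous BFS (a sketch of the kernel's shape, not the reference implementation) shows why the workload is dominated by irregular, latency-bound memory and network traffic rather than dense FLOPS:

```python
from collections import deque

def bfs_parents(adj, root):
    """Level-synchronous BFS building a parent array, which is the shape
    of Graph500's search kernel. Every adj[u] lookup is a data-dependent,
    irregular access -- scattered across many nodes at scale."""
    parent = {root: root}
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in parent:   # the random-access check dominates runtime
                parent[v] = u
                frontier.append(v)
    return parent

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_parents(adj, 0))   # {0: 0, 1: 0, 2: 0, 3: 1, 4: 3}
```

There is almost no arithmetic here to accelerate, which is one plausible reason GPU submissions are scarce.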
mode_13h - Monday, June 22, 2020 - link
> they were so keen on getting the supercomputer up and running to assist with the R&D as soon as possible – the server racks didn’t have their official front panels when they started working.

Oooo, scandalous! How dare they!
quadibloc - Monday, June 22, 2020 - link
That's about 3.77 TFLOPS of FP64 per chip, which is indeed pretty good.
Ushio01 - Monday, June 22, 2020 - link
So if Fujitsu is working with ARM CPUs for supercomputers, does that mean they have abandoned development of SPARC?
eastcoast_pete - Monday, June 22, 2020 - link
From what I remember, yes, that's pretty much the story. Fujitsu developed this chip to have a follow-up to SPARC; it probably didn't help that Oracle finally killed SPARC off after strangling it for years.
Andrew_Waite - Monday, June 22, 2020 - link
ARM-based, so a BBC Micro with a few bells and whistles then :-)
yetanotherhuman - Tuesday, July 7, 2020 - link
Hm, if memory serves, the BBC Micro used a MOS 6502, like the NES.
name99 - Monday, June 22, 2020 - link
You don't think it's significant that you can hit the same (Linpack) performance/watt as a GPU, but on a system with a much more "traditional" architecture (i.e. easier to port to, easier to match to a variety of different algorithms)?
Which gets at the second point: Linpack, i.e. dense linear algebra, is well known to be a terrible metric for this type of machine -- if what you want is dense linear algebra, you can do far better on dedicated hardware (that's why everyone is adding TPUs to their designs...).
Much more interesting is the performance on something that's not quite so trivial. There Fujitsu gets closer to 4.6x the IBM/nVidia result. So roughly they're 30% more efficient than IBM/nV for more "generic" code.
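The 4.6x figure is easy to sanity-check against the publicly reported June 2020 HPCG results (the PFLOPS figures below are assumed from that list, not from the article):

```python
# June 2020 HPCG results, PFLOPS (assumed figures from the public list):
fugaku_hpcg = 13.40
summit_hpcg = 2.93

print(round(fugaku_hpcg / summit_hpcg, 2))   # 4.57 -- i.e. "closer to 4.6x"
```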
Robin Whittle - Tuesday, June 23, 2020 - link
The most important aspect of this system is its HBM (High Bandwidth Memory) - broad, very fast connections to fast memory chips. These need to be physically close to the CPU chip. There are no DIMMs in this system.
With a quick search I didn't find any concrete details on the HBM, but this page has a picture of two HBM chips per 12 (or 13) cores: https://www.reddit.com/r/Amd/comments/9vyd1h/intel... .
Main memory bandwidth has been the biggest bottleneck for most computing for decades, other than that which can be done with GPUs.
Does this device have a single memory space, or do the four quadrants of 12 (13) cores each have their own space? If the former, then there will be a big latency and bandwidth restriction when accessing data in the HBM chips of the three other quadrants. If the latter, then the total RAM available to a program is set by the size of its quadrant's HBM chips.
Can anyone point to more details on the HBM used in Fugaku?
anonym - Monday, June 29, 2020 - link
Four CMGs (Core Memory Groups), each connected to an 8 GiB HBM chip.
https://www.hpci-office.jp/invite2/documents2/ws_m...
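That layout answers the capacity question directly. The per-stack bandwidth below is the commonly cited A64FX figure, assumed here rather than taken from the linked slides:

```python
# Memory figures implied by the CMG layout: four CMGs, one 8 GiB HBM2
# stack each. 256 GB/s per stack is the commonly cited A64FX number.
cmgs = 4
gib_per_cmg = 8
gbps_per_cmg = 256

print(cmgs * gib_per_cmg)    # 32 GiB total -- matches the "thin node" figure
print(cmgs * gbps_per_cmg)   # 1024 GB/s aggregate bandwidth per chip
```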
BurntMyBacon - Wednesday, June 24, 2020 - link
Yeah, single-board computers are not exactly commonplace in your typical commercial space, or even much of the professional industry as far as I can tell. I've only actually come across them in the supercomputing space, avionics, and a handful of other embedded-systems scenarios. I might have made the same mistake.
Santoval - Sunday, June 28, 2020 - link
So ExaFLOP computing was just achieved - well, for single precision (FP32) computing anyway. And only for peak performance, not max sustained.
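A rough sketch of where the "peak FP32 ExaFLOP" figure comes from. Every input below is an assumption based on commonly cited A64FX/Fugaku numbers (48 compute cores, two 512-bit SVE FMA pipes per core, 2.2 GHz boost, ~152k nodes in the June 2020 submission), not from the article:

```python
# Peak FP32 throughput, bottom-up. All inputs are assumed figures.
cores = 48           # compute cores per chip (assistant cores excluded)
fma_pipes = 2        # 512-bit SVE FMA pipes per core
fp32_lanes = 16      # 512 bits / 32-bit lanes
flops_per_fma = 2    # a fused multiply-add counts as two FLOPs
ghz = 2.2            # assumed boost clock
nodes = 152_064      # nodes in the June 2020 Top500 submission

chip_tflops_fp32 = cores * fma_pipes * fp32_lanes * flops_per_fma * ghz / 1000
system_eflops = chip_tflops_fp32 * nodes / 1e6
print(round(chip_tflops_fp32, 2))   # 6.76 TFLOPS FP32 per chip
print(round(system_eflops, 2))      # ~1.03 EFLOPS peak FP32
```

As the comment says, that is a peak figure; sustained throughput on real codes is lower.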