I used X-Gene and eMAG (never XG2 - never even saw one in the wild - but iirc it was a conservative modification of XG.) They were not stellar cores. Altra was a huge breath of fresh air, Neoverse has become legitimately very good and has a strong long-term roadmap... and they want to go back to custom microarchitectures?
Because there’s nothing they do that MS, FB, Google and others can’t do in house. And they probably are already doing so. They see the writing on the wall. Their best chance was to get acquired by a hyperscaler, but it seems they could not. The trouble with custom cores is that they are up against competition that has been doing cores for decades. I see trouble ahead.
I would have thought Ampere would go the risc-v way, specializing in a few strategic IPs to keep a performance or performance per watt advantage on competition.
"Trouble with custom cores is that they are up against competition doing cores for decades. I see trouble ahead "
the trouble with 'developing' any kind of thing that's rooted in maths, and cpus (and all the bells and whistles moved on-chip) surely are maths, is that over time, and not always much of it, One Best Solution gets discovered. the mainframe hasn't changed in decades. OTOH, the 8086, in all its variations, was not the One Best Solution, but managed through pure luck and some questionable behaviour to become The One Solution in microprocessors. in due time, if not already, there will be One Best Solution at the micro-arch level, too. moving evermore off-chip function on-chip isn't exactly any form of innovation, just incremental lock-in to the chip.
To be blunt, you have no idea what you are talking about. Modern chips are all about well-known tradeoffs along various dimensions, and there is definitely not One Best Solution. Your comment about mainframes not changing in decades is also just factually wrong - new Z mainframes are immensely different from last century's mainframes, they just have really good backwards compatibility.
"new Z mainframes are immensely different from last century's mainframes"
it's still the same CISC architecture it always was. registers are wider and there are such hardware advances (pretty much just a microprocessor implementation), but just as the 8086 remains the core, so to speak, of Intel/AMD, so is the 360 for z. it's backward compatible just because it's the same. if it weren't the One Best Solution, then there'd be a host of other mainframe, aka COBOL, machines. Sun, among others, tried to take the mini/server construct to overthrow the IBM mainframe. none pulled it off.
of course it is: it executes the 8086 ISA. it also executes some additional instructions. same for the 360 to z. does the current 86 cpu have a unique ALU not available in 1978? does it not execute original 16 bit instructions? of course it does. 99.44% of the 'innovation' since the 360/30 or 8086 has been in widening registers, piling on on-chip caches, and pulling off-chip functions on-chip. the killing off of the Seven Dwarfs and, later, the non-86 ISA microprocessors (Power survives; ARM is still up in the air) has led to a computer monoculture. well, duoculture.
backward compatibility is almost entirely about OS stability, since that's what applications know about. keep the system calls stable, and your machine is backward compatible. that's the reason you can run 360/DOS code on today's z; not that you would want to. OTOH, up to a few years ago, dedicated 360 code was still running; it ain't broke so don't fix it.
OTOH, as is well known, any machine with a C compiler can run *nix, IOW at least one OS (well, a few of them) doesn't much care about the ISA.
I was confused to see your name one day when I was flipping through the dictionary, until I realized your name turned up as an example for Dunning-Kruger.
Leaving aside the rest of your questionable claims...
> any machine with a C compiler can run *nix
You need a MMU for a proper UNIX-like OS. A lot of DSPs and microcontrollers that are C-programmable don't have one. So, no. It takes more than a C compiler.
You know absolutely nothing about what you are talking about. The original 8086 didn't have a shit ton of stuff in it. What you are arguing is that since the model of computation is the same as it was over 100 years ago (i.e., Turing machines/lambda calculus), nothing has changed. You're also ignoring branch predictors, superscalar cores, out-of-order execution, etc. Or that modern x86 cores might not even be technically x86 anymore, as they decode the ISA into macro-ops and those into micro-ops that arguably form an 'internal ISA'.
> The original 8086 didn't have a shit ton of stuff in it.
Yes. The UNIX security model depends on memory protection, which x86 lacked in any form until the 80286. However, I think x86 didn't have a proper MMU until the 80386.
one can, and Microsoft did, build a business on a real mode *nix, with a software patch to 'emulate' protected multi-user mode (I don't recall a cite, boo hoo). they called it Xenix, and it was built to the 8086, later others. until Kildall blew off Armonk's ask for a CP/M-86, after which IBM went to Microsoft for a control program for the 8088, Microsoft had no interest in anything but *nix. as usual, Bill lied and said, 'sure, we've got one of those'. so he conned DOS out of Seattle Computer and the rest is history. the 8086 was only a real mode processor, which required the patch to run Xenix multi-user. the 80286 supported both real and protected mode; alas, it couldn't return to real mode from protected mode except by power cycling. the 80386 was the start of something Big. as to moving all those other functions on-chip, that didn't happen as innovations by the microprocessor designers, but solely because the hardware folks gave them increasing transistor budgets. all of the functions mentioned had existed for a couple of decades on mainframes (and some minis, notably DEC), which already had those transistor budgets, and price points to match.
"IBM did things in the 1960's that most people think weren't invented until the 1980's."
and it wasn't, mostly, IBM who did the innovative things. Burroughs implemented virtual memory in the B5000 in 1961, years before IBM. the other six dwarfs invented most of the other stuff IBM gets credit for. well, maybe not RCA; they ended when they got caught 'cloning' the 360.
No, there's not. 9 + 5 and 5 + 9 are two different ways to come up with 14, equally "best". 5 + 5 + 4 is longer, but might have a less complex structure in the design, ending up being able to clock higher, for instance.
So, no, most mathematical problems have several solutions, all of them being correct. The application of it determines which method works best. In short: If you take into account the whole chain of commands, you can only try to find the best solution for that specific use-case. Which might be a disaster on a different chain of commands.
ps. To make it even more complex: 5 + 5 + 5 - (1/5 * 5) gives you 14 as well. And although it requires a lot more instructions, that might be the trade-off for a much simpler design. If you can have 4x as much compute power in your design, while doubling the amount of instructions, you are still twice as fast.
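The point above, that several correct instruction sequences can compute the same value at different costs, can be sketched numerically. The cost model below is entirely made up for illustration (one cycle per op, with a hypothetical simpler design clocking 2x higher):

```python
# Three correct ways to compute 14, with different op counts.
a = 9 + 5                      # 1 op
b = 5 + 5 + 4                  # 2 ops
c = 5 + 5 + 5 - (1 / 5 * 5)    # 5 ops, including the divide and multiply
assert a == b == c == 14

# Hypothetical cost model: wall-clock cost = ops * cycle_time.
def cost(ops, cycle_time):
    return ops * cycle_time

# If the 5-op design's simplicity let it clock 2x higher (cycle_time 0.5),
# it narrows the gap against the 1-op design, though it still doesn't win here.
print(cost(1, 1.0), cost(2, 1.0), cost(5, 0.5))  # → 1.0 2.0 2.5
```

Which variant is "best" depends entirely on the cycle time the design can reach, which is the whole point.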
I think this must be it. If you're a pure IP house in the business of packaging someone else's IP, well, you could make a decent systems integration business out of selling to smaller firms, but not to hyperscalers who have more scale than you. I suspect enterprise/SMB customers are sticking with x86, so you're getting squeezed out of both ends of the market.
I agree that this is a surprising move, one that I don't know will pan out, but their alternative looks grim.
I think you are right. Why would a hyperscaler buy ARM-designed cores from Ampere when it can do the same modifications itself, except more tightly tuned to its individual needs? At the same time, the time frame for (potentially) selling ARM cores to enterprises is far beyond the financial viability of Ampere. So for Ampere it's a matter of survival to try to design its own cores. Plus having its own cores certainly makes it a better takeover target than having a business based on designing around ARM cores. Engineering teams can be hired away. IP stays with the company. Cf. Apple and Imagination Technologies. Imagination ended up being acquired by a Chinese company and Apple went back to licensing IP from it even after Apple had hired away a lot of its engineers. Ampere's hope is to develop and show something worth being acquired for, like Nuvia did.
With most companies starting their own designs, the game is to increase value fast enough to get bought, for the IP and especially the talent, by a big player looking to extend its own capacity.
Interesting. That's what Amazon did to get Graviton, buy a startup working on customized ARM cores to bring that expertise in-house. Microsoft could end up doing the same with Ampere but with the added bonus of having ARM-compatible custom cores like what Apple has. It's still important for ARM to continue licensing out core designs as a backup plan because any of these custom cores could go the way of Exynos.
Thing is, their engineering team has been doing cores for decades as well - Intel x86-64 cores, to be precise. At the moment, they just have to convincingly beat the x86 competition, which should be doable - beating other ARM offerings is of course going to be harder.
Yeah, they have to try and differentiate themselves from everyone else who's just taking ARM's stock IP. If custom doesn't work, they could always revert to using ARM's stock IP, but the big money is in substantially beating ARM, for at least a few market niches.
I think they can probably do it, too. Especially if Neoverse continues to be just mobile cores on steroids.
There's absolutely no chance Nvidia can buy ARM. The US would have to lift all the semiconductor bans and promise never to impose them again for China to approve the deal.
Milan is already out, yet they are still comparing their next-gen 2022 product against AMD's previous-generation EPYC Rome. By the time this launches they will compete with AMD EPYC Genoa, with 50% more cores on 5nm, so they are almost back to where they are now with 80 ARM cores vs. 64 x86 cores. If Ampere learned anything from EPYC's ramp against Xeon, it's that they need to provide huge benefits and have a proven track record to gain market share, not just barely compete.
Yeah, sure. Ampere is going to get 60% more cores and everything else needed to support them (memory, IO) for free, and it will work across all workloads, not only the cherry-picked ones on the slides. The 80-core Altra is already monolithic, and if they remain monolithic at 128 cores then yields are going to be cut down dramatically. If it's not monolithic, then every latency-sensitive workload is going to get severely affected. There's no free lunch.
Remember once again, this is targeted at cloud workloads, so not all workloads are expected to improve. Compute-intensive workloads will see major increases, but bandwidth-intensive workloads won't until DDR5. If AnandTech runs the same set of benchmarks, Altra Max should beat Milan in all except for SPECjbb.
They will have great yields - with so much silicon for cores, most defects will be in the cores. Those are sold as parts with 112 and 96 cores.
Exactly that. With so many small-ish cores and accompanying lumps of cache memory, handling yield issues will look similar to how they did things on late-2000s-era GPUs.
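The harvesting argument above can be made concrete with a toy Poisson yield model. All the numbers here (die area, defect density, fraction of the die occupied by cores) are invented for illustration, not Ampere's real figures:

```python
import math

die_area_cm2 = 4.0      # assumed die area
defect_density = 0.2    # assumed defects per cm^2
core_area_frac = 0.7    # assumed fraction of the die that is cores + their cache
mean_defects = die_area_cm2 * defect_density

def poisson(k, lam):
    """Probability of exactly k random defects landing on the die."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Perfect-die yield: zero defects anywhere on the die.
perfect = poisson(0, mean_defects)

# Harvested yield: a die is still sellable (as a 112- or 96-core part) if
# every defect lands inside the core array; a defect in uncore kills the die.
sellable = sum(poisson(k, mean_defects) * core_area_frac**k for k in range(17))

print(f"perfect dies: {perfect:.1%}, sellable with core harvesting: {sellable:.1%}")
```

With these made-up inputs, harvesting roughly doubles the fraction of sellable dies, which is why core-dominated layouts tolerate defects so well.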
Besides Altra Max being a 2021 product, as SarahKerrigan already pointed out, even in Q2 AMD expects more Rome sales than Milan sales. AMD expects both Rome and Milan to be selling right through to the end of 2021 and "by the third quarter that it would cross over and Milan would perhaps be higher than Rome." So a comparison to Rome is certainly appropriate. As for leaving Milan out, that could be inappropriate, but it may be appropriate to the segment of the market Ampere is pursuing.
BTW, why'd you point out the Rome thing and not Cascade Lake?
They could have chosen Milan, which has been available for some time and can even be benchmarked on the hyperscalers' instances, but then even their current Altra would look just behind in their slides. Why buy and test the Altra when it's already behind even in the cherry-picked benchmarks? Ice Lake launched later and Intel only sold around 150k units before launch, so they have some excuse for not benching against it, although I'm sure they could have gotten their hands on one if they really wanted to. Heck, even YouTubers manage to get parts under NDA that are not provided to them on launch day, but somehow a company as big as Ampere can't manage to get a single sample.
It's not obvious at all Milan would do better than Rome in those particular benchmarks. Even on SPECINT_rate Milan is disappointing as the large IPC increases of Zen 3 are not showing up. In fact the gains are mostly due to increasing power consumption by 25%. If you limit Milan to the same TDP as Rome, it ends up SLOWER as AnandTech showed here: https://www.anandtech.com/show/16529/amd-epyc-mila...
Contrast that with Altra Max which runs 60% more cores at the same TDP.
> So a comparison to Rome is certainly appropriate
No, it's not. By your own admission, in Q3, when Altra Max would certainly not be selling in any higher volume than Quicksilver, Milan would outsell Rome. Why would it be appropriate to compare your product to a competitor's older, lower-volume product when the new product has been out for some time and is expected to be the main competitor?
Because the performance difference between Rome and Milan in these benchmarks will be small-to-non-existent. There's no need to take our word on this, you can happily wait and see.
Ampere Altra already matches (and often beats) Milan as per AnandTech's review. Altra Max opens a huge gap over Milan, and so should match Genoa as well. Ampere's next generation will then increase the gap even further. Core count might increase beyond 128 as suggested in the slides. That's not barely competing but leading performance at significantly lower cost and lower power.
Aren't you going to mention any of the workloads where even the 28-core 8280 beats the 80-core Altra? And now there's the 40-core Ice Lake version, with Sapphire Rapids incoming. AMD hasn't been able to ramp up even with a 2x performance gap over Xeons, because it's not just about superior performance. The platform needs to be stable (they just destabilized it by going custom with in-house cores) and engineering support must be there. Customers aren't going to wait months for their tickets to be resolved just because Ampere doesn't have the thousands of engineers necessary to support their ramp.
There was only one workload where Altra did really badly (SPECjbb), but in almost all other workloads (SPECINT_rate, LLVM compile, NAMD) even 2x 8280 was not able to match a single Altra. In SPECFP_rate, 2x 8280 was barely faster than 1x Altra. The incoming Ice Lakes have to compete with Altra Max.
> in almost all other workloads (SPECINT_rate, LLVM compile, NAMD)
> even 2x 8280 was not able to match a single Altra.
You're perhaps focusing too much on highly-scalable workloads. Both brands of x86 CPUs are still well ahead in per-core (if not per-thread) performance.
As for per-thread, Altra has 62% higher performance than 7763! So per-thread performance is absolutely awful on x86, I don't see how you can possibly claim x86 is ahead with a straight face.
> As for per-thread, Altra has 62% higher performance than 7763!
> So per-thread performance is absolutely awful on x86
Yeah, but you surely know that's an apples vs. oranges comparison. The marginal cost of the hyperthreads is like 10% more area. So, it ends up being a win to add SMT. Otherwise, they wouldn't do it.
And let's not forget that, when you're comparing per-thread performance, the 7763 has 128 threads vs. only 80 threads in the Q80-33. So, the concept of having 2x as many threads that are less performant is cut from the same conceptual cloth as Ampere's idea of using a larger number of slower ARM cores. You can't really criticize the SMT of x86 CPUs without implicitly criticizing ARM and Ampere's approach of making up for the slowness of their cores by using more of them.
And AMD nets 12% better aggregate performance and a win on SPECint2017 Base Rate-N from the technique. If the shoe were on the other foot, you'd be trumpeting how smart a move it is for ARM to be doing it.
But we can all now see you're not here to have a good faith discussion of the relative strengths and weaknesses of these CPUs. You're cherry-picking and spinning literally every single thing you can, to make Altra look like it dominates, when it's merely competitive.
Assuming SMT adds 10% area on x86, getting just 12% better throughput is not a great advertisement for SMT. There are of course scenarios where SMT does better, but I don't believe SMT makes much sense in servers. With Arm's smaller cores, it's better to add more cores.
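The area argument can be put in throughput-per-area terms using the figures quoted in this subthread (~10% extra area for SMT, ~12% more throughput), plus the assumption that extra small cores scale roughly linearly until bandwidth saturates:

```python
# Throughput per unit silicon area, normalized to the baseline design.
def perf_per_area(perf_gain, area_gain):
    return (1 + perf_gain) / (1 + area_gain)

smt = perf_per_area(0.12, 0.10)         # 2-way SMT: +10% area, +12% throughput
more_cores = perf_per_area(0.10, 0.10)  # +10% small cores, assumed ~linear scaling

print(f"SMT: {smt:.3f} perf/area, extra cores: {more_cores:.3f} perf/area")
```

With these numbers SMT is only about 2% more area-efficient than simply adding cores, which is why the break-even is so sensitive to the true area cost (5% vs. 10%) being argued about here.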
When Altra Max benchmarks are out, you will see that adding real cores works out better than adding SMT. Altra Max will outperform Milan by a good margin at lower power. Its performance per thread will also be higher than an SMT thread on Milan.
If you still didn't get the importance of the historical achievement of a licensed Arm server core beating Intel/AMD's latest and greatest, Altra Max will surely do it.
Intel states SMT area and power cost is less than 5%, so it's almost free for 2-way SMT. https://software.intel.com/content/www/us/en/devel... Moreover, SMT can be turned off if your application does not need it, as in some HPC cases.
And dream on… if adding cores by cutting caches resulted in higher per-thread performance, then caches would have no merit. Which is utter nonsense. The 128c Altra is rumored to have 16MB total LLC to make room for more cores. The 128c Altra is a SPECrate accelerator, not a general-purpose server. See the SPECjbb results. Altra is Calxeda redux: it posts high SPECrate scores, with crap latency performance under load.
Those kinds of claims are likely as trustworthy as 10nm status claims - without a detailed paper explaining which buffers were duplicated, how many bits were added to all the cache structures to differentiate between threads, it's hard to know how it was computed.
Some Arm cores have SMT, so the cost on an Arm design is known (but not public). Those cores haven't been big successes, and no new SMT cores have been announced since, which suggests it is not as easy or cheap as claimed. Given the myriad of security issues with SMT, I'm not expecting future designs to have SMT either.
Read what I said: Altra Max will beat Milan on overall performance and per-thread performance. Adding more cores will of course reduce performance per core due to cache/bandwidth limitations, but throughput still goes up significantly. Altra Max is special purpose like A64FX - with Arm you will see different designs targeting specific niches.
> that suggests it is not as easy or cheap as claimed.
Or, maybe just that it doesn't make as much sense on mobile, which is still ARM's top priority. So far, all their server cores are still derived from mobile ones.
> Given the myriad of security issues with SMT
This is another potential explanation why ARM didn't bother to pursue SMT. However, recent kernel work by Google and others has mitigated SMT-related security issues by optionally preventing threads from different processes from sharing a physical core (core scheduling).
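For reference, the blunter mitigation, disabling SMT entirely at runtime, is exposed on Linux through a standard sysfs interface (kernel 4.19+). A minimal sketch, to be run as root:

```shell
# Show whether sibling threads are currently online (1 = SMT active).
cat /sys/devices/system/cpu/smt/active

# Disable SMT at runtime; writing "on" re-enables it, "forceoff"
# keeps it off until the next reboot.
echo off > /sys/devices/system/cpu/smt/control
```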
> Altra Max will beat Milan on overall performance and per-thread performance.
If they indeed accomplish that by further reducing LLC, then it could be somewhat of a Pyrrhic victory. It will be telling if a number of benchmark scores actually *drop*, relative to the 80-core model.
> Assuming SMT adds 10% area on x86, getting
> just 12% better throughput is not a great advertisement for SMT.
It's not. Now, I don't know if they actually disabled HT or just relied on the thread scheduler to execute only one thread per core. Switching to the 1S results should control for that a little better, in which case EPYC 7763 gets a 16.2% benefit. The 7742 gets a 15.4% benefit, and Xeon 8280 gets a 16.8% benefit.
FWIW, my experience compiling software has shown a much bigger benefit from SMT. So, that suggests the amount of benefit is highly workload-dependent. This makes intuitive sense, since we know that the amount of ILP in code & its cache hit-rate varies quite widely.
> I don't believe SMT makes much sense in servers.
It works well for GPUs!
> When Altra Max benchmarks are out, you will see that adding real cores works
> out better than adding SMT.
Could be, though PPA (and therefore perf/$) will be lower than the 80-core versions. And those are Altra's biggest selling points.
Anyway, I'm a lot more excited to see the V1 benchmarked. It's going to be interesting to see just how well 2x256-bit SVE measures up against AMD's 4x256-bit and Intel's 2x512-bit vector engines.
> If you still didn't get the importance of the historical achievement ...
I get it.
I expected N2 would be the next nail in x86's coffin, but its ~3% worse efficiency has me wondering if it will really quicken the pace of the ARM transition.
I was also a little disappointed to find nothing revolutionary in ARMv9. I hope ARMv10 is a more fundamental rethink of the ISA.
Sorry, that was an incomplete thought. The Max should have better perf/W, which is probably Altra's biggest selling point. Depending on how much better, it could easily overcome the perf/$ difference.
In this paper, Google claims the effect of 2-way SMT is ~40% on Haswell (much higher on POWER8, but POWER8 has worse single-threaded frequency and is designed to be a >2-way SMT machine). See Fig 2b: https://web.stanford.edu/~kozyraki/publications/20...
It's about predictability. If they still used Neoverse designs, ARM announces PPA gains much earlier, and customers can do their own napkin math on the perf/core count for the next generation of Ampere, and plan accordingly.
Are we reading the same review? Where did it say Altra outright matches Milan? It can't even guarantee to match Rome.
"The Altra Q80-33 sometimes beats the EPYC 7742, and loses out sometimes – depending the workload. The Altra’s strengths lie in compute-bound workloads where having 25% more cores is an advantage. The Neoverse-N1 cores clocked at 3.3GHz can more than match the per-core performance of Zen2 inside the EPYC CPUs.
There are still workloads in which the Altra doesn’t do as well – anything that puts higher cache pressure on the cores will heavily favour the EPYC"
Did you actually look at the Milan vs Altra results? Altra matches Milan on SPECINT_rate and beats it on NAMD. LLVM compile is close; the only one Altra doesn't do well on is SPECjbb.
So, to say that "Ampere Altra already matches (and often beats) Milan" is not accurate. It often matches or beats Milan, but Milan still pulls a few big wins and edges it out in several other cases.
And if you look at the aggregates, Altra really gets its ass handed to it by Milan on SPECfp2017 Base Rate-N. Kinda funny how you never mentioned fp.
If you had read the Altra articles, you would know Ampere explicitly aims Altra at integer cloud workloads and even down-tuned floating point for better integer performance. Hence integer is much more important than FP.
Still, it doesn't do bad at all on fprate: the 7742 is only 2% faster, the 7763 is 15.5% faster.
So you agree Milan really gets its ass handed to it by Altra on NAMD since it beats Milan by 16.7%?
> Ampere explicitly aims Altra at integer cloud workloads and even down-tuned
> floating point for better integer performance.
You're saying they *customized* the N1 cores in Altra to have less floating-point performance?? Citation needed!
> So you agree Milan really gets its ass handed to it by Altra on NAMD since it beats Milan by 16.7%?
Sure. The data is reported as you say, although I'm unclear on their point about multicore vs. MPI and whether that means they ran different modes on the CPUs.
Also, I wasn't the one trying to characterize the relative performances of these CPUs, so it contradicts nothing I've said or implied.
Lastly, it's one test case vs. 22 in the SPEC2017 benchmarks. I think the fact that it's reported as a separate result over-weights its importance; it should probably be viewed on par with any of the individual SPEC2017 tests (or perhaps less, considering their misgivings about its core-local nature and the build inconsistencies).
Ampere Altra is perfectly fine for latency-insensitive workloads like web hosting. But if you care about tail latencies, the SPECjbb result makes its failings obvious. This has less to do with Arm and more to do with Ampere's poor architectural choice of insufficient LLC. Now obviously, if raw single-thread performance matters, a lower-core-count x86 is your best bet. See https://www.anandtech.com/show/16529/amd-epyc-mila... for per-virtual-CPU perf. Note the F-optimized part with SMT off at the top of the list.

More cache on the Ampere part wouldn't have done much for SPECrate, but would have avoided the obvious glass jaw that SPECjbb demonstrates. If you know for sure that you are only ever going to run latency-insensitive batch jobs, Ampere is more than adequate. If tail latency matters to you, the SPECjbb data point is a huge red flag. A 2P Altra loses to a 1P Milan on tail latency. And this is Ampere's fault for wanting to win on SPECrate at all costs.

If you don't know what might be running, a sane person would prefer the design with fewer glass jaws. Google has made much of the importance of tail latency as a reason why fewer, beefier cores are better than a flock of weak ones. I suspect a lot of cloud workloads are like that, but I have no data other than Google papers. But yeah, if you have an application-specific server like web hosting, Altra is adequate.
Also, what is NAMD representative of? It doesn't predict SPECfp rate despite being a part of it. Due to its tiny working set of under an MB, SMT likely hurts it, but the AT folks don't report an SMT-off score. So sure, it's something, but it's about as representative as AMD's favorite, Cinebench. I think the AT test suite for Arm is biased towards things that can be built and run reliably and fairly on both platforms. Perhaps that explains its inclusion despite their own commentary discounting its utility as a benchmark.
> You're saying they *customized* the N1 cores in Altra to have less floating-point performance?? Citation needed!
Absolutely, and not a little either. SPECINT and FP scores are normally quite balanced, for Graviton they are within 7%: https://images.anandtech.com/graphs/graph15578/115... . However on Altra they differ by 37%. So Graviton 2 has effectively 28% faster FP than Altra despite both using the same N1!
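A quick sanity check of the arithmetic above, using only the ratios quoted (int and fp within ~7% on Graviton 2, ~37% apart on Altra) and the assumption that both chips' integer performance is comparable since both use N1 cores:

```python
# int/fp score ratios as quoted in the comment above.
graviton_int_over_fp = 1.07
altra_int_over_fp = 1.37

# With similar integer performance, the implied relative FP advantage is:
fp_advantage = altra_int_over_fp / graviton_int_over_fp - 1
print(f"Graviton 2 FP advantage over Altra: ~{fp_advantage:.0%}")  # ~28%
```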
>> You're saying they *customized* the N1 cores in Altra to have less
>> floating-point performance?? Citation needed!
> Absolutely, and not a little either.
Again, I'm looking for specific claims that Ampere did something to the N1 cores in Altra that directly reduced their FP performance. Not benchmark results, which could have many explanations.
> There are many different settings to customize Arm cores
But not the number of FP pipes, issue ports, or latency! Prefetchers *could* have disproportionately impacted FP benchmarks, but they're not specific to floating-point and can also hurt integer code.
With the Milan review, the only thing the Q80-33 was able to win is NAMD. All the others range from barely keeping up to losing by a HUGE margin. 2x Q80-33 can't even match one EPYC 7742 in MultiJVM critical-jOPS, and is only 27% better in max-jOPS; that's 160 cores/2 sockets vs. 64 cores.
Besides NAMD, Altra also wins most benchmarks in SPEC_rate - I count 12 wins in 22 int+fp benchmarks, often by a good margin. Altra Max will do even better. The only big loss is SPECjbb (and it will likely be the only loss for Altra Max) - I guess Ampere is not optimizing for that workload.
NAMD happens to have a tiny data cache footprint of less than an MB. Not surprisingly, it scales linearly with the number of cores. In every recent review, the AT authors say they don't like this benchmark, but they post it regardless for reasons I don't understand.
Ampere Altra is designed to "win" at SPECrate at all costs. Rumors on the internet (somewhat credible) indicate that they cut the L3 cache to 16MB to fit 128 cores. 60% more cores apparently give 19% more perf. See how nonlinear that is? I told you so a while ago… SPECrate scales with the number of virtual CPUs until you run out of DRAM bandwidth. Cache capacity does not matter. Just add cores and watch it fly.

Memory bandwidth saturation is already hit around 64c, beyond which additional cores become less effective. Milan is also affected by this, where higher IPC does not buy as much throughput in 64c configs. Ampere crippled themselves further by removing cache and increasing DRAM pressure. All to win at a benchmark that is maybe vaguely representative of latency-insensitive homogeneous batch processing.

SPECjbb, on the other hand, is a latency-sensitive server load, and Ampere's 80c design in a 2P config is barely ahead of a 1P Milan in the AT review. I expect the 128c version to suck even more, but let's see.
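The nonlinearity described above (60% more cores yielding only ~19% more throughput) is exactly the shape a minimal bandwidth-ceiling model predicts. The ceiling value below is chosen purely to reproduce that shape, not measured from any real part:

```python
# Toy roofline-style model: throughput is the lesser of aggregate compute
# and the memory-bandwidth ceiling. All numbers are illustrative.
def throughput(cores, per_core=1.0, bw_ceiling=95.0):
    return min(cores * per_core, bw_ceiling)

t80 = throughput(80)    # below the ceiling: scales with core count
t128 = throughput(128)  # capped by the assumed bandwidth ceiling

print(f"60% more cores -> {t128 / t80 - 1:.0%} more throughput")
```

Once the ceiling is hit, extra cores (and lost cache, which raises DRAM pressure) stop translating into throughput.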
If anything, you have a clear bias. You are the one who falsely claimed that x86 wins both on per-core and per-thread performance despite clear evidence of the contrary.
I agree. I think we need someone like ASUS to start selling motherboards built around a new CPU socket. However, this one would be designed for ARM processors, with the socket details open-sourced for any company to design for. Let's say it scales from 1 core up to 32 cores, and the kit is reasonably (USD $200) priced. On top, it comes with a stable, open-source Debian OS.
Now everyone can get their hands on an ARM kit and do development straight on the product. Whether that's a Qualcomm chip, Samsung, HiSilicon, MediaTek, Rockchip, Unisoc, VIA, AMLogic, or Allwinner doesn't matter.
And I suspect we would very quickly see an official Windows 10 on ARM build, Android OS, custom ROMs, and lots of Linux distros. Perhaps even a Hackintosh. With this comes the mainstream transition of "big programs" that usually live on Windows getting officially ported to ARM, without the need to dumb things down for mobile.
It would be extremely handy for companies to get their hands on this kit, mainstream, with good support and cheap pricing... to basically just play with it and find out whether their software (or solutions) run adequately. I suspect for most it would. That makes it a much easier decision for them to transition from an x86 server to an ARM server. They could potentially swap out their server-connected tiny office PCs (e.g. NUCs) for small ARM boxes as well. If or when this happens, we should see a better and better job market for IT/software professionals, and the old AMD/Intel duopoly having to massively cut retail prices (-70%?) to remain profitable.
PS: I know about the Altra workstations, but they're more of a niche product from a niche company. And ridiculously priced, with lots of lock-in options. I wanted something reasonable, well priced, aimed at the mainstream, from a reputable mainstream company (e.g. ASUS, Gigabyte, MSI, etc.).
You're missing one big problem with your theory. Different chips have different amounts of I/O and different pin-outs. Therefore, a universal MB socket would be unattractive and you'd get lots of "supports up-to" statements when purchasing.
So, linking to Crowd Supply apparently gets flagged as spam. Ugh! I've posted a wiki article for you instead. You'll have to do a quick Google search for the actual EOMA68 project.
This is genuinely great to hear. If even Apple, a lifestyle company in California, can commit engineers, resources, and time to custom uarches for consumer-only tech gadgets, why not everyone in the hyperscaling market?
Honestly, though, if your microarchitecture roadmap tells the world it’s slowing down its cadence, people believe you. Would you want to tie yourself to a roadmap that all but confirms “don’t expect any big leaps” soon?
Neoverse’s “big” numbers (“40% higher ST performance, N2 vs N1”) span three generations: it’s comparing a 2018 uarch with a 2021 uarch (12% ST gain per year / generation).
And Makalu + Matterhorn are only promising 14% ST improvement per generation.
But. It’s not easy for Ampere, either. Ask Intel or AMD. It’s not hard to hit a uarch performance drought versus your competitors.
The thing is, that graph is trying to be honest. Nobody else can sustain such large yearly IPC improvements. Moore's law is still holding on, but don't expect large frequency uplifts from new processes either. So everybody is in the same boat.
Your math is off as well - Neoverse N1 is from 2019 and Neoverse V1 is in 2021, so that's 50% uplift in 2 years. N2 is a later, efficiency optimized core with slightly lower performance, but that doesn't change the performance gain of Neoverse V1 - unless you're trying to downplay the significant performance gains made by recent Arm cores...
The Cortex A77 was released in 2019 and it has not been used in a single Neoverse CPU.
//
And then after Matterhorn and Makalu? Surely NUVIA, Apple, Qualcomm, Ampere, and everyone else have seen Arm’s stock core roadmap. All of them used Arm cores for a few generations (NUVIA excluded) and went to custom.
You don’t need to beat Arm by 50%. Even a 10% IPC win versus Arm per year adds up.
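A quick compounding sketch (hypothetical numbers, just to illustrate how a steady per-year edge accumulates):

```python
# A steady 10% yearly IPC advantage compounds multiplicatively:
# after n years the cumulative lead is (1 + 0.10) ** n.
def cumulative_lead(annual_gain: float, years: int) -> float:
    """Multiplicative lead after compounding a yearly gain."""
    return (1 + annual_gain) ** years

for n in (1, 3, 5):
    print(f"{n} year(s) of +10%/yr -> {cumulative_lead(0.10, n):.2f}x")
# five years of a 10%/yr edge is already a ~1.61x cumulative lead
```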
Arm stock microarchitectures aren’t bad, but they’re not performance leaders. Likewise, it’s not just IPC, but perf-per-watt, too, as Ampere wrote.
Samsung, NVIDIA, Amazon are all hanging on, but I don’t expect any performance leadership from any of their designs.
No your math is way off. Neoverse N1 was announced in February 2019 with implementations late 2019. Neoverse V1 was just announced with first implementations likely later this year. So that's 50% performance gain in 2 years. It's the same for Cortex-A76 to Cortex-X1. The roadmap shows another 30% in the next 2 years, so that averages to 18% per year over 4 years. That's very hard to beat.
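The annualized rates being argued here can be checked with a geometric mean (a sketch; the 50% and 30% cumulative uplifts are the figures cited in this comment, not independently sourced):

```python
def annualized(total_gain: float, years: float) -> float:
    """Convert a cumulative uplift into an equivalent per-year rate."""
    return (1 + total_gain) ** (1 / years) - 1

# 50% over 2 years (N1 -> V1, per the comment): ~22% per year
print(f"{annualized(0.50, 2):.1%}")
# 50% then another 30%, over 4 years total: ~18% per year
print(f"{annualized(1.50 * 1.30 - 1, 4):.1%}")
```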
Many companies have tried custom cores but gave up (QC are at their 4th try with Nuvia and kept using Cortex cores alongside custom designs). It's possible to differentiate and get a 10-20% gain but gaining 10% yearly is unlikely. Apple is ahead because they spend many times more area on cores and caches than everybody else and pay a premium for early access to the latest processes. Those are one-off gains, you can't keep doubling your caches every year...
Ampere Altra is proof of performance and efficiency leadership for a stock Arm core. From the article the decision to go custom seems more about specialization for the cloud market. You could cut down the floating point and Neon units if integer performance is all you want.
Btw it's actually possible Arm's latest Neoverse cores are simply too fast. A custom core that is say only 20% faster than Neoverse N1 (but at same area by cutting down FP/Neon and 32-bit support) might well be more suited for certain cloud applications.
> it's actually possible Arm's latest Neoverse cores are simply too fast.
LOL.
> cutting down FP/Neon
They're already losing on FP. And the N1's vector width is only 2x 128-bit, as compared with Zen's 4x 256-bit and Intel's 2x 512-bit. So, it's not like the N1 is burning a ton of die space on it, or has a lot of excess FP performance to spare.
Ampere is aiming at integer cloud workloads. N1 achieves 86% of the FP performance of Milan with just 2x128-bit pipes, which is likely more than Ampere needs. Since N1 is absolutely tiny compared to Zen 3 or IceLake, the 2x128-bit pipes are actually a significant fraction of its die size.
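Peak throughput per cycle follows directly from pipe count and width. A back-of-the-envelope sketch, using the pipe configurations cited above and assuming every pipe can issue a fused multiply-add each cycle (which flatters some of these designs):

```python
def peak_fp64_flops_per_cycle(pipes: int, width_bits: int) -> int:
    """An FMA counts as 2 FLOPs; FP64 lanes per pipe = width / 64."""
    return pipes * (width_bits // 64) * 2

print(peak_fp64_flops_per_cycle(2, 128))  # Neoverse N1-style 2x128-bit: 8
print(peak_fp64_flops_per_cycle(4, 256))  # Zen-style 4x256-bit: 32
print(peak_fp64_flops_per_cycle(2, 512))  # AVX-512-style 2x512-bit: 32
```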
> the 2x128-bit pipes are actually a significant fraction of its die size.
Let's take this as given and look to the N2. It's not increasing its vector pipelines in either width or count (but they report some wins from simply using SVE), in spite of going to 5 nm. So, if they're not increasing the N2's FPU, in spite of the overall core size getting larger, then that means it'll constitute a smaller portion of the N2. That naturally leads to the question of what Siryn stands to gain, by making an even *smaller* FPU? I'd say: not much.
N2 adds SVE2 so the vector pipelines are larger and more complex. FP performance is likely 40% higher than N1, so you could cut out a lot if all you wanted is bare-bones FP/SIMD like Ampere's eMAG generation.
Saving 10-15% area directly translates to more cores/cache and lower costs. We are talking about 128+ cores so small savings per core add up.
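The cores-for-area trade is simple arithmetic (a sketch using the 10-15% figure from this comment):

```python
cores = 128  # baseline core count, per the comment
for saving in (0.10, 0.15):
    # same total die area, smaller cores -> more of them fit
    extra = cores / (1 - saving) - cores
    print(f"{saving:.0%} smaller core -> room for ~{round(extra)} more cores")
```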
> N2 adds SVE2 so the vector pipelines are larger and more complex.
> FP performance is likely 40% higher than N1
Why would you expect the same number & width of pipelines @ the same clock speed to deliver that kind of speedup? According to ARM's own estimates, the N2 delivers only about 20% speedup at about 60th percentile. Median speedup looks to be only about 15%.
> Saving 10-15% area directly translates to more cores/cache and lower costs.
It's very unlikely one 128-bit SVE pipeline takes 10% of die area, and you can't have less than 128-bits in ARMv9. I suppose they could simplify the pipeline stages at the expense of latency, but that could really start to hurt code that uses even a little floating-point, which is probably more common in "server workloads" than it used to be.
> Why would you expect the same number & width of pipelines @ the same clock speed to deliver that kind of speedup?
Because the rest of the core is improved. The bottleneck is usually the frontend and caches rather than the FP pipes. For example the Cortex-A77 article shows a 23% speedup over Cortex-A76 on SPECINT_2017 but a larger 28% speedup for FP despite no reported changes to the SIMD pipes: https://images.anandtech.com/doci/14384/CortexA77-...
You can reduce the area significantly, eg. using a single 64-bit FMA pipe that needs 2 cycles for 128-bit SIMD. That would work fine with code that needs a bit of scalar floating point.
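The throughput cost of that narrower pipe is easy to quantify (a sketch assuming fully pipelined, back-to-back independent FMAs, ignoring latency effects):

```python
import math

def cycles_for_fmas(num_128bit_ops: int, pipes: int, cycles_per_op: int) -> int:
    """Cycles to retire a stream of 128-bit FMA ops across identical pipes."""
    return math.ceil(num_128bit_ops * cycles_per_op / pipes)

full = cycles_for_fmas(1000, pipes=2, cycles_per_op=1)    # N1-style 2x128-bit
narrow = cycles_for_fmas(1000, pipes=1, cycles_per_op=2)  # 1x64-bit, double-pumped
print(full, narrow)  # the narrow design is 4x slower on dense SIMD
```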
> You can reduce the area significantly, eg. using a single 64-bit FMA pipe that needs 2 cycles for 128-bit SIMD. That would work fine with code that needs a bit of scalar floating point.
Alright. I'll concede this point. Maybe one of the points of differentiation in their cores is less area devoted to FP. I doubt it, but we're into pure speculation, here. I certainly can't say you're wrong.
Let's just hope that whatever they've got in the works is compelling and somehow meaningfully different from what ARM has announced and the other offerings their competitors will have on the market. I do want to see Ampere succeed and live on to mature into a more formidable player.
Read it again: Arm releases a new uarch every year. Arm does not release (or simply cannot release, due to internal deficiencies / failures) a new Neoverse variant every year.
uArch improvements are judged per-generation so it’s actually comparable with everyone else.
You can skip Intel or AMD or Apple generations, too, and cherry-pick your way to absurd “40% gen-over-gen improvement” numbers. 👌
Nobody is making 40% per year ST gains. 💀
If you can’t understand that, I’ll let you go. 🙏 Ampere is heavily MT-focused, so their goals are much more likely more cores / better core-to-core topologies / lower power.
//
Apple has been making “one-off” gains for the better part of a decade. Apparently nobody else was interested enough to even try.
You don’t think Intel wants to make big cores? Is AMD some anti-cache fanatic? Nope. They simply did not feel they were necessary and so they reap what they sow. 🤷♂️
Neoverse N1->N2 is 40% and N1->V1 is 50% gain in a single generation. You simply can't deny that N1->V1 got a 50% performance gain in 2 years. Nobody on here claimed that means 40% year on year or 50% every generation; you're just making that up and cherry-picking your numbers.
Apple has been consistently ~2 generations ahead of everybody else. The gap is not increasing, so we are talking about one-off gains. If you slap enough cache around a Cortex-X1, clock it high on TSMC 5nm, it will reduce the gap significantly. An Arm slide suggests 20-30% gain is easy.
AMD/Intel cores are significantly larger than Neoverse cores. There is no doubt AMD's huge caches help a lot, but they are using ~3 times more silicon. Altra achieves incredible performance using a much smaller silicon budget. Future Arm servers could also move to chiplets and increase caches significantly. So again, that's a one-off gain.
Biased much? If anything getting 40% IPC gain at pretty much the same efficiency is incredibly impressive. It's normal in the industry for higher IPC cores to lose efficiency. I'm not sure how you can say Arm is running out of gas when on the same process you have 128 Arm cores using less power than 64 cores in Milan...
I'm just trying to take an honest look at the data. Maybe you should re-read that article's conclusion. It voices similar reservations about the N2's power consumption.
The point is not a minor one. This is a major and disturbing break in ARM's trend of advancing perf/W, and it's in their product line which is supposed to balance efficiency as an equal priority.
> I'm not sure how you can say Arm is running out of gas
...on the efficiency front! You win nothing by such blatant twisting of my words.
Is it really honest? N2's perf/W on the same process is just 3.5% lower than N1. That's not a big deal, a "disturbing break in ARM's trend of advancing perf/W" or "ARM could be running out of gas, on the efficiency front".
Now if the IPC gain was just 10% instead of 40% then you might have a point, but maintaining efficiency at such large IPC gains is extremely difficult and unheard of.
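The 3.5% figure is just the ratio of the numbers cited earlier in the thread (1.4x IPC at 1.45x power, ISO frequency and process):

```python
ipc_gain = 1.40   # N2 vs N1 IPC, per Arm's projections as cited above
power = 1.45      # N2 vs N1 power at ISO frequency, ISO process
perf_per_watt = ipc_gain / power
print(f"{perf_per_watt:.3f}x perf/W, i.e. {1 - perf_per_watt:.1%} lower")
```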
> N2's perf/W on the same process is just 3.5% lower than N1.
Okay, I had thought it was at 5nm, which was used for their other projections. I now see that the efficiency slide says "(ISO process & configuration)". So, if most/all implementations use 5 nm, then it should improve on perf/W, which should keep it competitive.
Well, I learned something I would've missed without this exchange, so thanks.
> maintaining efficiency at such large IPC gains is extremely difficult and unheard of.
Is it? Hasn't Apple shown good efficiency at even greater IPC?
> Neoverse N1 was announced in February 2019 with implementations late 2019.
> Neoverse V1 was just announced with first implementations likely later this year.
> So that's 50% performance gain in 2 years.
Which models and numbers you pick depends on your goal. If you just want to make ARM look as impressive as possible, then I think you're on the right track. However, what you've done is like comparing the machine learning performance of Comet Lake vs Ice Lake-SP. You could tout some insane year-on-year improvement, but that ignores the fact that they're in different product lines and were optimized (and priced) to do different sorts of things. Also, Comet Lake's micro-architecture is like 5 years old, even though it just launched last year.
So, go ahead and bicker over technicalities, like which CPU started shipping in which year. However, if the point is to establish some sort of performance trendline, in order to try and predict what sorts of improvements *future* ARM cores might provide, you're just leading yourself astray.
> that doesn't change the performance gain of Neoverse V1
But it uses a lot more power. So, it's not the typical way we're accustomed to looking at perf gains, where it's at most ISO-power. By ARM's own admission, it's 0.7x to 1x as efficient, which means 1.5x to 2.14x the power!
Also, 1.7x the area, which means > 1.7x the cost (at ISO process).
So, it's really disingenuous to talk about the V1 as an example of uArch gains. Again, your heavy bias is clear for all to see.
And even the N2 doesn't look so great, once the power estimates are taken into account. 1.4x the IPC at 1.45x the power (ISO frequency). It starts to look like ARM has finally hit a wall on efficiency.
No, in fact larger, faster cores are less efficient (just like large heavy cars are less efficient than small cars). Small in-order cores like Cortex-A55 are the most efficient.
So higher IPC typically comes at a cost in area and power. V1 is still smaller and uses less power than the latest x86 cores. So I'm not sure what your point is?
And it is not disingenuous in any way to mention the fact that Neoverse V1 has 50% higher IPC. It's simply Arm's fastest core, you can't argue with that achievement. Should we ignore Milan because it is larger and less efficient than Rome? Milan is still a great achievement. Or would you argue it is not?
> larger, faster cores are less efficient (just like large heavy cars are less efficient than small cars).
yes.
> V1 is still smaller and uses less power than the latest x86 cores.
But you didn't compare it to an x86 core. You compared it to the N1, which is like comparing a sports car to a previous year's family sedan, if we use your automotive analogy.
> it is not disingenuous in any way to mention the fact that Neoverse V1 has 50% higher IPC.
While it's an accurate repetition of a credible claim (i.e. something we can treat as a fact), it's what you're implying by it that makes it disingenuous. It's comparing cores in 2 different product lines, with different optimization targets. That's why it's called V1 and not N2. It's not a like-for-like comparison, which makes it difficult to infer much of anything from it. All it really tells us is how much faster ARM can make a server core on 7 nm, if they don't care as much about power or area.
No. Both Neoverse N2 and V1 are not only successors of N1 but their microarchitectures are related to N1. So it is completely reasonable to compare N1 with V1. And the timeframe and performance gains are matching Cortex-A76 to Cortex-X1.
> All it really tells us is how much faster ARM can make a server core on 7 nm, if they don't care as much about power or area.
And that refutes the original claim that Arm has run out of steam in terms of IPC gains. Indeed, V1 uses more power and loses efficiency, but that's the cost of pushing IPC hard. Hence the split in product lines, as different licensees likely want different PPA targets.
> So it is completely reasonable to compare N1 with V1.
No, that's nonsense. The V1 is 33% bigger than the N2 and 70% bigger than the N1, and burns up to 2.14x as much power. A lot of its performance comes from the same places as the X1 vs. A78, as well as SVE. So, it makes as much sense as comparing a car with an 8-cylinder engine to a cheaper & more fuel-efficient 4-cylinder car of the previous model year, from the same manufacturer. One can certainly make such a comparison, but it's unclear what practical relevance it would have.
If you want to look at microarchitectural efficiency improvements, then you'd best focus on an apples-to-apples comparison, like A76 -> A78 or N1 -> N2.
> that refutes the original claim that Arm has run out of steam in terms of IPC gains.
It would, if anyone had made such a claim. What I said was: "It starts to look like ARM has finally hit a wall on efficiency."
As that was before I noticed the 1.45x power figure was at ISO process with the N1, I'll walk that back a couple steps.
I think this move genuinely justifies Arm’s business model: make the ISA—not your architectures—the product.
Startup vendors can…
1. License stock core IP so anyone with money & some silicon expertise can whip up chips. Start making money quickly and build a reputation (and customer list).
2. Then, if vendors feel confident and have the money & expertise, let us license the ISA alone. Keep all our current customers, sell them something even better and customized to their needs, and put us in near-total control.
It’s like franchise mode on steroids: imagine if owning a Subway restaurant franchise let you “upgrade” to an international logistics contract. Imagine the innovation markets it’d create.
If only Intel had understood that 25+ years ago. Today, we get IDM 2.0 promises to allow some vague, non-committal licensing options in the unspecified future (and absolutely not the ISA, because Intel shamelessly believes it does x86 best).
Building a business based on Intel's alleged commitment to licensing sounds about as safe a bet as the companies that tried to license MacOS back in the 90s. Intel now, like Apple then, is being forced to adopt a model they really don't believe in. It's just not who they are. Apple ultimately returned to its true self and that has clearly worked out well for them. I have no idea if Intel can pull off something similar.
> if AMD increases the cores past 64 to near 80-90s then AMD will be on the moon.
Depends if their interconnect can continue to scale. Milan is really getting hurt by it, but some of that is due to the old process node of their IO die. Still, even on a newer node, more cores *and* higher speeds could continue to take a big bite out of their power budget. Meshes scale best.
If I'm reading the review for a deep dive look at the architecture, then it doesn't matter how long the product has been out. The only thing that makes it less relevant is if a newer architecture is written up.
Nvidia should have just bought Ampere, Nuvia or Cavium (back when those were available) instead of ARM.
Would have been cheaper and less risk. Now they pay a lot, have to wait long and face risk in both outcomes:
- They can't buy ARM and end with empty hands. Developing their own but need lots of talent...
- They can buy it and it kills ARM foundation in the long run as customers don't like it and move to RISC-V by next decade.
They could even have bought one of those ARM designers and, on top of that, tried to work with VIA or IBM on an x86 license. I don't think they need it, and it makes minimal sense, but having both would still be cheaper and less risky than buying ARM directly.
Nvidia did, once upon a time. They tried pretty hard to get it - that was what the project that ended up in the Denver cores originally started out as.
As we have seen with Apple's A series and now the M1 chips, rolling your own ARM cores can give you a LOT of performance uplift over ARM's standardized designs. If they want to be competitive on the top end with a compelling enough product to attract hyperscalers, they HAVE to have a world beating product. They weren't going to do that with an ARM default design.
How often, and for how many more years, will Apple show this kind of advance ahead of all other competitors, and have it recognized as a real advance for customers' everyday usage patterns (leaving aside prices and software availability or compatibility)? Thunderbolt, for example, is an Intel engineering product.
Or it can be a flop afters years of investment, like Samsung's in-house cores, or a neat niche processor that doesn't justify the development costs, like Marvell's SMT ARM cores.
But you might not be wrong either. The V1 appears to be no slouch, but can it ever be massaged into something like Apple's cores?
I don't really consider Samsung's efforts with exynos to really be comparable here. At best, they seemed to be largely tweaks of the basic ARM IP, with design decisions that were tilted towards what they considered important for their use cases. It also appears to be the case that they had substandard communication between their CPU architects and their scheduler designers as some benches seemed to indicate.
I also still believe that the relative performance of the Apple A series cores is HEAVILY influenced by their total top down control of iOS as well. Both are fully optimized for each other in ways that no other architecture pairing can fully claim.
Apple's cores have higher IPC, thus higher single-thread performance than ARM's cores.
However, when made in the same TSMC process and when limited at the same power consumption, their performance is the same or worse as that of the ARM's cores.
For multi-threaded applications, which are the most important for servers, the power-limited performance is what counts. From the SPEC benchmarks published here at Anandtech I have not seen any advantage for Apple. The total performance was higher for Apple, but it was accompanied by a proportionally higher power consumption.
Apple's cores are without doubt better in a personal computer, where single-thread performance matters a lot, but until now there is no data showing that they would be better for a server CPU.
Maybe also interesting: what inter-SoC connections would allow multi-vendor mainframes, optimized for routing data to each vendor's processing-profile advantages (on-CPU cache pipelines, hardware logic or xPUs, local memory)? And what should we expect on the ARM side comparable to point-to-point processor interconnects like HTX (~3.1), or Infinity Fabric run over a networking workaround (more capability at the cost of additional hardware complexity), instead of direct QPI-likes?
Sorry, I didn't get the point about multi-socket SoC boards or backplane connectors for mixed-vendor data-processing configurations (although it is interesting, even the price profile for ThunderX/ThunderX2, considering Marvell's cancellation of ThunderX3 "in favor of vertical markets and the hyperscaler server market", per Networkworld).
Don't forget you get 40% higher performance at the same frequency. If you reduce frequency by 10% N2 will be ~20% more efficient than N1 and still 30% faster. So a higher IPC at the same efficiency allows you to be far more efficient if required.
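The frequency trade described here can be modelled with the usual dynamic-power approximation: power scales with f·V², and since voltage tends to scale roughly with frequency, power goes roughly as f³. A rough sketch, not a datasheet calculation:

```python
def downclock(perf_gain: float, power_gain: float, freq_scale: float):
    """Apply a frequency reduction assuming cubic power scaling."""
    perf = perf_gain * freq_scale          # performance scales ~linearly with f
    power = power_gain * freq_scale ** 3   # dynamic power scales ~cubically
    return perf, power, perf / power       # (speed, power, perf/W) vs baseline

# N2's cited 1.4x IPC at 1.45x power, clocked down 10%
perf, power, eff = downclock(1.40, 1.45, 0.90)
print(f"{perf:.2f}x faster, {power:.2f}x power, {eff:.2f}x perf/W vs N1")
```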
I think Ampere sees the writing on the wall. The problem with using standard ARM designs going forward is that it'll be a race to the (pricing) bottom, basically who can deliver the most cores for the buck. Ampere and the other independent ARM server CPU design houses cannot really differentiate on manufacturing; there are only two choices, one of them not as good (Samsung), so everyone wants TSMC. That leaves a better architecture, custom made, to differentiate from the rest of the growing pack.

I think Ampere is trying to pull off what Apple was so successful at doing for mobile and now laptop/desktop use: create a better uarch than the one ARM has to offer. But it's a risky move indeed. And, entirely without sources, here's a thought: maybe they found a sugar daddy to finance that move; I am thinking of a certain company headquartered in Redmond, WA. It would fit with their most recent announcements of customer wins.
The world is just not big enough to support this many ongoing CPU designs. You need a huge market to amortize your costs, and you have to keep doing it EVERY YEAR. You need three or four teams running in parallel each at different stages of the next designs. No way Ampere can maintain that; they can create their one core which may even be good -- but which won't be updated *substantially* till four years later... Meanwhile Apple is delivering annual 20..30% increases, and even ARM Ltd hopes to provide 15% or so annually. Nuvia probably saw that reality from the start, were mainly treading water for a year trying to execute on the "be acquired before we even ship a product" part of the plan.
Apple, ARM (for hyperscalers and random use cases that just want a core, any core) and QC (for mobile, desktop, "servers", and various non-hyperscaler warehouses) is probably all that's economically feasible.
Perhaps acquiring Ampere would be a nice choice for Nvidia, they will then have a complete portfolio for server processors. The ARM deal doesn't seem to be making progress anyway.
ARM has a large bulk of its revenue (as well as a large bulk of its market cap) coming from little IoT processors, while Nvidia has ambitions only for big fat chips no smaller than phone SoCs. Frankly speaking, Nvidia only needs a high-performance CPU to finish its server portfolio. Acquiring ARM is overkill, both technically and politically.
> Frankly speaking, Nvidia only needs a high-performance CPU to finish its server portfolio.
No, they're making a huge push into self-driving cars and robots. They have a whole line of SoCs for that, which have been ARM-based since the time they tried to sell them into the phone and tablet market.
The N2 is supposed to be 40%+, which is a huge jump in terms of IPC. I would expect Ampere's custom core to at least outperform the N2, otherwise why make a custom core?
But then the N2 is actually a pretty decent core at only ~2W.
If memory serves, Fujitsu actually developed their new cores in collaboration with ARM because SPARC had pretty much run its course. In other words, I doubt that Fugaku has much SPARC DNA in it. That ARM-Fujitsu collaboration also resulted in the addition of SVE to large ARM-based server cores; while I don't have an inside track, SVE seems to be something Fujitsu contributed heavily to.
The genesis of SVE remains murky, but I doubt that Fujitsu had much to do with it.
If you look at Apple Patents, there's a large body of patents with the name of Jeffry Gonion on them, surrounding the "macroscalar" architecture. It's hard to be sure quite what this was supposed to be, but if you squint at it you can see various elements of SVE there - certainly predication, and the idea of "indefinite length" vectors.
I could imagine something like Apple talking to ARM around 2010, ARM saying "we have these ideas for the successor to NEON", Apple saying "we have these ideas for how to improve vectors generally", and a synthesis coming from those two pools.
According to whom? Granted, their perf/$ is bad, but I'm sure that's as much due to HPE as anything.
Anyway, I think the commercial play was probably just to recoup some of the development costs and get SVE-capable hardware in more people's hands (i.e. to benefit ecosystem support for SVE). I think the Japanese made it mostly to be self-reliant for their HPC needs, rather than for competitive reasons.
SarahKerrigan - Wednesday, May 19, 2021 - link
Well, okay. I'm utterly mystified by this.

I used X-Gene and eMAG (never XG2 - never even saw one in the wild - but iirc it was a conservative modification of XG.) They were not stellar cores. Altra was a huge breath of fresh air, Neoverse has become legitimately very good and has a strong long-term roadmap... and they want to go back to custom microarchitectures?
deltaFx2 - Wednesday, May 19, 2021 - link
Because there’s nothing they do that ms, fb, google and others can’t do in house. And they probably are already doing so. They see the writing on the wall. Their best chance was to get acquired by a hyper scaler but seems like they could not. Trouble with custom cores is that they are up against competition doing cores for decades. I see trouble ahead

Silma - Wednesday, May 19, 2021 - link
I would have thought Ampere would go the risc-v way, specializing in a few strategic IPs to keep a performance or performance per watt advantage on competition.

mode_13h - Friday, May 21, 2021 - link
Going RISC-V right now would be getting out ahead of the market. Hyperscalers seem to be focused on x86 and ARM.FunBunny2 - Wednesday, May 19, 2021 - link
"Trouble with custom cores is that they are up against competition doing cores for decades. I see trouble ahead "the trouble with 'developing' any kind of thingee that's rooted in maths, and cpu (and all the bells and whistles moved on-chip) are surely maths, is that over time, and not always much of it, there is discovered One Best Solution. the mainframe hasn't changed in decades. OTOH, the 8086, in all its variations, was not the One Best Solution, but managed through pure luck and some questionable behaviour, to become The One Solution in microprocessors. in due time, if not already, there will be One Best Solution at the micro-arch level, too. moving evermore off-chip function on-chip isn't exactly any form of innovation, just incremental lock-in to the chip.
alfalfacat - Wednesday, May 19, 2021 - link
To be blunt, you have no idea what you are talking about. Modern chips are all about well-known tradeoffs along various dimensions, and there is definitely not One Best Solution. Your comment about mainframes not changing in decades is also just factually wrong - new Z mainframes are immensely different from last century's mainframes, they just have really good backwards compatibility.

FunBunny2 - Wednesday, May 19, 2021 - link
"new Z mainframes are immensely different from last century's mainframes"it's still the same CISC architecture it always was. registers are wider and such hardware advances (pretty much just a microprocessor implementation), but just as 8086 remains the core, so to speak of Intel/AMD, so is the 360 for z. it's backward compatible just because it's the same. if it weren't the One Best Solution, then there'd be a host of other mainframe, aka COBOL, machines. Sun, among others, tried to take the mini/server construct to overthrow the IBM mainframe. none pulled it off.
dotjaz - Thursday, May 20, 2021 - link
> it's backward compatible just because it's the same

How stupid are you? Modern x86-64 with all the extensions is nothing like 8086.
FunBunny2 - Thursday, May 20, 2021 - link
of course it is: it executes the 8086 ISA. it also executes some additional instructions. same for the 360 to z. does the current 86 cpu have a unique ALU not available in 1978? does it not execute original 16 bit instructions? of course it does. 99.44% of the 'innovation' since the 360/30 or 8086 has been in widening registers, piling on on-chip caches, and pulling off-chip functions on-chip. the killing off of the Seven Dwarfs and, later, the non-86 ISA microprocessors (Power survives; ARM is still up in the air) has led to a computer monoculture. well, duoculture.

backward compatibility is almost entirely about OS stability, since that's what applications know about. keep the system calls stable, and your machine is backward compatible. that's the reason you can run 360/DOS code on today's z; not that you would want to. OTOH, up to a few years ago, dedicated 360 code was still running; it ain't broke so don't fix it.
OTOH, as is well known, any machine with a C compiler can run *nix, IOW at least one OS (well, a few of them) don't much care about the ISA.
arashi - Friday, May 21, 2021 - link
I was confused to see your name one day when I was flipping through the dictionary, until I realized your name turned up as an example for Dunning-Kruger.mode_13h - Friday, May 21, 2021 - link
Leaving aside the rest of your questionable claims...

> any machine with a C compiler can run *nix
You need a MMU for a proper UNIX-like OS. A lot of DSPs and microcontrollers that are C-programmable don't have one. So, no. It takes more than a C compiler.
persondb - Friday, May 21, 2021 - link
You know absolutely nothing about what you are talking about. The original 8086 didn't have a shit ton of stuff in it.

What you are just arguing is that since the model of computation is the same as it was over 100 years ago (i.e. Turing Machines/Lambda Calculus), then nothing has changed. And that ignores branch predictors, superscalar cores, out of order execution, etc etc. Or that even modern x86 cores might not even be technically x86 anymore, as they decode the ISA into macro-ops and those into micro-ops that might arguably be an 'internal ISA'.
mode_13h - Sunday, May 23, 2021 - link
> The original 8086 didn't have a shit ton of stuff in it.

Yes. The UNIX security model depends on memory protection, which x86 lacked in any form until the 80286. However, I think x86 didn't have a proper MMU until the 80386.
FunBunny2 - Monday, May 24, 2021 - link
one can, and MicroSoft did, build a business on a Real Mode *nix, with a software patch to 'emulate' protected multi-user mode (I don't recall a cite, boo hoo). they called it Xenix, and it was built to the 8086, later others. until Kildall blew off Armonk's ask for a CP/M-86, who then went to MicroSoft for a control program for the 8088, MicroSoft had no interest in anything but *nix. as usual, Bill lied and said, 'sure, we've got one of those'. so he conned DOS from Seattle Computer and the rest is history. the 8086 was only a real mode processor, which required the patch to run Xenix multi-user. the 80286 supported both real and protected mode, alas, it couldn't return to real mode from protected mode except by power cycling. the 80386 was the start of something Big.
as to moving all those other functions on-chip, that didn't happen as innovations by the microprocessor designers, but solely because the hardware folks gave them increasing transistor budgets. all of the functions mentioned had existed for a couple of decades on mainframes (and some minis, notably DEC), which already had those transistor budgets, and price points to match.
mode_13h - Monday, May 24, 2021 - link
> Real Mode *nix
Real Mode didn't exist until they added Protected Mode.
> with a software patch to 'emulate' protected multi-user mode
You could do multi-user, but not protected mode. 8086/8088 had no memory protection. Period.
> all of the functions mentioned had existed for a couple of decades on mainframes
IBM did things in the 1960's that most people think weren't invented until the 1980's.
FunBunny2 - Tuesday, May 25, 2021 - link
"IBM did things in the 1960's that most people think weren't invented until the 1980's."
and it wasn't, mostly, IBM who did the innovative things. Burroughs implemented virtual memory in the B5000 in 1961, years before IBM. the other 6 dwarves invented most of the other stuff IBM gets credit for. well, maybe not RCA; they ended when they got caught 'cloning' the 360.
Timoo - Thursday, May 20, 2021 - link
No, there's not.
9 + 5 = 5 + 9, and they are 2 different ways to come up with 14, equally "best".
5 + 5 + 4 is longer, but might have a less complex structure in design, ending up being able to clock higher, for instance.
So, no, most mathematical problems have several solutions, all of them being correct. The application of it determines which method works best. In short: If you take into account the whole chain of commands, you can only try to find the best solution for that specific use-case. Which might be a disaster on a different chain of commands.
Timoo - Thursday, May 20, 2021 - link
ps. To make it even more complex: 5 + 5 + 5 - (1/5 * 5) gives you 14 as well. And although it requires a lot more instructions, that might be the trade-off for a much simpler design. If you can have 4x as much compute power in your design, while doubling the amount of instructions, you are still twice as fast.
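Timoo's trade-off can be put in rough numbers. A quick sketch (all figures hypothetical and normalized, purely to illustrate the argument): a simpler design that executes instructions 4x as fast but needs 2x as many of them still finishes tasks twice as quickly.

```python
# Hypothetical normalized figures - they illustrate the trade-off above,
# not any real CPU design.
complex_rate = 1.0    # instructions executed per second
complex_instr = 1.0   # instructions needed per task

simple_rate = 4.0     # simpler design executes 4x as fast
simple_instr = 2.0    # but each task needs twice the instructions

complex_throughput = complex_rate / complex_instr  # tasks per second
simple_throughput = simple_rate / simple_instr

print(simple_throughput / complex_throughput)  # 2.0: still twice as fast
```

The divisor matters as much as the rate: a "worse" instruction sequence on a faster design can still win overall.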
Thargon - Saturday, May 22, 2021 - link
I think you just discovered the reason why computers perform binary calculations only ;)
FunBunny2 - Thursday, May 20, 2021 - link
"most mathematical problems have several solutions"
not for a sufficiently rigorous specification. that's the point.
alfalfacat - Wednesday, May 19, 2021 - link
I think this must be it. If you're a pure IP house in the business of packaging someone else's IP, well, you could make a decent systems integration business out of selling to smaller firms, but not to hyperscalers who have more scale than you. I suspect enterprise/SMB customers are sticking with x86, so you're getting squeezed out of both ends of the market.
I agree that this is a surprising move, and one that I don't know will pan out, but their alternative looks grim.
Yojimbo - Wednesday, May 19, 2021 - link
I think you are right. Why would a hyperscaler buy ARM-designed cores from Ampere when it can do the same modifications itself, except more tightly tuned to its individual needs? At the same time, the time frame for (potentially) selling ARM cores to enterprises is far beyond the financial viability of Ampere. So for Ampere it's a matter of survival to try to design its own cores. Plus, having its own cores certainly makes it a better takeover target than having a business based on designing around ARM cores. Engineering teams can be hired away. IP stays with the company. Cf. Apple and Imagination Technologies. Imagination ended up being acquired by a Chinese company, and Apple went back to licensing IP from it even after Apple had hired away a lot of its engineers. Ampere's hope is to develop and show something worth being acquired for, like Nuvia did.
Matthias B V - Wednesday, May 19, 2021 - link
I agree. With most companies starting their own designs, they just work on trying to increase value fast enough and then get sold for IP, and especially talent, to a big player looking to extend its own capacities and talent.
serendip - Thursday, May 20, 2021 - link
Interesting. That's what Amazon did to get Graviton: buy a startup working on customized ARM cores to bring that expertise in-house. Microsoft could end up doing the same with Ampere, but with the added bonus of having ARM-compatible custom cores like what Apple has. It's still important for ARM to continue licensing out core designs as a backup plan, because any of these custom cores could go the way of Exynos.
Thala - Wednesday, May 19, 2021 - link
Thing is, their engineering team has been doing cores for decades as well - Intel x86-64 cores, to be precise. At the moment, they just have to convincingly beat the x86 competition, which should be doable - beating other ARM offerings is of course going to be harder.
mode_13h - Friday, May 21, 2021 - link
Yeah, they have to try and differentiate themselves from everyone else who's just taking ARM's stock IP. If custom doesn't work, they could always revert to using ARM's stock IP, but the big money is in substantially beating ARM, for at least a few market niches.
I think they can probably do it, too. Especially if Neoverse continues to be just mobile cores on steroids.
Raqia - Wednesday, May 19, 2021 - link
Less reliance on ARM IP, should Nvidia consummate its deal, is a plus for them too.
dotjaz - Thursday, May 20, 2021 - link
There's absolutely no chance Nvidia can buy ARM. The US will have to lift all the semiconductor bans, and promise never to impose them again, for China to approve the deal.
Raqia - Thursday, May 20, 2021 - link
I agree with that. In addition, the disaffected CEO of ARM china is resisting the deal and seems to have the support of local authorities as well.sgeocla - Wednesday, May 19, 2021 - link
Milan is already out, yet they are still comparing their next-gen 2022 product against AMD's previous-generation EPYC Rome. By the time this launches, they will compete with AMD EPYC Genoa, with 50% more cores on 5nm, so they are almost back to where they are now with 80 ARM cores vs. 64 x86 cores.
If Ampere learned anything from EPYC's ramp over Xeon, it's that they need to provide huge benefits and have a proven track record to gain market share, not just barely compete.
SarahKerrigan - Wednesday, May 19, 2021 - link
Altra Max is a 2021 product.
sgeocla - Thursday, May 20, 2021 - link
Yeah, sure. Ampere is going to get 60% more cores and everything else needed to support them (memory, IO) for free, and it will work across all workloads, not only the cherry-picked ones on the slides.
The 80-core Altra is already monolithic, and if they remain monolithic at 128 cores then their yields are going to be cut down dramatically. If it's not monolithic, then every latency-sensitive workload is going to be severely affected. There's no free lunch.
Wilco1 - Thursday, May 20, 2021 - link
Remember once again, this is targeted at cloud workloads, so not all workloads are expected to improve. Compute-intensive workloads will see major increases, but bandwidth-intensive workloads won't until DDR5. If AnandTech runs the same set of benchmarks, Altra Max should beat Milan in all except for SPECjbb.
They will have great yields - with so much of the silicon devoted to cores, most defects will land in the cores. Those dies are sold as parts with 112 and 96 cores.
Spunjji - Friday, May 21, 2021 - link
Exactly that. With so many small-ish cores and accompanying lumps of cache memory, handling yield issues will look similar to how they did things on late-2000s-era GPUs.
Yojimbo - Wednesday, May 19, 2021 - link
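The salvage argument in this sub-thread can be made concrete with a toy Poisson yield model. Every number here (defect density, die area, the fraction of the die occupied by cores) is assumed purely for illustration; the point is just that when most of the die is replicated cores, most defective dies remain sellable as cut-down parts.

```python
import math

# All numbers assumed for illustration, not Ampere's actual figures.
defect_density = 0.1   # defects per cm^2
die_area = 4.0         # cm^2
core_fraction = 0.8    # fraction of die area taken by cores + their caches

# Poisson model: probability of zero defects on the whole die.
perfect = math.exp(-defect_density * die_area)

# A die with defects only in core area can be salvaged by fusing off bad
# cores; it is scrapped only when a defect hits the shared "uncore" logic.
uncore_area = die_area * (1 - core_fraction)
usable = math.exp(-defect_density * uncore_area)

print(f"fully working dies: {perfect:.1%}")    # ~67%
print(f"usable incl. salvaged: {usable:.1%}")  # ~92%
```

Under these assumed numbers, salvaging roughly triples the fraction of otherwise-lost dies that can still be sold, which is the same logic GPUs have used for years.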
Besides Altra Max being a 2021 product, as SarahKerrigan already pointed out, even in Q2 AMD expects more Rome sales than Milan sales. AMD expects both Rome and Milan to be selling right through to the end of 2021, and "by the third quarter that it would cross over and Milan would perhaps be higher than Rome." So a comparison to Rome is certainly appropriate. As for leaving Milan out, that could be inappropriate, but it may be appropriate to the segment of the market Ampere is pursuing.
BTW, why'd you point out the Rome thing and not Cascade Lake?
sgeocla - Thursday, May 20, 2021 - link
They could have chosen Milan, which has been available for some time and can even be benchmarked on the hyperscalers' instances, but then even their current Altra would look just behind in their slides. Why buy and test the Altra when it's already behind even in the cherry-picked benchmarks?
Ice Lake launched later, and Intel only sold like 150k units before launch, so they have some excuse for not benching against it, although I'm sure that they could have gotten their hands on one if they really wanted to. Heck, even youtubers manage to get parts under NDA that are not provided to them on launch day, but somehow a company as big as Ampere can't manage to get their hands on a single sample.
Wilco1 - Thursday, May 20, 2021 - link
It's not obvious at all that Milan would do better than Rome in those particular benchmarks. Even on SPECINT_rate, Milan is disappointing, as the large IPC increases of Zen 3 are not showing up. In fact the gains are mostly due to increasing power consumption by 25%. If you limit Milan to the same TDP as Rome, it ends up SLOWER, as AnandTech showed here: https://www.anandtech.com/show/16529/amd-epyc-mila...
Contrast that with Altra Max, which runs 60% more cores at the same TDP.
dotjaz - Thursday, May 20, 2021 - link
> So a comparison to Rome is certainly appropriate
No it's not. By your own admission, in Q3 - when Altra Max would certainly not be selling in any greater volume than Quicksilver - Milan would outsell Rome. Why would it be appropriate to compare your product to a competitor's older, lower-volume product when the new product has been out for some time and is expected to be the main competitor?
Spunjji - Friday, May 21, 2021 - link
Because the performance difference between Rome and Milan in these benchmarks will be small to non-existent. There's no need to take our word on this; you can happily wait and see.
Wilco1 - Wednesday, May 19, 2021 - link
Ampere Altra already matches (and often beats) Milan as per AnandTech's review. Altra Max opens a huge gap over Milan, and so should match Genoa as well. Ampere's next generation will then increase the gap even further. Core count might increase beyond 128, as suggested in the slides. That's not barely competing, but leading performance at significantly lower cost and lower power.
sgeocla - Thursday, May 20, 2021 - link
Aren't you going to mention any of the workloads where even the 28-core 8280 beats the 80-core Altra? And now there's the 40-core Ice Lake version, with Sapphire Rapids incoming.
AMD hasn't been able to ramp up even with a 2x performance gap over Xeons, because it's not just about superior performance. The platform needs to be stable (they just destabilized it by going custom with in-house cores) and engineering support must be there. Customers aren't going to wait months for their tickets to be solved just because Ampere doesn't have the thousands of engineers necessary to support their ramp.
Wilco1 - Thursday, May 20, 2021 - link
There was only one workload where Altra did really badly (SPECjbb), but in almost all other workloads (SPECINT_rate, LLVM compile, NAMD), even 2x 8280 was not able to match a single Altra. In SPECFP_rate, 2x 8280 was barely faster than 1x Altra. The incoming Ice Lakes have to compete with Altra Max.
mode_13h - Friday, May 21, 2021 - link
> in almost all other workloads (SPECINT_rate, LLVM compile, NAMD)
> even 2x 8280 was not able to match a single Altra.
You're perhaps focusing too much on highly-scalable workloads. Both brands of x86 CPUs are still well ahead in per-core (if not per-thread) performance.
Wilco1 - Friday, May 21, 2021 - link
Altra beats the 7742 on per-core performance and is only 6% behind the 7763: https://images.anandtech.com/graphs/graph16529/119...
As for per-thread, Altra has 62% higher performance than the 7763! So per-thread performance is absolutely awful on x86; I don't see how you can possibly claim x86 is ahead with a straight face.
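The per-core vs. per-thread dispute in this sub-thread is really just a choice of divisor. A sketch with placeholder scores (invented to mirror the shape of the argument, not AnandTech's actual numbers) shows how the same aggregate result can favor either side:

```python
# Placeholder aggregate scores: an SMT-less part (80 cores = 80 threads)
# vs. a 2-way SMT part (64 cores = 128 threads) with a slightly higher total.
altra_score, altra_cores, altra_threads = 250.0, 80, 80
epyc_score, epyc_cores, epyc_threads = 260.0, 64, 128

print(altra_score / altra_cores)    # 3.125   per core
print(epyc_score / epyc_cores)      # 4.0625  per core  -> SMT part ahead
print(altra_score / altra_threads)  # 3.125   per thread
print(epyc_score / epyc_threads)    # 2.03125 per thread -> SMT part behind
```

Dividing by cores rewards big SMT cores, while dividing by threads penalizes them, which is why both sides of this exchange can cite the same review.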
mode_13h - Friday, May 21, 2021 - link
> As for per-thread, Altra has 62% higher performance than 7763!
> So per-thread performance is absolutely awful on x86
Yeah, but you surely know that's an apples vs. oranges comparison. The marginal cost of the hyperthreads is like 10% more area. So, it ends up being a win to add SMT. Otherwise, they wouldn't do it.
And let's not forget that, when you're comparing per-thread performance, the 7763 has 128 threads vs. only 80 threads in the Q80-33. So, the concept of having 2x as many threads that are less performant is cut from the same conceptual cloth as Ampere's idea of using a larger number of slower ARM cores. You can't really criticize the SMT of x86 CPUs without implicitly criticizing ARM and Ampere's approach of making up for the slowness of their cores by using more of them.
And AMD nets 12% better aggregate performance and a win on SPECint2017 Base Rate-N from the technique. If the shoe were on the other foot, you'd be trumpeting how smart a move it is for ARM to be doing it.
But we can all now see you're not here to have a good faith discussion of the relative strengths and weaknesses of these CPUs. You're cherry-picking and spinning literally every single thing you can, to make Altra look like it dominates, when it's merely competitive.
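The area-vs-throughput trade being debated here reduces to one division. Using the figures quoted in this thread (~10% extra area as an upper bound, ~12% aggregate throughput gain; both are rough discussion numbers, not measurements):

```python
# Rough numbers from the discussion above; treat them as assumptions.
smt_area_cost = 0.10  # SMT adds up to ~10% core area (Intel claims <5%)
smt_gain = 0.12       # ~12% more aggregate throughput with SMT on

# Throughput per unit of core area, normalized to the no-SMT case:
tput_per_area = (1 + smt_gain) / (1 + smt_area_cost)
print(round(tput_per_area, 3))  # 1.018: a marginal win at 10% area cost

# With Intel's claimed <5% area cost, the case looks better:
print(round((1 + smt_gain) / 1.05, 3))  # 1.067
```

So whether SMT "pays" hinges almost entirely on the true area cost and on how workload-dependent the throughput gain is.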
Wilco1 - Saturday, May 22, 2021 - link
Assuming SMT adds 10% area on x86, getting just 12% better throughput is not a great advertisement for SMT. There are of course scenarios where SMT does better, but I don't believe SMT makes much sense in servers. With Arm's smaller cores, it's better to add more cores.
When Altra Max benchmarks are out, you will see that adding real cores works out better than adding SMT. Altra Max will outperform Milan by a good margin at lower power. Its performance per thread will also be higher than an SMT thread on Milan.
If you still didn't get the importance of the historical achievement of a licensed Arm server core beating Intel/AMD's latest and greatest, Altra Max will surely do it.
deltaFx2 - Saturday, May 22, 2021 - link
Intel states SMT area and power cost is less than 5%. So it's almost free for 2-way SMT: https://software.intel.com/content/www/us/en/devel...
Moreover, SMT can be turned off if your application does not need it, as in some cases of HPC.
And dream on… if adding cores by cutting caches results in higher per-thread performance, then caches have no merit. Which is utter nonsense. Altra 128c is rumored to have 16MB total LLC to make room for more cores. Altra 128c is a SPEC rate accelerator, not a general-purpose server. See the SPECjbb results. Altra is Calxeda redux: it posts high SPEC rate scores, with crap latency performance under load.
mode_13h - Sunday, May 23, 2021 - link
> Intel states smt area and power cost is less than 5%.
Thanks. I don't know where I got the 10% figure. Maybe I imagined it, or maybe it dates back to simpler core designs.
Wilco1 - Sunday, May 23, 2021 - link
Those kinds of claims are likely as trustworthy as 10nm status claims - without a detailed paper explaining which buffers were duplicated and how many bits were added to all the cache structures to differentiate between threads, it's hard to know how it was computed.
Some Arm cores have SMT, so the cost on an Arm design is known (but not public). Those cores haven't been big successes, and no new SMT cores have been announced since, so that suggests it is not as easy or cheap as claimed. Given the myriad of security issues with SMT, I'm not expecting future designs to have SMT either.
Read what I said: Altra Max will beat Milan on overall performance and per-thread performance. Adding more cores will of course reduce performance per core due to cache/bandwidth limitations, but throughput still goes up significantly. Altra Max is special purpose like A64FX - with Arm you will see different designs targeting specific niches.
mode_13h - Monday, May 24, 2021 - link
> that suggests it is not as easy or cheap as claimed.
Or maybe just that it doesn't make as much sense on mobile, which is still ARM's top priority. So far, all their server cores are still derived from mobile ones.
> Given the myriad of security issues with SMT
This is another potential explanation why ARM didn't bother to pursue SMT. However, recent kernel work by Google & others has eliminated SMT-related security issues by optionally preventing threads from different processes from sharing a physical core.
> Altra Max will beat Milan on overall performance and per-thread performance.
If they indeed accomplish that by further reducing LLC, then it could be somewhat of a Pyrrhic victory. It will be telling if a number of benchmark scores actually *drop*, relative to the 80-core model.
mode_13h - Sunday, May 23, 2021 - link
> Assuming SMT adds 10% area on x86, getting
> just 12% better throughput is not a great advertisement for SMT.
It's not. Now, I don't know if they actually disabled HT or just relied on the thread scheduler to execute only one thread per core. Switching to the 1S results should control for that a little better, in which case EPYC 7763 gets a 16.2% benefit. The 7742 gets a 15.4% benefit, and Xeon 8280 gets a 16.8% benefit.
FWIW, my experience compiling software has shown a much bigger benefit from SMT. So, that suggests the amount of benefit is highly workload-dependent. This makes intuitive sense, since we know that the amount of ILP in code & its cache hit-rate varies quite widely.
> I don't believe SMT makes much sense in servers.
It works well for GPUs!
> When Altra Max benchmarks are out, you will see that adding real cores works
> out better than adding SMT.
Could be, though PPA (and therefore perf/$) will be lower than the 80-core versions. And those are Altra's biggest selling points.
Anyway, I'm a lot more excited to see the V1 benchmarked. It's going to be interesting to see just how well 2x256-bit SVE measures up against AMD's 4x256-bit and Intel's 2x512-bit vector engines.
> If you still didn't get the importance of the historical achievement ...
I get it.
I expected N2 would be the next nail in x86's coffin, but its ~3% worse efficiency has me wondering if it will really quicken the pace of the ARM transition.
I was also a little disappointed to find nothing revolutionary in ARMv9. I hope ARMv10 is a more fundamental rethink of the ISA.
mode_13h - Sunday, May 23, 2021 - link
> And those are Altra's biggest selling points.
Sorry, that was an incomplete thought. The Max should have better perf/W, which is probably Altra's biggest selling point. Depending on how much better, it could easily overcome the perf/$ difference.
deltaFx2 - Monday, May 24, 2021 - link
In this paper, Google claims the effect of 2-way SMT is ~40% on Haswell (much higher on POWER8, but POWER8 has worse single-threaded frequency and is designed to be a >2-way SMT machine). See Fig 2b: https://web.stanford.edu/~kozyraki/publications/20...
In this older paper, they report significant uplifts from SMT as well: https://static.googleusercontent.com/media/researc... See Fig 14. Quote: "core throughput doubles with two hyperthreads".
Just because ARM doesn't do it doesn't mean it's a bad idea.
deltaFx2 - Monday, May 24, 2021 - link
Mistake: single threaded frequency -> single threaded performance
mode_13h - Tuesday, May 25, 2021 - link
> in this paper, google claims the effect of 2-way SMT is ~40% on haswell
I think I get at least that much benefit, when compiling code.
Spunjji - Friday, May 21, 2021 - link
"they just destabilized it by going custom with in-house cores"
Wut? That's a heck of a prediction.
Spunjji - Friday, May 21, 2021 - link
Also, nobody's going to buy Altra for workloads where it loses to 28-core Xeons... I don't think anybody's under any illusions about that.
arashi - Friday, May 21, 2021 - link
It's about predictability. If they still used Neoverse designs, ARM announces PPA gains much earlier, and customers can do their own napkin math on the perf/core count for the next generation of Ampere, and plan accordingly.dotjaz - Thursday, May 20, 2021 - link
Are we reading the same review? Where did it say Altra outright matches Milan? It can't even reliably match Rome.
"The Altra Q80-33 sometimes beats the EPYC 7742, and loses out sometimes – depending the workload. The Altra’s strengths lie in compute-bound workloads where having 25% more cores is an advantage. The Neoverse-N1 cores clocked at 3.3GHz can more than match the per-core performance of Zen2 inside the EPYC CPUs.
There are still workloads in which the Altra doesn’t do as well – anything that puts higher cache pressure on the cores will heavily favours the EPYC"
Wilco1 - Thursday, May 20, 2021 - link
Did you actually look at the Milan vs Altra results? Altra matches Milan on SPECINT_rate and beats it on NAMD. LLVM compile is close; the only one Altra doesn't do well on is SPECjbb.
mode_13h - Friday, May 21, 2021 - link
> Did you actually look at the Milan vs Altra results?
SPECint2017 Rate-N (Single Socket):
EPYC 7763 (NPS4): 4 wins
Altra Q80-33 (Quad.): 5 wins
Ties: 1
SPECfp2017 Rate-N (Single Socket):
EPYC 7763 (NPS4): 5 wins
Altra Q80-33 (Quad.): 5 wins
Ties: 2
So, to say that "Ampere Altra already matches (and often beats) Milan" is not accurate. It often matches or beats Milan, but Milan still pulls a few big wins and edges it out in several other cases.
And if you look at the aggregates, Altra really gets its ass handed to it by Milan on SPECfp2017 Base Rate-N. Kinda funny how you never mentioned fp.
https://www.anandtech.com/show/16529/amd-epyc-mila...
Wilco1 - Friday, May 21, 2021 - link
If you had read the Altra articles, you would know Ampere explicitly aims Altra at integer cloud workloads and even down-tuned floating point for better integer performance. Hence integer is much more important than FP.
Still, it doesn't do badly at all on fprate: the 7742 is only 2% faster, the 7763 is 15.5% faster.
So you agree Milan really gets its ass handed to it by Altra on NAMD since it beats Milan by 16.7%?
mode_13h - Friday, May 21, 2021 - link
> Ampere explictly aims Altra at integer cloud workloads and even down tuned floating
> point for better integer performance.
You're saying they *customized* the N1 cores in Altra to have less floating-point performance?? Citation needed!
> So you agree Milan really gets its ass handed to it by Altra on NAMD since it beats Milan by 16.7%?
Sure. The data is reported as you say, although I'm unclear on their point about multicore vs. MPI and whether that means they ran different modes on the CPUs.
Also, I wasn't the one trying to characterize the relative performances of these CPUs, so it contradicts nothing I've said or implied.
Lastly, it's one test case vs. 22 in the SPEC2017 benchmarks. I think the fact that it's reported as a separate result is over-weighting its importance, whereas it should probably be viewed on par with any of the individual SPEC2017 tests (or perhaps less, considering their misgivings about its core-local nature and the build inconsistencies).
deltaFx2 - Saturday, May 22, 2021 - link
Ampere Altra is perfectly fine for latency-insensitive workloads like web hosting. But if you care about tail latencies, the SPECjbb result makes its failings obvious. This has less to do with Arm and more to do with Ampere's poor architecture choice of insufficient LLC. Now obviously if raw single-thread performance matters, a lower-core-count x86 is your best bet. See https://www.anandtech.com/show/16529/amd-epyc-mila... for per-virtual-cpu perf. Note the F optimized part with SMT off at the top of the list.
More cache on the Ampere part wouldn't have done much for SPEC rate, but would have avoided the obvious glass jaw that SPECjbb demonstrates. If you know for sure that you are only ever going to run latency-insensitive batch jobs, Ampere is more than adequate. If tail latency matters to you, the SPECjbb data point indicates a huge red flag. A 2P Altra loses to 1P Milan on tail latency. And this is Ampere's fault for wanting to win on SPEC rate at all costs.
If you don't know what might be running, a sane person would prefer the design that has fewer glass jaws. Google has made much of the importance of tail latency as a reason why fewer beefier cores are better than a flock of weak ones. I suspect a lot of cloud workloads are like that, but I have no data other than Google papers. But yeah, if you have an application-specific server like web hosting, Altra is adequate.
mode_13h - Sunday, May 23, 2021 - link
That's a very informative wall of text. A couple paragraph breaks would make it easier to digest.
deltaFx2 - Saturday, May 22, 2021 - link
Also, what is NAMD representative of? It doesn't predict SPEC fp rate despite being a part of it. Due to its tiny working set of under an MB, SMT likely hurts it, but the AT folks don't report an SMT-off score. So sure, it's something, but it's about as representative as AMD's favorite Cinebench. I think the AT test suite for Arm is biased towards things that can be built and run reliably and fairly on both platforms. Perhaps that explains its inclusion despite commentary from them discounting its utility as a benchmark.
Wilco1 - Sunday, May 23, 2021 - link
I agree it would be great to see more server benchmarks. ServeTheHome did LLVM compile, C-Ray, 7-zip, MariaDB and NGINX.
Wilco1 - Sunday, May 23, 2021 - link
> You're saying they *customized* the N1 cores in Altra to have less floating-point performance?? Citation needed!
Absolutely, and not a little either. SPECINT and FP scores are normally quite balanced; for Graviton they are within 7%: https://images.anandtech.com/graphs/graph15578/115... . However, on Altra they differ by 37%. So Graviton 2 has effectively 28% faster FP than Altra despite both using the same N1!
There are many different settings to customize Arm cores, one thing that was mentioned by AnandTech was disabling of advanced prefetchers: https://www.anandtech.com/show/16315/the-ampere-al...
mode_13h - Sunday, May 23, 2021 - link
> > You're saying they *customized* the N1 cores in Altra to have less
> > floating-point performance?? Citation needed!
> Absolutely, and not a little either.
Again, I'm looking for specific claims that Ampere did something to the N1 cores in Altra that directly reduced their FP performance. Not benchmark results, which could have many explanations.
> There are many different settings to customize Arm cores
But not the number of FP pipes, issue ports, or latency! Prefetchers *could* have disproportionately impacted FP benchmarks, but they're not specific to floating-point and can also hurt integer code.
dotjaz - Thursday, May 20, 2021 - link
With the Milan review, the only thing the Q80-33 was able to win is NAMD. All the others range from barely keeping up to losing by a HUGE margin. 2x Q80-33 can't even match 1 EPYC 7742 in MultiJVM critical-jOPS, and is only 27% better in max-jOPS - that's 160 cores/2 sockets vs 64 cores.
Wilco1 - Thursday, May 20, 2021 - link
Besides NAMD, Altra also wins most benchmarks in SPEC_rate - I count 12 wins in 22 int+fp benchmarks, often by a good margin. Altra Max will do even better. The only big loss is SPECjbb (and it will likely be the only loss for Altra Max) - I guess Ampere is not optimizing for that workload.
deltaFx2 - Thursday, May 20, 2021 - link
NAMD happens to have a tiny data cache footprint of less than an MB. Not surprisingly, it scales linearly in the number of cores. In every review of late, the AT authors say they don't like this benchmark, but they post it regardless for reasons I don't understand.
Ampere Altra is designed to "win" in SPEC rate at all costs. Rumors on the internet (somewhat credible) indicate that they cut the L3 cache to 16MB to fit 128 cores. 60% more cores apparently give 19% more perf. See how nonlinear that is? I told you so a while ago… SPEC rate scales in the number of virtual cpus until you run out of dram bandwidth. Cache capacity does not matter. Just add cores and watch it fly. Memory bandwidth saturation has already been hit around 64c, beyond which additional cores become less effective. Milan is also affected by this, where higher IPC does not buy as much throughput in 64c configs. Ampere crippled themselves further by removing cache and increasing dram pressure. All to win at a benchmark that is maybe vaguely representative of latency-insensitive homogeneous batch processing.
SPECjbb, on the other hand, is a latency-sensitive server load, and Ampere's 80c design in a 2P config is barely ahead of a 1P Milan in the AT review. I expect the 128c version to suck even more, but let's see.
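The nonlinearity being described here can be reproduced with a toy bandwidth-saturation model. All parameters (the saturation point, the residual scaling factor) are invented purely to show the shape of the curve, not fitted to Altra:

```python
# Toy model: throughput scales linearly with cores until the memory system
# saturates; past that point each extra core contributes only a fraction of
# its nominal throughput (cache hits still allow some forward progress).
def throughput(cores, saturation_cores=64, residual=0.25):
    capped = min(cores, saturation_cores)
    excess = max(0, cores - saturation_cores)
    return capped + residual * excess

base = throughput(80)            # 64 + 0.25 * 16 = 68.0
big = throughput(128)            # 64 + 0.25 * 64 = 80.0
print(round(big / base, 2))      # 1.18: 60% more cores, ~18% more throughput
```

With these made-up parameters the model lands near the ~19% figure quoted above, but the exact number depends entirely on the assumed saturation point and residual factor.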
mode_13h - Friday, May 21, 2021 - link
> I count 12 wins in 22 int+fp
Ah, so *now* you mention fp, when you can skirt around its big *loss* on the aggregate score.
And just to reiterate, my count is 10 wins + 3 ties for Altra.
I don't pretend to know why, but you're really spinning Altra's results pretty hard. At this point, I'd say you have a clear bias.
Wilco1 - Friday, May 21, 2021 - link
If anything, you have a clear bias. You are the one who falsely claimed that x86 wins both on per-core and per-thread performance, despite clear evidence to the contrary.
mode_13h - Friday, May 21, 2021 - link
> You are the one who falsely claimed that x86 wins both on per-core and per-thread performance
No, I explicitly said it wins on per-core performance, NOT per-thread. Learn to read, please.
ballsystemlord - Wednesday, May 19, 2021 - link
Pity we'll only be able to play with the new arch on servers. Other than that, it sounds very interesting.
Kangal - Thursday, May 20, 2021 - link
I agree.
I think we need someone like ASUS to start selling motherboards that take in a new CPU socket. However, this one is designed for ARM processors, and the socket details are open-source for any company to design for. Let's say it scales from 1 core up to 32 cores, and the kit is reasonably (USD $200) priced. On top, it comes with an open-source stable Debian OS.
Now everyone can get their hands on an ARM kit, and do development straight on the product. Whether that's a Qualcomm chip, Samsung, HiSilicon, MediaTek, Rockchip, Unisoc, VIA, AMLogic, or Allwinner doesn't matter.
And I suspect very quickly we would see an official Windows10-ARM build, AndroidOS, Custom ROMs, and lots of Linux Distros. Perhaps even a Hackintosh. With this comes the mainstream transition of "big programs" that usually live in Windows, to get officially ported to ARM, without the need for dumbing things down for mobile.
It would be extremely handy for companies to get their hands on this kit, mainstream, with good support and cheap pricing... to basically just play with it, and find out if their software (or solutions) are adequately fulfilled. I suspect for most it would be. Which makes it a much easier decision for them to transition away from an x86 server to an ARM server. They could potentially change out their server-connected tiny office PCs (eg NUC) to run from a small ARM box as well. If or when this happens, we should see IT/software professionals getting a better and better job market, and the older AMD/Intel duopoly having to massively cut retail costs (-70%?) to remain profitable.
arashi - Friday, May 21, 2021 - link
What, and get rid of product lock-in, and have to compete fairly? You must be drunk to even think the industry players would even consider it.
Kangal - Sunday, May 23, 2021 - link
Alright, alright, that made me chuckle.
PS: I know about the Altra Workstations, but they're more of a niche from a niche company. And priced ridiculously, with lots of lock-in options. I wanted something that's reasonable, well-priced, aimed at the mainstream, by a reputable mainstream company (eg ASUS, Gigabyte, MSI, etc etc).
mode_13h - Monday, May 24, 2021 - link
> I wanted something that's reasonable, well-priced, aimed at the mainstream,
> by a reputable mainstream company
Agreed. Even if one of these guys would just take a Qualcomm 8cx and put it in a SFF PC, at least that'd be a good step up from a Pi v4.
mode_13h - Friday, May 21, 2021 - link
> I think we need someone like ASUS to start selling motherboards that take in a new CPU-Socket.
How about Gigabyte?
https://www.gigabyte.com/Industry-Solutions/5G-Dat...
mode_13h - Friday, May 21, 2021 - link
> And I suspect very quickly we would see ... lots of Linux Distros.
Already there.
mode_13h - Friday, May 21, 2021 - link
> They could potentially change out their Server-Connected Tiny Office PCs (eg NUC)
> to be running from a small ARM box as well.
I hoped we'd see Qualcomm 8cx in a SFF desktop enclosure, by now.
mode_13h - Tuesday, May 25, 2021 - link
> I hoped we'd see Qualcomm 8cx in a SFF desktop enclosure, by now.
Wow, almost foretelling the future, here: https://www.anandtech.com/show/16697/qualcomm-show...
ballsystemlord - Monday, May 24, 2021 - link
You're missing one big problem with your theory. Different chips have different amounts of I/O and different pin-outs. Therefore, a universal MB socket would be unattractive and you'd get lots of "supports up-to" statements when purchasing.
ballsystemlord - Monday, May 24, 2021 - link
Here's a project that is trying to do something similar:
ballsystemlord - Monday, May 24, 2021 - link
So, linking to Crowd Supply is obviously spam. Ugh! I've posted a wiki article for you instead. You'll have to do a quick google for the actual EOMA68 project.
ballsystemlord - Monday, May 24, 2021 - link
https://elinux.org/Embedded_Open_Modular_Architect...
mode_13h - Friday, May 21, 2021 - link
If these guys made Altra workstations, why do you presume they won't also make them with Siryn CPUs?
https://store.avantek.co.uk/ampere-altra-64bit-arm...
ikjadoon - Wednesday, May 19, 2021 - link
This is genuinely great to hear. If even Apple, a lifestyle company in California, can commit engineers, resources, and time to custom uarches for consumer-only tech gadgets, why not everyone in the hyperscaling market?
Honestly, though, if your microarchitecture roadmap tells the world it's slowing down its cadence, people believe you. Would you want to tie yourself to a roadmap that all but confirms "don't expect any big leaps" soon?
https://www.arm.com/company/news/2020/10/pushing-t...
Neoverse's "big" numbers ("40% higher ST performance, N2 vs N1") span three generations: it's comparing a 2018 uarch with a 2021 uarch (roughly 12% ST gain per year / generation).
And Makalu + Matterhorn are only promising 14% ST improvement per generation.
But. It’s not easy for Ampere, either. Ask Intel or AMD. It’s not hard to hit a uarch performance drought versus your competitors.
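The compounding behind these roadmap numbers is easy to check. A quick sketch, using only the figures quoted in this thread (Arm's public roadmap claims, not measurements):

```python
# Compound-growth check of the roadmap figures quoted above.
# These are Arm's claimed numbers as cited in the thread, not measurements.

# A 40% ST gain spread over three yearly generations compounds to ~12%/year:
per_gen = 1.40 ** (1 / 3)
print(f"per-generation gain: {(per_gen - 1) * 100:.1f}%")  # ~11.9%

# And the promised 14% per generation compounds to ~30% over two generations:
two_gen = 1.14 ** 2
print(f"two generations at 14%: {(two_gen - 1) * 100:.1f}%")  # ~30.0%
```

So both readings of the roadmap land in the same 12-15%/year band; the headline "40%" is just that band compounded.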
Wilco1 - Wednesday, May 19, 2021 - link
The thing is, that graph is trying to be honest. Nobody else can sustain such large yearly IPC improvements. Moore's law is still holding on, but don't expect large frequency uplifts from new processes either. So everybody is in the same boat.
Your math is off as well - Neoverse N1 is from 2019 and Neoverse V1 is in 2021, so that's 50% uplift in 2 years. N2 is a later, efficiency optimized core with slightly lower performance, but that doesn't change the performance gain of Neoverse V1 - unless you're trying to downplay the significant performance gains made by recent Arm cores...
ikjadoon - Wednesday, May 19, 2021 - link
Nope, the math is perfect. The N1 uarch is based on the 2018 A76. See AnandTech's chart, designed by Andrei himself:
https://www.anandtech.com/show/16640/arm-announces...
The Cortex-A77 was released in 2019 and it has not been used in a single Neoverse CPU.
//
And then after Matterhorn and Makalu? Surely NUVIA, Apple, Qualcomm, Ampere, and everyone else have seen Arm’s stock core roadmap. All of them used Arm cores for a few generations (NUVIA excluded) and went to custom.
You don’t need to beat Arm by 50%. Even a 10% IPC win versus Arm per year adds up.
Arm stock microarchitectures aren’t bad, but they’re not performance leaders. Likewise, It’s not just IPC, but perf-per-watt, too, as Ampere wrote.
Samsung, NVIDIA, Amazon are all hanging on, but I don’t expect any performance leadership from any of their designs.
Wilco1 - Wednesday, May 19, 2021 - link
No, your math is way off. Neoverse N1 was announced in February 2019 with implementations late 2019. Neoverse V1 was just announced with first implementations likely later this year. So that's 50% performance gain in 2 years. It's the same for Cortex-A76 to Cortex-X1. The roadmap shows another 30% in the next 2 years, so that averages to 18% per year over 4 years. That's very hard to beat.
Many companies have tried custom cores but gave up (QC are at their 4th try with Nuvia and kept using Cortex cores alongside custom designs). It's possible to differentiate and get a 10-20% gain but gaining 10% yearly is unlikely. Apple is ahead because they spend many times more area on cores and caches than everybody else and pay a premium for early access to the latest processes. Those are one-off gains, you can't keep doubling your caches every year...
Ampere Altra is proof of performance and efficiency leadership for a stock Arm core. From the article the decision to go custom seems more about specialization for the cloud market. You could cut down the floating point and Neon units if integer performance is all you want.
Wilco1 - Wednesday, May 19, 2021 - link
Btw it's actually possible Arm's latest Neoverse cores are simply too fast. A custom core that is say only 20% faster than Neoverse N1 (but at same area by cutting down FP/Neon and 32-bit support) might well be more suited for certain cloud applications.
mode_13h - Friday, May 21, 2021 - link
> it's actually possible Arm's latest Neoverse cores are simply too fast.
LOL.
> cutting down FP/Neon
They're already losing on FP. And the N1's vector width is only 2x 128-bit, as compared with Zen's 4x 256-bit and Intel's 2x 512-bit. So, it's not like the N1 is burning a ton of die space on it, or has a lot of excess FP performance to spare.
Wilco1 - Friday, May 21, 2021 - link
Ampere is aiming at integer cloud workloads. N1 achieves 86% of the FP performance of Milan with just 2x128-bit pipes, which is likely more than Ampere needs. Since N1 is absolutely tiny compared to Zen 3 or Ice Lake, the 2x128-bit pipes are actually a significant fraction of its die size.
mode_13h - Friday, May 21, 2021 - link
> the 2x128-bit pipes are actually a significant fraction of its die size.
Let's take this as given and look to the N2. It's not increasing its vector pipelines in either width or count (but they report some wins from simply using SVE), in spite of going to 5 nm. So, if they're not increasing the N2's FPU, in spite of the overall core size getting larger, then that means it'll constitute a smaller portion of the N2. That naturally leads to the question of what Siryn stands to gain, by making an even *smaller* FPU? I'd say: not much.
Wilco1 - Saturday, May 22, 2021 - link
N2 adds SVE2 so the vector pipelines are larger and more complex. FP performance is likely 40% higher than N1, so you could cut out a lot if all you wanted is bare-bones FP/SIMD like Ampere's eMAG generation.
Saving 10-15% area directly translates to more cores/cache and lower costs. We are talking about 128+ cores so small savings per core add up.
mode_13h - Sunday, May 23, 2021 - link
> N2 adds SVE2 so the vector pipelines are larger and more complex.
> FP performance is likely 40% higher than N1
Why would you expect the same number & width of pipelines @ the same clock speed to deliver that kind of speedup? According to ARM's own estimates, the N2 delivers only about 20% speedup at about 60th percentile. Median speedup looks to be only about 15%.
https://images.anandtech.com/doci/16640/Neoverse_V...
> Saving 10-15% area directly translates to more cores/cache and lower costs.
It's very unlikely one 128-bit SVE pipeline takes 10% of die area, and you can't have less than 128-bits in ARMv9. I suppose they could simplify the pipeline stages at the expense of latency, but that could really start to hurt code that uses even a little floating-point, which is probably more common in "server workloads" than it used to be.
Wilco1 - Tuesday, May 25, 2021 - link
> Why would you expect the same number & width of pipelines @ the same clock speed to deliver that kind of speedup?
Because the rest of the core is improved. The bottleneck is usually the frontend and caches rather than the FP pipes. For example the Cortex-A77 article shows a 23% speedup over Cortex-A76 on SPECINT_2017 but a larger 28% speedup for FP despite no reported changes to the SIMD pipes: https://images.anandtech.com/doci/14384/CortexA77-...
You can reduce the area significantly, eg. using a single 64-bit FMA pipe that needs 2 cycles for 128-bit SIMD. That would work fine with code that needs a bit of scalar floating point.
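To make the trade-off above concrete, here is a hedged back-of-envelope sketch of peak FP64 throughput under that proposal. The FLOPs-per-cycle model is my own simplification (an FMA counts as two floating-point operations per 64-bit lane), not anything Arm or Ampere has published:

```python
# Hedged model of the cut-down FP unit proposed above.
# Simplifying assumption (mine): peak FLOPs/cycle = pipes * 64-bit lanes * 2,
# since a fused multiply-add counts as two FP operations per lane.

def peak_fp64_flops_per_cycle(pipes: int, bits_per_pipe: int) -> int:
    lanes = bits_per_pipe // 64      # FP64 lanes per pipe
    return pipes * lanes * 2         # FMA = multiply + add

n1_like = peak_fp64_flops_per_cycle(pipes=2, bits_per_pipe=128)   # N1-style 2x128-bit
cut_down = peak_fp64_flops_per_cycle(pipes=1, bits_per_pipe=64)   # single 64-bit pipe,
                                                                  # double-pumped for 128-bit SIMD
print(n1_like, cut_down)  # 8 2: a 4x cut in peak FP throughput
```

In this simple model, the single double-pumped 64-bit pipe gives a quarter of the peak FP throughput for well under half the FP area, which is the sense in which it "would work fine" for mostly-scalar FP code.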
mode_13h - Wednesday, May 26, 2021 - link
> You can reduce the area significantly, eg. using a single 64-bit FMA pipe that
> needs 2 cycles for 128-bit SIMD. That would work fine with code that needs
> a bit of scalar floating point.
Alright. I'll concede this point. Maybe one of the points of differentiation in their cores is less area devoted to FP. I doubt it, but we're into pure speculation, here. I certainly can't say you're wrong.
Let's just hope that whatever they've got in the works is compelling and somehow meaningfully different from what ARM has announced and the other offerings their competitors will have on the market. I do want to see Ampere succeed and live on to mature into a more formidable player.
ikjadoon - Thursday, May 20, 2021 - link
Read it again: Arm releases a new uarch every year. Arm does not release a Neoverse variant every year (or simply cannot, due to internal deficiencies / failures).
uArch improvements are judged per-generation, so it's actually comparable with everyone else.
You can skip Intel or AMD or Apple generations, too, and cherry-pick your way to absurd "40% gen-over-gen improvement" numbers. 👌
Nobody is making 40% per year ST gains. 💀
If you can’t understand that, I’ll let you go. 🙏 Ampere is heavily MT-focused, so their goals are much more likely more cores / better core-to-core topologies / lower power.
//
Apple has been making “one-off” gains for the better part of a decade. It’s clear nobody else was apparently interested enough to even try.
You don’t think Intel wants to make big cores? Is AMD some anti-cache fanatic? Nope. They simply did not feel they were necessary and so they reap what they sow. 🤷♂️
Wilco1 - Thursday, May 20, 2021 - link
Neoverse N1->N2 is 40% and N1->V1 is 50% gain in a single generation. You simply can't deny that N1->V1 is a 50% performance gain in 2 years. Nobody on here claimed that means 40% year on year or 50% every generation; you're just making that up and cherry-picking your numbers.
Apple has been consistently ~2 generations ahead of everybody else. The gap is not increasing, so we are talking about one-off gains. If you slap enough cache around a Cortex-X1 and clock it high on TSMC 5nm, it will reduce the gap significantly. An Arm slide suggests 20-30% gain is easy.
AMD/Intel cores are significantly larger than Neoverse cores. There is no doubt AMD's huge caches help a lot, but they are using ~3 times more silicon. Altra achieves incredible performance using a much smaller silicon budget. Future Arm servers could also move to chiplets and increase caches significantly. So again, that's a one-off gain.
mode_13h - Friday, May 21, 2021 - link
> cherry pick your numbers.
Oof. You're one to talk!
mode_13h - Friday, May 21, 2021 - link
> Ampere Altra is proof of performance and efficiency leadership for a stock Arm core.
N2 is projected to have *worse* efficiency than N1, to say nothing of the V1!
I won't repeat what I posted below, but ARM could be running out of gas, on the efficiency front!
Wilco1 - Friday, May 21, 2021 - link
Biased much? If anything getting 40% IPC gain at pretty much the same efficiency is incredibly impressive. It's normal in the industry for higher IPC cores to lose efficiency. I'm not sure how you can say Arm is running out of gas when on the same process you have 128 Arm cores using less power than 64 cores in Milan...
mode_13h - Friday, May 21, 2021 - link
> Biased much?
I'm just trying to take an honest look at the data. Maybe you should re-read that article's conclusion. It voices similar reservations about the N2's power consumption.
The point is not a minor one. This is a major and disturbing break in ARM's trend of advancing perf/W, and it's in their product line which is supposed to balance efficiency as an equal priority.
> I'm not sure how you can say Arm is running out of gas
...on the efficiency front! You win nothing by such blatant twisting of my words.
Wilco1 - Saturday, May 22, 2021 - link
Is it really honest? N2's perf/W on the same process is just 3.5% lower than N1. That's not a big deal, a "disturbing break in ARM's trend of advancing perf/W" or "ARM could be running out of gas, on the efficiency front".
Now if the IPC gain was just 10% instead of 40% then you might have a point, but maintaining efficiency at such large IPC gains is extremely difficult and unheard of.
mode_13h - Sunday, May 23, 2021 - link
> N2's perf/W on the same process is just 3.5% lower than N1.
Okay, I had thought it was at 5nm, which was used for their other projections. I now see that the efficiency slide says "(ISO process & configuration)". So, if most/all implementations use 5 nm, then it should improve on perf/W, which should keep it competitive.
Well, I learned something I would've missed without this exchange, so thanks.
> maintaining efficiency at such large IPC gains is extremely difficult and unheard of.
Is it? Hasn't Apple shown good efficiency at even greater IPC?
mode_13h - Friday, May 21, 2021 - link
> Neoverse N1 was announced in February 2019 with implementations late 2019.
> Neoverse V1 was just announced with first implementations likely later this year.
> So that's 50% performance gain in 2 years.
Which models and numbers you pick depends on your goal. If you just want to make ARM look as impressive as possible, then I think you're on the right track. However, what you've done is like comparing the machine learning performance of Comet Lake vs Ice Lake-SP. You could tout some insane year-on-year improvement, but that ignores the fact that they're in different product lines and were optimized (and priced) to do different sorts of things. Also, Comet Lake's micro-architecture is like 5 years old, even though it just launched last year.
So, go ahead and bicker over technicalities, like which CPU started shipping in which year. However, if the point is to establish some sort of performance trendline, in order to try and predict what sorts of improvements *future* ARM cores might provide, you're just leading yourself astray.
mode_13h - Friday, May 21, 2021 - link
> that doesn't change the performance gain of Neoverse V1
But it uses a lot more power. So, it's not the typical way we're accustomed to looking at perf gains, where it's at most ISO-power. By ARM's own admission, it's 0.7x to 1x as efficient, which means 1.5x to 2.14x the power!
Also, 1.7x the area, which means > 1.7x the cost (at ISO process).
So, it's really disingenuous to talk about the V1 as an example of uArch gains. Again, your heavy bias is clear for all to see.
And even the N2 doesn't look so great, once the power estimates are taken into account. 1.4x the IPC at 1.45x the power (ISO frequency). It starts to look like ARM has finally hit a wall on efficiency.
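The ISO-frequency arithmetic here is easy to verify from the figures quoted in this thread (Arm's own N2-vs-N1 projections, as reported):

```python
# Checking the ISO-frequency numbers quoted above.
ipc_gain = 1.40      # N2 vs N1 IPC, same frequency (Arm's projection)
power_gain = 1.45    # N2 vs N1 power, ISO frequency & process (as quoted)

perf_per_watt = ipc_gain / power_gain
print(f"N2 perf/W relative to N1: {perf_per_watt:.3f}")  # ~0.966, i.e. ~3.5% lower
```

So the "1.4x IPC at 1.45x power" figure and the "3.5% lower perf/W" figure discussed later in this thread are the same claim stated two ways.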
Wilco1 - Friday, May 21, 2021 - link
No, in fact larger, faster cores are less efficient (just like large heavy cars are less efficient than small cars). Small in-order cores like Cortex-A55 are the most efficient.
So higher IPC typically comes at a cost in area and power. V1 is still smaller and uses less power than the latest x86 cores. So I'm not sure what your point is?
And it is not disingenuous in any way to mention the fact that Neoverse V1 has 50% higher IPC. It's simply Arm's fastest core, you can't argue with that achievement. Should we ignore Milan because it is larger and less efficient than Rome? Milan is still a great achievement. Or would you argue it is not?
mode_13h - Friday, May 21, 2021 - link
> larger, faster cores are less efficient (just like large heavy cars are less efficient than small cars).
Yes.
> V1 is still smaller and use less power than the latest x86 cores.
But you didn't compare it to an x86 core. You compared it to the N1, which is like comparing a sports car to a previous year's family sedan, if we use your automotive analogy.
> it is not disingenuous in any way to mention the fact that Neoverse V1 has 50% higher IPC.
While it's an accurate repetition of a credible claim (i.e. something we can treat as a fact), it's what you're implying by it that makes it disingenuous. It's comparing cores in 2 different product lines, with different optimization targets. That's why it's called V1 and not N2. It's not a like-for-like comparison, which makes it difficult to infer much of anything from it. All it really tells us is how much faster ARM can make a server core on 7 nm, if they don't care as much about power or area.
Wilco1 - Saturday, May 22, 2021 - link
No. Both Neoverse N2 and V1 are not only successors of N1 but their microarchitectures are related to N1. So it is completely reasonable to compare N1 with V1. And the timeframe and performance gains are matching Cortex-A76 to Cortex-X1.
> All it really tells us is how much faster ARM can make a server core on 7 nm, if they don't care as much about power or area.
And that refutes the original claim that Arm has run out of steam in terms of IPC gains. Indeed, V1 uses more power and loses efficiency but that's the cost of pushing IPC hard. Hence the split in product lines, as different licensees likely want different PPA targets.
mode_13h - Sunday, May 23, 2021 - link
> Both Neoverse N2 and V1 are not only successors of N1 but their microarchitectures are related to N1.
Only in the most roundabout of ways. Andrei's graphic shows the relationship:
https://images.anandtech.com/doci/16640/siblings.j...
> So it is completely reasonable to compare N1 with V1.
No, that's nonsense. The V1 is 33% bigger than the N2 and 70% bigger than the N1, and burns up to 2.14x as much power. A lot of its performance comes from the same places as the X1 vs. A78, as well as SVE. So, it makes as much sense as comparing a car with an 8-cylinder engine to a cheaper & more fuel-efficient 4-cylinder car of the previous model year, from the same manufacturer. One can certainly make such a comparison, but it's unclear what practical relevance it would have.
If you want to look at microarchitectural efficiency improvements, then you'd best focus on an apples-to-apples comparison, like A76 -> A78 or N1 -> N2.
> that refutes the original claim that Arm has ran out of steam in terms of IPC gains.
It would, if anyone had made such a claim. What I said was: "It starts to look like ARM has finally hit a wall on efficiency."
As that was before I noticed the 1.45x power figure was at ISO process with the N1, I'll walk that back a couple steps.
ikjadoon - Wednesday, May 19, 2021 - link
I think this move genuinely justifies Arm's business model: make the ISA—not your architectures—the product.
Startup vendors can…
1. License stock core IP so anyone with money & some silicon expertise can whip up chips. Start making money quickly and build a reputation (and customer list).
2. Then, if vendors feel confident and have the money & expertise, let us license the ISA alone. Keep all our current customers, sell them something even better and customized to their needs, and put us in near-total control.
It's like franchise mode on steroids: imagine if owning a Subway restaurant franchise let you "upgrade" to an international logistics contract. Imagine the innovation markets it'd create.
If only Intel understood that 25+ years ago. Today, we get vague IDM “2.0” promises to allow some vague, non-committal licensing options in the unspecified future (and absolutely not the ISA because Intel shamelessly believes it does x86 the best).
Blastdoor - Wednesday, May 19, 2021 - link
Building a business based on Intel's alleged commitment to licensing sounds about as safe a bet as the companies that tried to license MacOS back in the 90s. Intel now, like Apple then, is being forced to adopt a model they really don't believe in. It's just not who they are. Apple ultimately returned to its true self and that has clearly worked out well for them. I have no idea if Intel can pull off something similar.
Silver5urfer - Wednesday, May 19, 2021 - link
So Omega gg? lol
We will see what Sapphire Rapids and Genoa will have. If AMD increases the cores past 64 to near 80-90s then AMD will be on the moon.
mode_13h - Friday, May 21, 2021 - link
> if AMD increases the cores past 64 to near 80-90s then AMD will be on the moon.
Depends if their interconnect can continue to scale. Milan is really getting hurt by it, but some of that is due to the old process node of their IO die. Still, even on a newer node, more cores *and* higher speeds could continue to take a big bite out of their power budget. Meshes scale best.
Linustechtips12#6900xt - Wednesday, May 19, 2021 - link
So I wasn't the only one who had a tear of joy that AnandTech was going back to GPU reviews when I read "Ampere"?
Unashamed_unoriginal_username_x86 - Wednesday, May 19, 2021 - link
I think you were; a high profile server competitor announcing a (largely) unexpected turn is more interesting than a 7/8 month late review
mode_13h - Friday, May 21, 2021 - link
> 7/8 month late review
If I'm reading the review for a deep dive look at the architecture, then it doesn't matter how long the product has been out. The only thing that makes it less relevant is if a newer architecture is written up.
Matthias B V - Wednesday, May 19, 2021 - link
Nvidia should have just bought Ampere, Nuvia or Cavium (back when those were available) instead of ARM.
Would have been cheaper and less risk. Now they pay a lot, have to wait long and face risk in both outcomes:
- They can't buy ARM and end with empty hands. Developing their own but need lots of talent...
- They can buy it and it kills ARM foundation in the long run as customers don't like it and move to RISC-V by next decade.
They could have even bought one of those ARM designers and on top try to work with VIA or IBM on an x86 license. I think they don't need it and it makes minimal sense, but it would still be less risk and cheaper having both than buying ARM directly.
Rudde - Wednesday, May 19, 2021 - link
I don't see any point in getting an x86 license. The ARM ISA is better structured than x86. Namely, it doesn't rely as much on predicting branches.
Spunjji - Friday, May 21, 2021 - link
Nvidia did, once upon a time. They tried pretty hard to get it - that was what the project that ended up in the Denver cores originally started out as.lightningz71 - Wednesday, May 19, 2021 - link
As we have seen with Apple's A series and now the M1 chips, rolling your own ARM cores can give you a LOT of performance uplift over ARM's standardized designs. If they want to be competitive on the top end with a compelling enough product to attract hyperscalers, they HAVE to have a world beating product. They weren't going to do that with an ARM default design.
back2future - Wednesday, May 19, 2021 - link
How often, and for how many years, will Apple show this kind of advance before all other competitors, and be recognized as a real advance in everyday usage patterns by customers (without comparing prices and/or software availability or compatibility)? Thunderbolt, for example, is an Intel engineering product.
Or it can be a flop after years of investment, like Samsung's in-house cores, or a neat niche processor that doesn't justify the development costs, like Marvell's SMT ARM cores.
But you might not be wrong either. The V1 appears to be no slouch, but can it ever be massaged into something like Apple's cores?
lightningz71 - Wednesday, May 19, 2021 - link
I don't really consider Samsung's efforts with Exynos to really be comparable here. At best, they seemed to be largely tweaks of the basic ARM IP, with design decisions that were tilted towards what they considered important for their use cases. It also appears to be the case that they had substandard communication between their CPU architects and their scheduler designers, as some benches seemed to indicate.
I also still believe that the relative performance of the Apple A series cores is HEAVILY influenced by their total top down control of iOS as well. Both are fully optimized for each other in ways that no other architecture pairing can fully claim.
brucethemoose - Wednesday, May 19, 2021 - link
We'll find out once and for all when the M1 is up and running on Linux.
AdrianBc - Thursday, May 20, 2021 - link
Apple's cores have higher IPC, thus higher single-thread performance, than ARM's cores.
However, when made in the same TSMC process and when limited to the same power consumption, their performance is the same as or worse than that of ARM's cores.
For multi-threaded applications, which are the most important for servers, the power-limited performance is what counts. From the SPEC benchmarks published here at Anandtech I have not seen any advantage for Apple. The total performance was higher for Apple, but it was accompanied by a proportionally higher power consumption.
Apple's cores are without doubt better in a personal computer, where single-thread performance matters a lot, but until now there is no data showing that they would be better for a server CPU.
serendip - Thursday, May 20, 2021 - link
Server loads are all about perf/watt and perf/$, which ARM N1 designs excel at. Now I wonder how the latest Altra would compare to the latest Graviton.
back2future - Thursday, May 20, 2021 - link
Maybe also interesting: what inter-SoC connections would allow multi-vendor mainframes, optimized for routing data to wherever its processing profile has advantages (CPU cache pipeline, hardware logic or ?PUs, on-site memory), which might support each vendor's emphasis. What to expect on the ARM side comparable to point-to-point processor interconnects like HTX (~3.1), or IF as a networking workaround (advanced capability with additional hardware complexity), instead of direct "QPI"-likes?
mode_13h - Friday, May 21, 2021 - link
> What to expect on ARM side comparable to point-to-point processor interconnect ...
https://www.anandtech.com/show/16640/arm-announces...
back2future - Friday, May 21, 2021 - link
sorry, i didn't get the point considering multi socket SoC boards or backplane connectors towards mixed vendor data processing configurations (although interesting, even the price profile for ThunderX/ThunderX2, considering Marvell's cancelling of ThunderX3 'in favor of vertical markets and the hyperscaler server market'_Networkworld ?)
back2future - Friday, May 21, 2021 - link
(did follow wrong link in open tabs, https://store.avantek.co.uk/avantek-56-core-cavium... I see CCIX compared to CXL https://semiengineering.com/choosing-the-appropria... How about in-memory processing (IMP) or retimer latencies (CXL only?) or symmetric balancing without (too much) protocol overhead on real products? Everything's ahead of x64?)
back2future - Friday, May 21, 2021 - link
(or traditionally more often called PIM, processing-in-memory :) btw, https://arxiv.org/pdf/1802.00320.pdf )
mode_13h - Friday, May 21, 2021 - link
> Server loads are all about perf/watt and perf/$ which ARM N1 designs excel at.
Then I guess the N2 has already failed. ARM claims 3.5% lower perf/W than N1. It does achieve 7.7% better PPA, so maybe perf/$ is slightly improved.
Wilco1 - Saturday, May 22, 2021 - link
Don't forget you get 40% higher performance at the same frequency. If you reduce frequency by 10% N2 will be ~20% more efficient than N1 and still 30% faster. So a higher IPC at the same efficiency allows you to be far more efficient if required.
webdoctors - Wednesday, May 19, 2021 - link
Competition is great, let's see what awesome high performance server chips come out going forward.eastcoast_pete - Wednesday, May 19, 2021 - link
I think Ampere sees the writing on the wall. The problem with using standard ARM designs going forward is that it'll be a race to the (pricing) bottom, basically who can deliver the most cores for the buck. Ampere and the other independent ARM server CPU design houses cannot really differentiate on manufacturing; there are only two choices, one of them not as good - Samsung - so all want TSMC.
That leaves a better architecture, custom made, to differentiate from the rest of the growing pack. I think Ampere tries to pull off what Apple was so successful at doing for mobile and now laptop/desktop use: create a better uarch than the one ARM has to offer. But, it's a risky move indeed.
And, entirely without sources, here a thought: Maybe they found a sugar daddy to finance that move. I am thinking about a certain company headquartered in Redmond, WA. Would fit with their most recent announcements of customer wins.
name99 - Wednesday, May 19, 2021 - link
The obvious choice is QC acquires Ampere.
The world is just not big enough to support this many ongoing CPU designs. You need a huge market to amortize your costs, and you have to keep doing it EVERY YEAR. You need three or four teams running in parallel, each at different stages of the next designs. No way Ampere can maintain that; they can create their one core which may even be good -- but which won't be updated *substantially* till four years later... Meanwhile Apple is delivering annual 20..30% increases, and even ARM Ltd hopes to provide 15% or so annually.
Nuvia probably saw that reality from the start, were mainly treading water for a year trying to execute on the "be acquired before we even ship a product" part of the plan.
Apple, ARM (for hyperscalers and random use cases that just want a core, any core) and QC (for mobile, desktop, "servers", and various non-hyperscaler warehouses) is probably all that's economically feasible.
EthiaW - Wednesday, May 19, 2021 - link
Perhaps acquiring Ampere would be a nice choice for Nvidia; they will then have a complete portfolio for server processors. The ARM deal doesn't seem to be making progress anyway.
grant3 - Thursday, May 20, 2021 - link
Apparently Nvidia wants to be the guy who collects the rent, not the one who pays the rent.
I do not see how buying Ampere fulfills whatever their strategic ambitions are with a purchase of ARM.
EthiaW - Thursday, May 20, 2021 - link
ARM has a large bulk of revenue (as well as a large bulk of market cap) coming from little IoT processors, while Nvidia has ambition only for big fat chips no smaller than phone SoCs. Frankly speaking, Nvidia only needs a high performance CPU to finish its server portfolio. Acquiring ARM is overkill, financially, technically and politically.
> Frankly speaking Nvidia only needs high performance CPU to finish its server portfolio.
No, they're making a huge push into self-driving cars and robots. They have a whole line of SoCs for that, which have been ARM-based since the time they tried to sell them into the phone and tablet market.
mode_13h - Friday, May 21, 2021 - link
And their latest embedded ARM cores are reportedly not very competitive even with ARM's own Cortex offerings.
ksec - Wednesday, May 19, 2021 - link
The N2 is supposed to be 40%+, which is a huge jump in terms of IPC. I would expect Ampere's custom core to at least outperform the N2, otherwise why make a custom core? But then the N2 is actually a pretty decent core at only ~2W.
mode_13h - Friday, May 21, 2021 - link
> expect Ampere's custom core to at least outperform the N2, otherwise why make a custom core?

Better perf/W? Better performance on specialized workloads?
EthiaW - Wednesday, May 19, 2021 - link
Until now, EVERY single attempt at a customized ARM microarchitecture has failed miserably, except Apple's. Good luck, Ampere.
brucethemoose - Wednesday, May 19, 2021 - link
Fujitsu's cores (based on their SPARC designs, I think?) benched well, though they don't seem to be commercially successful.
eastcoast_pete - Thursday, May 20, 2021 - link
If memory serves, Fujitsu actually developed their new cores in collaboration with ARM because SPARC had pretty much run its course. In other words, I doubt that Fugaku has much SPARC DNA in it. That ARM-Fujitsu collaboration also resulted in the addition of SVE to large ARM-based server cores; while I don't have an inside track, SVE seems to be something Fujitsu contributed heavily to.
name99 - Friday, May 21, 2021 - link
The genesis of SVE remains murky, but I doubt that Fujitsu had much to do with it. If you look at Apple patents, there's a large body of patents with the name of Jeffry Gonion on them, surrounding the "macroscalar" architecture. It's hard to be sure quite what this was supposed to be, but if you squint at it you can see various elements of SVE there - certainly predication, and the idea of "indefinite length" vectors.
I could imagine something like Apple talking to ARM around 2010, ARM saying "we have these ideas for the successor to NEON", Apple saying "we have these ideas for how to improve vectors generally", and a synthesis coming from those two pools.
But again, honestly, I don't know.
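The "indefinite length" vector idea described above is what SVE ended up calling vector-length-agnostic code: the loop is written against a vector length discovered at runtime, and a per-lane predicate masks off the tail instead of a scalar remainder loop. A plain-C model of that control flow (the fixed VL of 4 here is a stand-in for the hardware vector length; this is deliberately not real SVE intrinsics):

```c
#include <stddef.h>

/* Scalar sketch of an SVE-style predicated loop: process VL lanes per
 * iteration, and use a per-lane "active" predicate to mask lanes that
 * fall past the end of the array, instead of a separate tail loop. */
void add_arrays(const float *a, const float *b, float *out, size_t n) {
    const size_t VL = 4;                  /* stand-in for the runtime VL */
    for (size_t i = 0; i < n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++) {
            int active = (i + lane) < n;  /* the predicate: lane in bounds? */
            if (active)
                out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

With actual SVE, `svcntw()` reports the hardware vector length and `svwhilelt_b32()` builds the tail predicate, so the same binary runs correctly at any vector width - which is the "indefinite length" property.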
mode_13h - Friday, May 21, 2021 - link
> Fujitsu actually developed their new cores in collaboration with ARM

Their Hot Chips 2018 presentation actually makes it sound like Fujitsu collaborated with ARM to develop SVE itself!
"Fujitsu, as a lead partner, collaborated closely with Arm on the development of SVE"
"Our own microarchitecture maximizes the capability of SVE"
Nowhere does it say that ARM helped them design their chip.
> I doubt that Fugaku has much SPARC DNA in it.
Their Hot Chips 2018 presentation directly contradicts that.
"A64FX inherits DNA from Fujitsu technologies used in the mainframes, UNIX and HPC servers"
> That ARM-Fujitsu collaboration also resulted in the addition of SVE to large ARM-based server cores
Yeah, Fujitsu's. And soon, Neoverse V1.
> while I don't have an inside track, SVE seems to be something Fujitsu contributed heavily to
Well, that's what they claimed.
https://www.anandtech.com/show/13258/hot-chips-201...
mode_13h - Friday, May 21, 2021 - link
> they don't seem to be commercially successful.

According to whom? Granted, their perf/$ is bad, but I'm sure that's as much due to HPE as anything.
Anyway, I think the commercial play was probably just to recoup some of the development costs and get SVE-capable hardware in more people's hands (i.e. to benefit ecosystem support for SVE). I think the Japanese made it mostly to be self-reliant for their HPC needs, rather than for competitive reasons.
mode_13h - Friday, May 21, 2021 - link
Also, Nuvia must have something good (or else it's hard to see why Qualcomm would buy them).