Zen with 3D cache vs. Sapphire Rapids with HBM should be interesting.
Early returns from Alder Lake (same Golden Cove core as Sapphire Rapids) suggest Intel will have a roughly 25% IPC lead, but AMD will retain a large power-consumption advantage - it is hard to project how much of this IPC lead Intel will need to give back to keep reasonable thermals. HBM should outperform 3D cache by a good margin, but given what we (think) we know about yields, AMD may still have a lead in core counts...
I wonder how much of this is customers simply buying into the performance story of AVX-512 and purchasing the promise vs. actually having AVX-512 workloads where it proves worthwhile.
One often-critical reason to buy Intel is availability / lead time. Another is that some software is not supported on AMD, or you need to pay additional HW validation bills and subject yourself to the intricate details of tuning your network stack / firmware / network cards and whatnot when you run slightly more specialized software.
"I wonder how much of this is customers simply buying into the performance story of AVX-512 and purchasing the promise vs. actually having AVX-512 workloads where it proves worthwhile."
Very little: in this realm you don't buy hardware based on specs and review benchmarks, you have samples in house running your ACTUAL workload to determine real performance before rolling it out at scale.
I know that from experience. I spent two months testing 3 different racks of 20 units (two major brands and a "no brand" OEM) to advise a large purchasing customer on which ones to select. The final advice is not reduced to a Yes or No, but rather the pros and cons of each system. The best benchmark results are not the only factor; software compatibility, manuals, professional services, training programs, existing relations with the vendor, other support, and pricing (the final deal) count a lot too. In the end, using a comprehensive checklist, you may select not the fastest system, or even a system that has bugs (but with workarounds), due to a significant difference in pricing or other arrangements. Utterly different from the DIY market.

Analogy: if you're going to buy 100 x 18-wheel trucks, you're going to spend quite some time evaluating the possible candidates, and it will take time. Utterly different from going to a car dealer and buying a car on the spot.
Forgot to mention: the testing/benchmarking is done with the software that will run on the systems once acquired, not the synthetic/generic benchmarks which you might use at the beginning to validate the systems. 90+% of the testing is done with production apps and load scripts to feed them.
Right, so it sounds like a stretch for 70% of sales to hinge on performance, much less specifically on AVX-512.
I guess where my skepticism comes from is in cloud scenarios where there's not just a single app the customer intends to run. In such cases, I don't know how you can say that AVX-512 made the deal, unless they use it as a generic answer to reservations the customer might raise about performance.
Also, I just can't see the case for many customers to even care about vector arithmetic. If you're running a web server, CI (Continuous Integration), or many databases, it's integer performance you're likely to care about. 70% of the workloads running on cloud & enterprise servers aren't even ones that benefit from AVX-512!
Well, I'd say it is possible that 70% of their sales wins rely on this one AVX-512 feature - but if you are right and it is only interesting to a minority of customers, that simply means they are not selling a lot... And indeed, if this argument goes away with AMD's next gen, that's even worse news.
All of their BIG server processors since 2017 support AVX-512. And while the Datacenter Group's revenues are down, it still delivers the second-biggest profits among Intel's business units.
@mode_13h - that's a very good point. Maybe it's one of those things where the buyer would have selected Intel anyway for ancillary reasons, but are offering AVX-512 as a rationalisation when given a limited range of options for providing feedback (which would imply it's the thing the sales guys are selling hardest).
Sapphire Rapids has AMX, which is a little like AVX-8192. I think it'll be a big win for a few very specific cases, but less generally useful than AVX-512.
I'm curious why you say AMX is like AVX-8192. My understanding is that AMX is essentially a configurable fused multiply-add accelerator, with the added bonus of some configuration registers. However, I'm not an AI guy so I welcome corrections.
Because that's how big the registers are. 1 kB each (there are 8 of them, BTW). As for the configurable part, it's true that operations don't have to use the entire register.
I'm not saying it *is* AVX-8192. Just that you could sort of look at it that way. The point was only to tie it into the lineage of what came before. For anything beyond that, you'll want to dig into the specifics and understand it for what it *is*.
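To make the tile idea concrete, here's a rough scalar sketch (pure Python, not real AMX intrinsics) of the kind of multiply-accumulate AMX's int8 TDPBSSD instruction performs, using the maximum tile geometry implied by the sizes above (16 rows x 64 bytes = 1 kB per tile). The helper name and loop structure are illustrative only:

```python
# Scalar emulation of an AMX-style tile multiply-accumulate (TDPBSSD-like):
# C (16x16 int32) += A (16x64 int8) x B (16x64 int8, packed in 4-byte groups).
# Each tile is at most 16 rows x 64 bytes = 1024 bytes, matching the 1 kB
# tile registers discussed above. Names and shapes here are illustrative only.

ROWS, ROW_BYTES, GROUP = 16, 64, 4   # max tile geometry; 4 bytes per dot group
COLS = ROW_BYTES // GROUP            # 16 int32 accumulators per row

def tile_dpbssd(C, A, B):
    """C[m][n] += sum over k, g of A[m][4k+g] * B[k][4n+g] (int8 -> int32)."""
    for m in range(ROWS):
        for n in range(COLS):
            acc = C[m][n]
            for k in range(ROWS):
                for g in range(GROUP):
                    acc += A[m][GROUP * k + g] * B[k][GROUP * n + g]
            C[m][n] = acc
    return C

# Tiny smoke test: A and B all ones -> every C element is 16 * 4 = 64.
A = [[1] * ROW_BYTES for _ in range(ROWS)]
B = [[1] * ROW_BYTES for _ in range(ROWS)]
C = [[0] * COLS for _ in range(ROWS)]
tile_dpbssd(C, A, B)
print(C[0][0])  # 64
```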
Ian, the AVX 3DPM benchmark concerns me. Given the grossly asymmetric optimization of the AVX-512 path vs. the AVX2 path, I don't think it's a good performance characterization for AVX2 vs. AVX-512 CPUs.
If the AVX2 path could be optimized to a similar degree, then I think it would make sense to use it in that way. Unless/until that happens, I think you should only use it to compare like-for-like CPUs (i.e. AVX2 vs AVX2; AVX-512 vs AVX-512).
On a related note, please post the source somewhere like github, so that we actually see what it's measuring and potentially have a go at optimizing the AVX2 path, ourselves.
He should just put it up on github and see what people can do with it. Plus, somebody might optimize it for ARM, too. He's already shared it with Intel and AMD, so what's the big deal?
I don't think there's anything particularly wrong with that. It may be disproportionate to other benchmarks, but if all benchmarks scaled the same, there'd be no point in having more than one at all. It's a real-world workload (custom in-house programs are perhaps the most real-world workloads there are), and it does demonstrate that some programs really benefit from AVX-512.
Realistically, I don't think it should've been shared with Intel and AMD (it would've arguably been better if it were "pristine"), but given that that has been done, I'd agree there's no point to not making it public any longer. That being said, I'm not sure the point should be to microoptimize it to the ends of the world, or it wouldn't be a realistic workload any longer.
Except it's not. It started out that way, but then he gave it to Intel to optimize the AVX-512 path. So, the AVX-512 path is optimized by "a world expert, according to Jim Keller" (to paraphrase Ian). And yet, the AVX-512 results are put up against the AVX2 results, on AMD CPUs, as if they're both optimized to the same degree and that just happens to be the *actual* difference in performance.
As an excuse for this, Ian points out that he gave AMD the same opportunity, but they haven't taken him up on it. Well, that still doesn't make it a fair representation of AVX2 vs. AVX-512 performance.
> I'm not sure the point should be to microoptimize it to the ends of the world,
> or it wouldn't be a realistic workload any longer.
A lot of workloads are heavily optimized. This includes kernels in HPC programs, many games, and the most popular video compression engines. Probably a lot of the stuff in SPEC has been optimized to a high degree. And let's not even start on AI frameworks.
All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
Plus, there's my point about optimizing it for ARM NEON and SVE, so it could be used in a somewhat apples-to-apples comparison with ARM processors.
I agree it's unfair. On the "non-AVX" test, the Ryzens go to the top. On one hand, the test shows how much faster an AVX512 processor can be. On the other hand, it's unfair that some are running the AVX2 path and some the AVX512, and the results are put together. (Reminiscent of the Athlon XP's SSE not being used in some benchmarks.)
Other workloads I can't speak to, but in something like HEVC encoding, the gains from these instructions aren't all that much. It leads me to feel the 3DPM test is gaining disproportionately from AVX512, in a narrow sort of way, and that's being magnified. The result says, "Look at how fast these AVX512 CPUs are, leaving their AMD counterparts in the dust."
> it's unfair that some are running the AVX2 path and some the AVX512,
> and the results are put together.
That's a reasonable position, but I'm not even going that far. I'm okay with putting up AVX2 against AVX-512, but I think they need to be optimized somewhat comparably. That way, the difference you see only shows the true difference in hardware capability, and not also the (unknown) difference in the level of code optimization.
> "Look at how fast these AVX512 CPUs are, leaving their AMD counterparts in the dust."
It does have a few specialized instructions that have no AVX2 counterpart. And if you're doing something they were specifically designed to accelerate, then you can get a legit order of magnitude speedup. And it's not impossible 3DPM hit one of those cases. But, in order to know, Ian really needs to disclose the code.
We don't know, so don't presume. There are some obvious things you can get wrong that sabotage performance. Cache thrashing, pointer aliasing, and false sharing, just to name a few. Probably a lot of the speedup, in the AVX-512 case, was fixing just such things.
@GeoffreyA - I would argue that it wouldn't necessarily be unbalanced if the benchmark benefits particularly heavily from AVX-512, simply because there are going to be workloads like that out there, and the people who have them are probably going to be aware of that to some extent.
With comparable optimisation between the AVX2 and AVX-512 code paths, it could still be a helpful example of a best-case for the feature, for those few people for whom it's going to work out like that.
For everyone else, we could definitely do with more generalised real-world examples (like x264) where the AVX-512 part of the workload isn't necessarily dominant.
And for best AVX2 vs. best AVX512, I think we probably need some bigger test, something like encoding, I would think. I could be wrong, but I remember reading that x264 had AVX512 support. I doubt whether it's been optimised to the fullest, though. And most of the critical work on x264 was done a long time ago.
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.
AVX-512, as an instruction set, was a huge leap forward compared to AVX/AVX2. So much so that Intel created the AVX-512VL extension that allows one to use AVX-512 instructions on vectors smaller than 512 bits. As a vector programmer, here are the things I like about AVX-512:
1) Dedicated mask registers and every instruction can take an optional mask for zeroing/merging results
2) AVX-512 instructions can broadcast from memory without requiring a separate instruction.
3) More registers (not just wider)
Also, and this is kind of hard to explain, but AVX/AVX2 as an instruction set is really annoying because it acts like two SSE units. So for example, you can't permute (or "shuffle" in Intel parlance) the contents of an AVX2 register as a whole. You can only permute the two 128-bit halves as if they were two SSE registers fused together. AVX-512 doesn't repeat this half-assed design approach.
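To make point 1 concrete for non-vector folks, here's a scalar Python sketch of what the zeroing vs. merging mask behaviour does per lane (this emulates the semantics; it is not real intrinsics code, and the function name is made up):

```python
# Emulate a masked vector add, the way AVX-512 applies a mask register:
# with zeroing {z}, masked-off lanes become 0; with merging, masked-off
# lanes keep the destination's old value. On AVX2 you would need extra
# blend/and instructions to get the same effect.

def masked_add(dst, a, b, mask, zeroing):
    out = []
    for i, keep in enumerate(mask):
        if keep:
            out.append(a[i] + b[i])   # lane selected by mask: compute
        elif zeroing:
            out.append(0)             # {z}: zero the masked-off lane
        else:
            out.append(dst[i])        # merging: preserve old dst lane
    return out

dst  = [9, 9, 9, 9]
a, b = [1, 2, 3, 4], [10, 20, 30, 40]
mask = [1, 0, 1, 0]
print(masked_add(dst, a, b, mask, zeroing=True))   # [11, 0, 33, 0]
print(masked_add(dst, a, b, mask, zeroing=False))  # [11, 9, 33, 9]
```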
> 1) Dedicated mask registers and every instruction can take an optional
> mask for zeroing/merging results
This seems like the only major win. The rest are just chipping at the margins.
More registers is a win for cases like fitting a larger convolution kernel or matrix row/column in registers, but I think it's really the GP registers that are under the most pressure.
AVX-512 is not without its downsides, which have been well-documented.
@Elstar - Interesting info. Just makes me more curious as to how many of these things might be benefiting the 3DPM workload specifically. Another good reason for more people to get eyes on the code!
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.

I don't remember where it was posted any longer (it was in the comment section of some article over a year ago), but apparently 3DPM makes heavy use of wide (I don't recall exactly how wide) integer multiplications, which were made available in vectorized form in AVX-512.
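That would fit: AVX-512DQ added a packed 64-bit multiply (vpmullq) with no AVX2 equivalent, so an AVX2 path has to synthesize each 64-bit lane product out of 32x32-bit multiplies. Roughly, in scalar Python terms (the helper name is ours):

```python
# How a 64x64 -> low-64-bit multiply gets synthesized from 32-bit
# multiplies: what an AVX2 code path must do per lane with shifts,
# masks, and vpmuludq-style 32x32 -> 64 products.

M32 = (1 << 32) - 1
M64 = (1 << 64) - 1

def mul64_from_mul32(a, b):
    a_lo, a_hi = a & M32, a >> 32
    b_lo, b_hi = b & M32, b >> 32
    lo = a_lo * b_lo                     # 32x32 -> 64
    cross = a_lo * b_hi + a_hi * b_lo    # two more 32x32 products
    # a_hi * b_hi contributes only above bit 63, so it drops out mod 2^64
    return (lo + ((cross & M32) << 32)) & M64

# Matches a native 64-bit multiply (modulo 2^64):
x, y = 0xDEADBEEFCAFEBABE, 0x123456789ABCDEF1
print(mul64_from_mul32(x, y) == (x * y) & M64)  # True
```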
Performance optimization is converged upon from two different directions: 1) the code users run to perform a task, and 2) the compute hardware upon which the code is intended to run.

As an Intel engineer, for some time I was in a performance evaluation group. We ran many thousands of simulations of all kinds to quantify the performance of our processor and chipset designs before they ever went to silicon. This was in addition to our standard pre-silicon validation. Pre-silicon performance validation was to demonstrate that the expected performance was being delivered. You may rest assured that every major silicon architectural revision or addition to the silicon and power consumption was justified by demonstrated performance improvements.

Once the hardware is optimized, the coders dive into optimizing their code to take best advantage of the improved hardware. It is sort of like a "double-bounded successive approximation" toward a higher performance target from both the HW and SW directions. No surprise that benchmarks are optimized to the latest and highest-performing hardware.
> You may rest assured that every major silicon architectural revision
> or addition to the silicon and power consumption was justified
> by demonstrated performance improvements.
Well, it looks like you folks failed on AVX-512 -- at least, in Skylake/Cascade Lake:
I experienced this firsthand, when we had performance problems with Intel's own OpenVINO framework. When we reported this to Intel, they confirmed that performance would be improved by disabling AVX-512. We applied *their* patch, effectively reverting it to AVX2, and our performance improved substantially.
I know AVX-512 helps in some cases, but it's demonstrably false to suggest that AVX-512 is *only* an improvement.
However, that was never the point in contention. The question was: how well does 3DPM perform with an AVX2 code path that's optimized to the same degree as the AVX-512 path? I fully expect AVX-512 would still be faster, but probably more in line with what we've seen in other benchmarks. I'd guess probably less than 2x.
Are you referring to blade servers? But they don't have the ability to host PCIe cards or a dozen SSDs like this thing does. I'm also not sure how their power budget compares, nor how much RAM they can have.
Anyway, if all you needed was naked CPU power, without storage or peripherals, then I think OCP has some solutions for even higher density. However, not everyone is just looking to scale massive amounts of raw compute.
The conclusions in the article are confusing. I'm seeing the Supermicro with similar or superior per-thread performance on every workstation load tested, excluding Photoscan (which slow cores suck at). On SPECint and SPECfp, the 6330 can only match Zen 2. It's whooped by Zen 3, but the Xeon is so much cheaper that it still wins the price-performance ratio. If you happen to have a use for AVX-512, that's a big Intel win, too.

So overall, Intel wins price/performance on every measure. It's a clear "second place" on the very important SPEC measures. It seems like a highly competitive option, but of course, watts are crucial! Availability and volume discounts are hard to measure, but I wish some better power-consumption data had been presented.

Two 28-core Intel 6330 chips (56 cores total) cost a good chunk less than a single AMD 75F3. That's great, but how much more heat can the Intel be expected to generate? These servers could make great VM hosts, or they could be money pits - I want to understand efficiency better.
Huh? It also lost on LLVM compile, Blender, SPECint, and Corona.
> similar or superior per-thread performance on every workstation load tested
This is basically a nonsense metric for workstations. Per-thread performance really matters mainly for cloud workloads. On a workstation, you can just use the bare hardware, which means you can use *all* of the threads it provides.
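Whichever way the benchmark arguments fall, the price/performance part of the argument is easy to sanity-check once real numbers are plugged in. The figures below are illustrative placeholders only (not values from the review or any price list); the point is just the arithmetic:

```python
# Sanity-check price/performance: perf per CPU dollar for a 2-socket
# Xeon Gold 6330 box vs. a 1-socket EPYC 75F3 box. ALL prices and
# scores here are illustrative placeholders - substitute real list
# prices and the review's own SPEC results to use this for real.

systems = {
    "2x Xeon Gold 6330": {"cpu_cost": 2 * 1900, "spec_rate": 500},
    "1x EPYC 75F3":      {"cpu_cost": 1 * 4900, "spec_rate": 400},
}

perf_per_dollar = {
    name: s["spec_rate"] / s["cpu_cost"] for name, s in systems.items()
}
for name, ppd in perf_per_dollar.items():
    print(f"{name}: {ppd:.4f} SPECrate points per CPU dollar")
```

With these placeholder numbers the cheaper dual-Xeon setup comes out ahead on perf-per-dollar even while losing on absolute performance, which is the shape of the argument being made above.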
hetzbh - Thursday, July 22, 2021 - link
So ... 70% of the sales wins are due to AVX-512? Better hope that Intel finds another strategy, as AMD EPYC Genoa adds AVX-512 support as well.

Kamen Rider Blade - Thursday, July 22, 2021 - link
I concur. Don't forget that they're getting stacked L3 3D V-Cache.

The Vorlon - Saturday, July 24, 2021 - link
Fun times ahead!

Jorgp2 - Thursday, July 22, 2021 - link
Lol, no.

zepi - Thursday, July 22, 2021 - link
There are multiple reasons to buy Intel.

mode_13h - Thursday, July 22, 2021 - link
> There are multiple reasons to buy Intel.

Sure, but I'm not asking about that. I'm asking about the specific claim quoted in the article and mentioned by @hetzbh.
mode_13h - Thursday, July 22, 2021 - link
And you know this from experience, or are you just speculating? And if the former, were you on the sales side or a volume purchaser?
Spunjji - Monday, July 26, 2021 - link
That really does depend on the size / expertise of the customer / reseller involved.

domih - Monday, July 26, 2021 - link
That too!

Oxford Guy - Sunday, July 25, 2021 - link
AVX-1024!

Niches within niches, but look how deep this niche goes!

mode_13h - Sunday, July 25, 2021 - link
If there's one thing Intel knows how to do, it's more of what they've done before!

Foeketijn - Thursday, July 22, 2021 - link
Power and cooling is not cheap in a colo. Using 300W more for the same performance will easily set you back 1,000 bucks a year.

mode_13h - Thursday, July 22, 2021 - link
Yeah, I'd have expected power-efficiency to be the top priority, followed by density.

Spunjji - Monday, July 26, 2021 - link
Ouch!
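The 300W colo arithmetic above roughly checks out if you assume an all-in rate (utility power plus cooling/PUE overhead) in the neighborhood of $0.35/kWh - an assumed figure, since real colo billing varies widely:

```python
# Annual cost of an extra 300 W running 24/7 in a colo.
# The all-in rate (utility power plus cooling/PUE overhead) is an
# assumption for illustration; real colo billing varies widely.

extra_watts = 300
hours_per_year = 24 * 365     # 8760
all_in_rate = 0.35            # $/kWh, assumed

kwh_per_year = extra_watts / 1000 * hours_per_year   # 2628 kWh
annual_cost = kwh_per_year * all_in_rate
print(f"{kwh_per_year:.0f} kWh/year -> ${annual_cost:.0f}/year")
```

At that assumed rate the extra 300 W lands a bit over $900 a year, so "1,000 bucks a year easily" is the right order of magnitude.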
29a - Thursday, July 22, 2021 - link
I've also been complaining about ego mark forever, and now they added that terrible AI benchmark to the lineup, which they readily admit is bad data.
GeoffreyA - Saturday, July 24, 2021 - link
> "it's not impossible 3DPM hit one of those cases"

Possible, even likely. And if so, it's a bit of an unbalanced picture. It will be interesting to see what happens when AMD adds support.
GeoffreyA - Wednesday, July 28, 2021 - link
That's a good way of looking at it, Spunjji. You're right. Hopefully we can have those balanced, real-world examples in addition.
GeoffreyA - Sunday, July 25, 2021 - link
My mistake. x265.

mode_13h - Sunday, July 25, 2021 - link
Yeah, some of the rendering and encoding benchmarks use it.
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.AVX-512, as an instruction set, was a huge leap forward compared to AVX/AVX2. So much so that Intel created the AVX-512VL extension that allows one to use AVX-512 instructions on vectors smaller than 512-bits. As a vector programmer, here are the things I like about AVX-512:
1) Dedicated mask registers and every instruction can take an optional mask for zeroing/merging results
2) AVX-512 instructions can broadcast from memory without requiring a separate instruction.
3) More registers (not just wider ones)
Also, and this is kind of hard to explain, but AVX/AVX2 as an instruction set is really annoying because it acts like two SSE units. So, for example, you can't permute (or "shuffle" in Intel parlance) the contents of an AVX2 register as a whole; you can only permute the two 128-bit halves as if they were two SSE registers fused together. AVX-512 doesn't repeat this half-assed design approach.
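For anyone who hasn't written AVX-512 code, here's a rough scalar model (plain C, no intrinsics, so it runs anywhere) of what point 1 means. Each bit of the mask enables a lane; disabled lanes are either zeroed ({z} qualifier) or keep their old contents (merging). The function name and loop are purely illustrative, not real intrinsics:

```c
#include <stdint.h>

/* Scalar model of an 8-lane masked 64-bit add, mimicking AVX-512's
 * zeroing vs. merging semantics. In real AVX-512 code this is a single
 * vpaddq with a {k1} or {k1}{z} qualifier; an AVX2 codepath needs a
 * separate blend (and often a compare) to get the same effect. */
void masked_add(int64_t *dst, const int64_t *a, const int64_t *b,
                uint8_t mask, int zeroing)
{
    for (int lane = 0; lane < 8; lane++) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];  /* lane enabled: compute */
        else if (zeroing)
            dst[lane] = 0;                  /* {z}: zero disabled lanes */
        /* merging: disabled lanes keep dst's previous contents */
    }
}
```

The win is that this per-lane predication is baked into (almost) every AVX-512 instruction for free, instead of costing extra blend instructions and registers on the AVX2 side.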
mode_13h - Sunday, July 25, 2021 - link
> 1) Dedicated mask registers and every instruction can take an optional
> mask for zeroing/merging results
This seems like the only major win. The rest are just chipping at the margins.
More registers is a win for cases like fitting a larger convolution kernel or matrix row/column in registers, but I think it's really the GP registers that are under the most pressure.
AVX-512 is not without its downsides, which have been well-documented.
Spunjji - Monday, July 26, 2021 - link
@Elstar - Interesting info. Just makes me more curious as to how many of these things might be benefiting the 3DPM workload specifically. Another good reason for more people to get eyes on the code!

Dolda2000 - Saturday, July 24, 2021 - link
> All I want to do is see if people can close the gap between AVX2 and AVX-512 somewhat, or at least explain why it's as big as it is. Maybe there's some magic AVX-512 instructions that have no equivalent in AVX2, which turn out to be huge wins. It would at least be nice to know.

I don't remember where it was posted any longer (it was in the comment section of some article over a year ago), but apparently 3DPM makes heavy use of wide (I don't recall exactly how wide) integer multiplications, which were made available in vectorized form in AVX-512.
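If the multiplies in question are 64-bit, the relevant instruction would be AVX-512DQ's vpmullq (64x64 -> low 64 bits), which has no AVX2 equivalent; an AVX2 codepath has to synthesize it from 32-bit partial products (via vpmuludq) instead. A rough scalar sketch of that emulation — my reconstruction of the general technique, not 3DPM's actual code:

```c
#include <stdint.h>

/* Emulate a 64x64 -> low-64 multiply using only 32x32 -> 64 products,
 * the way an AVX2 codepath must:
 *   a*b = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)  (mod 2^64)
 * The hi(a)*hi(b) term only affects bits >= 64, so it is dropped. */
uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;  /* contributes bits 32..63 */
    return a_lo * b_lo + (cross << 32);
}
```

Three multiplies plus shuffles and adds per result, versus one instruction with AVX-512DQ — the kind of gap that could plausibly explain an outsized speedup on a multiply-heavy workload.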
dwbogardus - Saturday, July 24, 2021 - link
Performance optimization is converged upon from two different directions: 1) the code users run to perform a task, and 2) the compute hardware upon which the code is intended to run. As an Intel engineer, I was for some time in a performance evaluation group. We ran many thousands of simulations of all kinds to quantify the performance of our processor and chipset designs before they ever went to silicon. This was in addition to our standard pre-silicon validation; pre-silicon performance validation was to demonstrate that the expected performance was being delivered. You may rest assured that every major architectural revision or addition to the silicon, and its power consumption, was justified by demonstrated performance improvements.

Once the hardware is optimized, the coders dive into optimizing their code to take best advantage of the improved hardware. It is sort of like a "double-bounded successive approximation" toward a higher performance target, from both the HW and SW directions. No surprise that benchmarks are optimized to the latest and highest-performing hardware.

GeoffreyA - Sunday, July 25, 2021 - link
Fair enough. But what if the legacy code path, in this case AVX2, were suboptimal?

mode_13h - Sunday, July 25, 2021 - link
> You may rest assured that every major silicon architectural revision
> or addition to the silicon and power consumption was justified
> by demonstrated performance improvements.
Well, it looks like you folks failed on AVX-512 -- at least, in Skylake/Cascade Lake:
https://blog.cloudflare.com/on-the-dangers-of-inte...
I experienced this firsthand, when we had performance problems with Intel's own OpenVINO framework. When we reported this to Intel, they confirmed that performance would be improved by disabling AVX-512. We applied *their* patch, effectively reverting it to AVX2, and our performance improved substantially.
I know AVX-512 helps in some cases, but it's demonstrably false to suggest that AVX-512 is *only* an improvement.
However, that was never the point in contention. The question was: how well would 3DPM perform with an AVX2 codepath that's optimized to the same degree as the AVX-512 path? I fully expect AVX-512 would still be faster, but probably more in line with what we've seen in other benchmarks. I'd guess less than 2x.
mode_13h - Thursday, July 22, 2021 - link
> a modern dual socket server in a home rack with some good CPUs
> can no longer be tested without ear protection.
When I saw the title of this review, that was my first thought. I feel for you, and sure wouldn't like to work in a room with these machines!
[email protected] - Thursday, July 22, 2021 - link
Why is this still relevant? You can buy CPU 'cards' and stick them in a chassis, using less power and costing as much or less.

mode_13h - Friday, July 23, 2021 - link
Are you referring to blade servers? But they don't have the ability to host PCIe cards or a dozen SSDs like this thing does. I'm also not sure how their power budget compares, nor how much RAM they can have.

Anyway, if all you needed was naked CPU power, without storage or peripherals, then I think OCP has some solutions for even higher density. However, not everyone is just looking to scale massive amounts of raw compute.
mamur - Saturday, July 24, 2021 - link
Intel is dead. Muh AMD will rule the world. Why even allow comments like that? To say you are stupid too?

ceomrman - Monday, July 26, 2021 - link
The conclusions in the article are confusing. I'm seeing the Supermicro with similar or superior per-thread performance on every workstation load tested, excluding Photoscan (which slow cores suck at). On SPECint and SPECfp, the 6330 can only match Zen 2. It's whooped by Zen 3, but the Xeon is so much cheaper that it still wins the price-performance ratio. If you happen to have a use for AVX-512, that's a big Intel win, too.

So overall, Intel wins price/performance on every measure. It's a clear "second place" on the very important SPEC measures. It seems like a highly competitive option, but of course, watts are crucial! Availability and volume discounts are hard to measure, but I wish there were some better power consumption data presented. Two 28-core Intel 6330 chips cost a good chunk less than a single AMD 75F3. That's great, but how much more heat can the Intel be expected to generate? These servers could make great VM hosts, or they could be money pits - I want to understand efficiency better.
mode_13h - Tuesday, July 27, 2021 - link
> similar or superior per-thread performance on every workstation load tested

Huh? It also lost on LLVM compile, Blender, SPECint, and Corona.
This is basically a nonsense metric for workstations. Per-thread performance only really matters for cloud workloads; on a workstation, you run on bare hardware, which means you can use *all* of the threads it provides.