Raqia - Wednesday, April 21, 2021 - link
Do they have any metrics or benchmarks more oriented to the training side of the pipeline, and will you be incorporating any of these inference tests into AnandTech's benchmarking suite? (Interesting inference results when factoring in power: seems like Qualcomm's solution stands out for perf/W.)
Yojimbo - Wednesday, April 21, 2021 - link
They do, but those are released at a different time. The latest round of training benchmarks are the version 0.7 HPC Training benchmarks released November 17, 2020. To my eye there's not much there: Lawrence Berkeley showing how bad the Xeon Phi is at training compared to the V100, Fujitsu showing how much better the V100 is at training than the A64FX ARM processor in Fugaku, and the Swiss Supercomputing Center showing how outdated the P100 is for training.
The latest normal training benchmarks are the version 0.7 benchmarks released July 29, 2020. https://mlcommons.org/en/training-normal-07/
I read somewhere in passing that the 1.0 training results are due out in 3 months.
As far as Qualcomm's inference results go, it only submitted results for two models. The published results only show its solution standing out in perf/W in ResNet-50, which is a relatively small CNN. My guess is that it doesn't stand out much in larger and non-CNN models.
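To make the perf/W point concrete, here's a toy Python calculation with made-up throughput and power numbers (placeholders only, not actual MLPerf submissions): efficiency is just reported offline throughput divided by measured system power.

# Hypothetical numbers for illustration only -- not real MLPerf results.
submissions = {
    # name: (ResNet-50 offline throughput in samples/s, system power in W)
    "accelerator_a": (78_000, 300),
    "accelerator_b": (310_000, 1_600),
}

for name, (throughput, power_w) in submissions.items():
    # Perf/W is throughput normalized by wall power.
    print(f"{name}: {throughput / power_w:,.0f} samples/s per watt")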
Ian Cutress - Thursday, April 22, 2021 - link
Of course there are arguments there for both Phi/A64FX. Phi wasn't really built with ML in mind, and Knights Mill was a bit of a hack in the end. A64FX was built for HPC, not ML. NV has been arming its arch with ML in mind for several generations now. Speaking with a lot of the custom AI chip companies, it seems that their customers aren't too interested in MLPerf numbers, as their own models seem to be different enough that it's not worth the AI chip companies even bothering to run MLPerf and submit results.
Yojimbo - Thursday, April 22, 2021 - link
The Fugaku supercomputer is definitely phenomenal (expensive, too). But the MLPerf results show that it's not very efficient for AI training. As for Xeon Phi, it was a complete misstep.
It is quite literally Fujitsu benchmarking a V100 system and also benchmarking a slice of Fugaku, and LBNL benchmarking a V100 system and also benchmarking a Xeon Phi system.
Regarding the AI startups, they can't possibly engage personally with every enterprise. They would benefit immensely from MLPerf results; they just can't easily run the benchmarks yet, at least not with results that show their value. And if they can't easily run the benchmarks yet, they can't easily adapt to whatever models people are actually running. That's why the uptake of these chips has so far mostly been by institutions whose purpose for buying the machines includes researching and evaluating the technology in them. As for the dearth of MLPerf results from the AI chip startups, I refuse to believe that any AI startup running the table with MLPerf wouldn't create a buzz that would have potential customers knocking at its door and potential investors clamoring to give it more money. And it's not like their bookings are full as it is.
Most of the well-funded chips are training-focused. Hopefully we'll see some results from Graphcore, SambaNova, Cerebras, and Habana in the upcoming training results. These companies seem to have no trouble using these various standard models in their marketing slides, so it's not like they ignore them. The whole point of MLPerf was to progress from the point where everyone cherry-picks favorable results with carefully controlled parameter choices to compare themselves with their competitors.
mode_13h - Friday, April 23, 2021 - link
> Speaking with a lot of the custom AI chip companies, it seems that their customers aren't too interested in MLPerf numbers
Speaking with a lot of foxes, it seems that hens aren't really interested in more secure chicken coops...
LOL. Maybe deep-pocketed, cutting-edge AI researchers aren't too interested in MLPerf numbers, but the bulk of the market isn't using such cutting-edge stuff. Maybe some older networks can be dropped from these metrics, but I think the primary reason AI chip makers downplay the importance of benchmarks is that they fear they wouldn't top the leaderboard (or not for long, even if they're currently in front).
Honestly, without benchmarks, how are people supposed to make informed decisions? We can't all trial every solution and test it on our current workloads. Even then, everyone wants room to grow and some sense that when we get to the point of deploying next-gen networks, we won't be stuck with HW that won't play ball. This is probably a large part of the enduring popularity of GPUs: they're among the most flexible solutions.
RSAUser - Friday, April 30, 2021 - link
The biggest problem is that the AI landscape is still progressing quite quickly, so most of these benchmarks become outdated in too short a time frame and can be misleading.
At least those looking at ML benchmarks are hopefully better informed as to what they mean, so they can use them correctly. It will probably turn into a marketing stunt instead, though, with ill-informed CTOs wanting the one option with the higher rank even though it doesn't fit their use case. I've seen it happen so often.
mode_13h - Friday, April 30, 2021 - link
Good points, but I'm still reluctant to agree that having no standard benchmarks is a better situation than having a few outdated ones. If the main problem is staying current, then refreshing them every couple of years could at least help.
Yojimbo - Friday, April 30, 2021 - link
They are refreshing them more often than that. That's why the AI startups can't keep up. Again, if they can't keep up with the standard benchmarks, they can't keep up with the myriad of models out there. With inference, the trained models need to be tuned to run well on the hardware. And even in a mature industry, benchmarks don't tell you how a particular product will work with your workload. An overview of several benchmarks gives one a clue of where to look. That's why it's important to submit results to more than one or two benchmarks. At this point, ResNet-50 isn't very useful, not by itself anyway.
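To illustrate what that tuning can look like, here is a minimal, hedged Python sketch of one common technique, post-training dynamic quantization in PyTorch. The model is a made-up stand-in rather than any MLPerf submission, and real deployments would typically go through a vendor's own compiler or toolchain.

# Minimal sketch: dynamic quantization as one example of tuning a trained
# model for the target hardware. The model below is hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(      # stand-in for a trained network
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Convert the Linear layers' weights to int8; activations are quantized
# on the fly at inference time, trading a little accuracy for speed on
# CPUs with fast int8 paths.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 10])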
brewerfaith - Wednesday, May 5, 2021 - link
The results today are all focused around inference – the ability of a trained network to process incoming unseen data. The tests are built around a number of machine learning areas and models attempting to represent the wider ML market, in the same way that SPEC2017 tries to capture common CPU workloads. For MLPerf Inference, this includes: https://geometry-dash.io
mode_13h - Wednesday, May 5, 2021 - link
These spammers just copy a paragraph out of the article and append their link.