lemurbutton - Tuesday, June 13, 2023 - link
Congratulations to AMD. Its MI300X GPU will match the M2 Ultra in GPU memory capacity.
mdriftmeyer - Tuesday, June 13, 2023 - link
It surpasses it by 96bps on an 8192-bit bus. In short, it stomps all over it.
hecksagon - Tuesday, June 13, 2023 - link
MI300X hits 5,200 GB/s compared to the M2's 800 GB/s. It isn't even close.
Gm2502 - Tuesday, June 13, 2023 - link
How to show you know nothing about computer architecture without saying you know nothing. Rates as one of the stupidest comments of the year...
name99 - Tuesday, June 13, 2023 - link
It's not a completely stupid comment.

At the time of the M2 Ultra reveal, a lot of people were mocking it as useless because it did not have the raw compute of high-end nV or AMD. The obvious rejoinder to that is that compute is not the whole story; there is real value in less compute coupled with more RAM.
MI300 shows that this wasn't just copium. AMD ALSO believes there's real value in adding massive (ie more than current nV) amounts of RAM to a decent, even if not best-in-the-world, amount of compute.
You can squabble about the details: whether Apple has enough GPU compute, whether AMD's bandwidth is over-specced, whether Apple's SLC captures enough data reuse in the particular case of interest (training LLMs) to effectively match AMD's memory performance.
But all of those are details; the single biggest point is that this is a validation of Apple's belief that raw compute à la nV is going in the wrong direction; that compute needs to be paired with a *large enough* pool of RAM to handle upcoming problems.
Gm2502 - Tuesday, June 13, 2023 - link
What are you talking about? You are literally spouting total rubbish. The M2 Ultra is a desktop/workstation-class APU utilising DDR5 unified memory. HEDT and workstation machines like the Mac Pro normally require large amounts of physical system memory plus generous VRAM for graphics. The previous Mac Pro could take 1.5TB of system RAM and 48GB of VRAM. The new M2 Mac Pro needed at least 192GB of RAM, with the vast majority of it going to system RAM, not VRAM; that's what is needed to handle CAD and other graphics/video editing requirements. The GPU horsepower sucks compared to dedicated GPUs because of limited die space, not because of some push by Apple to reduce processing in lieu of adding more RAM. Apple is betting it can optimise the software to accommodate the less powerful hardware, giving a similar experience for these very narrow use cases.

The MI300X has a huge amount of dedicated high-bandwidth RAM (given the almost 7x bandwidth increase, its 192GB of RAM is massively more efficient and performant than the M2's), allowing full LLM model parameters to be stored directly in memory for processing. The compute power of this chip is a complete monster, and its capacity is only limited by the 24GB HBM3 stacks; otherwise it would have even more RAM. You are literally clutching at straws and drawing a false equivalence between two massively different technologies used for completely different things, and my original comment now extends to you too.
lemurbutton - Tuesday, June 13, 2023 - link
It uses LPDDR5, not DDR5.
Gm2502 - Wednesday, June 14, 2023 - link
Fair point, I used DDR5 as a catch-all; should have been more specific.
whatthe123 - Wednesday, June 14, 2023 - link
lol what... have you been living under a rock? Large memory pools on AI systems have been standard for the better part of a decade. It's one of the reasons CPUs are still used even though they are an order of magnitude slower than AMD/Nvidia GPUs. Good lord.
MINIMAN10000 - Thursday, June 15, 2023 - link
So the M2 Ultra not having enough GPU compute only really comes up in the context that they brought up. Training LLMs requires absurd amounts of compute; training simply isn't worth attempting on an M2 Ultra. As far as inference goes, however, it may not be the fastest, but it should be able to run future models with record-setting numbers of parameters that can't be run on anything even remotely close to the M2 Ultra in price. Assuming they get the software working (Apple is pretty good at that), the M2 Ultra will allow for some bleeding-edge testing.

I just wanted to point out that when it comes to large LLMs (AI is pretty much the reason why you would use this much RAM), bandwidth is king. The faster you can move data, the faster you can run inference (talking to the AI). So ideally we should see some incredible results with this thing for something like Guanaco 65B and larger.
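To make the "bandwidth is king" point concrete, here is a rough back-of-envelope sketch (the figures are illustrative assumptions, not measurements; real decode speed will be lower once compute, KV-cache traffic and software overhead are counted):

```python
# If decoding is memory-bandwidth-bound, every generated token has to stream
# roughly all of the model weights from memory once, so:
#   tokens/s  <=  memory bandwidth / model size in bytes
def decode_tok_per_sec_upper_bound(params_billion, bytes_per_param, bandwidth_gb_s):
    model_size_gb = params_billion * bytes_per_param   # 1e9 params * bytes each = GB
    return bandwidth_gb_s / model_size_gb

# A 65B model quantized to ~4 bits (0.5 bytes) per parameter is ~32.5 GB of weights.
for name, bw in [("M2 Ultra (~800 GB/s)", 800), ("MI300X (~5,200 GB/s)", 5200)]:
    bound = decode_tok_per_sec_upper_bound(65, 0.5, bw)
    print(f"{name}: <= ~{bound:.0f} tok/s for a 4-bit 65B model")
```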
Makaveli - Tuesday, June 13, 2023 - link
Not really surprising; the obvious Apple fanboy only understands marketing slides and not the actual technology. MI300X is going to stomp!
name99 - Wednesday, June 14, 2023 - link
Perhaps so, but let's try to swing the discussion back to the actual point I raised.

We know that the current state of the art (as of this week, it changes month by month!) is that a 4090 can generate about 100 tok/s from a 7B-parameter model. An M2 Max can generate about 40 tok/s. It's very unclear how the Ultra will scale, but the worst case is likely to be 1.5x, giving perhaps 60 to 70 tok/s. Now that's not 4090 level, but it's not that far different.
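As a quick illustration of that estimate (arithmetic only, using the figures quoted above; the 1.5x worst-case Max-to-Ultra scaling is an assumption, not a measurement):

```python
m2_max_tok_s = 40        # reported ~40 tok/s for a 7B model on M2 Max
rtx_4090_tok_s = 100     # reported ~100 tok/s for a comparable model on a 4090
worst_case_scaling = 1.5 # assumed worst-case scaling from M2 Max to M2 Ultra

m2_ultra_est = m2_max_tok_s * worst_case_scaling
print(f"M2 Ultra (worst case): ~{m2_ultra_est:.0f} tok/s, "
      f"about {m2_ultra_est / rtx_4090_tok_s:.0%} of the 4090 figure")
```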
This is of course inference, but again that's kinda the point. We have a machine that's good enough for interesting performance using existing algorithms and data sets, but with the potential to try things that are different and have not so far been attempted because of memory constraints. You aren't necessarily going to be training a full LLM from scratch, but you can engage in a variety of interesting experiments in a way that's not practical when you are working with a shared resource.
For example, connect this with the fact that the next macOS will come with embeddings for almost 50 languages. This opens up a space for "experimental linguistics" in a way that's never been possible before, and it allows amateurs, or academics who are not interested in the commercial LLM space, to perform a large number of experiments. For example, as I have suggested elsewhere, you can experiment with creating embeddings using a sequence of texts, first at the 3-year-old level, then at the 4-year-old level, then at the 5-year-old level, and so on. How well does this sort of "training" (ie building successively richer embeddings) work compared to just gulping down and trying to process the entire internet corpus? How do the results compare with the Apple embeddings at each stage? Are the results different if you follow this procedure for different languages? How do the Apple embeddings for different languages differ in "richness", and is there some sort of core where they are all the same? What happens if you "blind" an LLM by knocking out certain portions of the embedding? Etc, etc.
All this stuff becomes available now in a way that was not the case a year ago, and it should be wildly exciting to anyone interested in any aspect of language!
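As a purely hypothetical sketch of one such comparison ("how similar are my stage-N embeddings to a reference set?"), something like the following would do; the toy vocabulary and random vectors are made-up stand-ins for real embedding tables:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["water", "dog", "gravity", "irony"]   # toy vocabulary, purely illustrative
dim = 8                                        # toy embedding width

# Stand-ins for real tables: e.g. a reference embedding set vs. one trained on an
# age-graded corpus. Random vectors here, only so the sketch runs end to end.
reference = {w: rng.normal(size=dim) for w in vocab}
stage_3yo = {w: rng.normal(size=dim) for w in vocab}

def mean_cosine_similarity(a, b):
    """Average cosine similarity over the vocabulary shared by two embedding dicts."""
    shared = set(a) & set(b)
    sims = [float(a[w] @ b[w] / (np.linalg.norm(a[w]) * np.linalg.norm(b[w])))
            for w in shared]
    return sum(sims) / len(sims)

print(f"stage vs reference similarity: {mean_cosine_similarity(stage_3yo, reference):+.3f}")
```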
Now maybe MI300 contributes to this exercise in some way? I hope so, but I think MI300 is a different sort of product that will not be purchased by amateur experimental linguists to the same extent.
Xajel - Wednesday, June 14, 2023 - link
Technically, MI300X surpasses the M2 Ultra, because the M2 Ultra shares its memory with the CPU cores, so some of that 192GB is always allocated to the CPU cores and can't be utilized by the GPU. Even if it's only 1-2GB, it's still there and the GPU can't use it.

The MI300X, on the other hand, can use everything, not to mention the much higher bandwidth it has.
But again, the M2 Ultra and MI300X are two different products for two different usages; it's totally stupid to compare the two in the first place. Apple doesn't have anything to compare with the MI300X, period.
Silver5urfer - Wednesday, June 14, 2023 - link
The M2 Ultra got destroyed in ST and MT performance by a mere mainstream i9-13900K as well as the R9 7950X; granted, it's all in that garbage Geekbench software, but it's still a win for x86 and a big L for Apple. Then it got GAPPED by AMD's older Threadripper Zen 3 parts. Second, the fully specced M2 Ultra got destroyed by a 4060 Ti in OpenCL.

The M2 Ultra machine costs $6,000 base with a crippled GPU and not the full-capacity 192GB of DRAM, while $6,000 of mainstream parts can get you top-notch specifications for the price, with a 4090 that will destroy the M2.
Apple stole Wine and is using it to port some games to their stupid platform, which has EOL policies decided by a mere OS release, like axing 32-bit support, older hardware, etc. The company's products are for consumers, not HPC, and certainly not even for enthusiasts. They are made for the average Joe, for the vanity factor, to feed the premium-luxe addiction of modern people.
Now, the MI300 is a compute monster: it destroys Nvidia's Hopper H100 with more HBM and more FP32 performance; the only thing it lacks is tensor cores like Nvidia's. The Instinct also has a Zen 4 CPU cluster inside it; this thing is an exascale HPC/AI monster and it will be used in supercomputers. How can a stupid M2 Ultra, a soldered pile of rubble with a proprietary trash SSD, even dare to look at its spec sheet and technical achievements?
Please don't pollute the comment sections for Intel, AMD and Nvidia hardware with your Apple fanaticism.
hecksagon - Wednesday, June 14, 2023 - link
AI training and inference is one area where the M2 would absolutely obliterate any AMD or Intel mainstream platform. They just don't have enough memory bandwidth to be competitive.
Gm2502 - Wednesday, June 14, 2023 - link
Are you high? Please share some proof there, my ill-informed keyboard warrior.
Trackster11230 - Thursday, June 15, 2023 - link
Please also don't pollute it with your blind Apple hatred.
Gm2502 - Thursday, June 15, 2023 - link
You are literally just posting rabid Apple make-believe crap.
Trackster11230 - Friday, June 16, 2023 - link
This is the first comment I've posted on AnandTech in over a year. What are you talking about?
ballsystemlord - Tuesday, June 13, 2023 - link
I want one.

Now I just need to figure out how to get it without selling both of my kidneys and my heart.
jeromec - Wednesday, June 14, 2023 - link
Just checked the price of Nvidia's H100 AI accelerator with 80GB of memory: it retails for 28 kUSD on Amazon. A Mac Studio M2 Ultra with 76 GPU cores and 192GB of RAM starts at 6.6 kUSD, about a quarter of that.
So we can reasonably expect the MI300X with 192GB of RAM to be several times the price of the Mac Studio with 192GB of RAM.
Which probably makes the Mac Studio an interesting choice to run large language models in RAM, even if it is several times slower than the MI300X.
Mizuki191 - Wednesday, June 14, 2023 - link
Good luck with no ECC RAM...
Not really needed for LLMs. Small, infrequent errors are not likely to affect the actual functioning of the model.
erinadreno - Wednesday, June 14, 2023 - link
The thing about GPGPU is not just rendering graphics; it's also about the API support. Nvidia being Nvidia has largely to do with the ease of use of their CUDA API. AMD is playing catch-up here, but they still don't have a user-friendly OpenCL development library on Windows, i.e. not even Python can use AMD graphics for compute without hand-coding an OpenCL source file. Apple is somewhat in between: thanks to their closed ecosystem, they only need to support their own operating system for GPU acceleration.

However, from my experience with Apple's M1, M1 Max and M2 series chips, I'm kinda suspicious about the interconnect of their chips.
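For what it's worth, here is roughly what the OpenCL-from-Python path mentioned above looks like via the third-party pyopencl package (a sketch, assuming a working OpenCL runtime for the GPU); note that the kernel itself is still hand-written OpenCL C passed around as a source string, which is exactly the pain point:

```python
import numpy as np
import pyopencl as cl

# The device-side code is still plain OpenCL C, written by hand as a string.
KERNEL_SRC = """
__kernel void scale(__global const float *x, __global float *y, const float a) {
    int i = get_global_id(0);
    y[i] = a * x[i];
}
"""

ctx = cl.create_some_context()              # pick whatever OpenCL device is available
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL_SRC).build()  # compile the kernel at runtime

x = np.arange(16, dtype=np.float32)
y = np.empty_like(x)

mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, y.nbytes)

prog.scale(queue, x.shape, None, x_buf, y_buf, np.float32(2.0))
cl.enqueue_copy(queue, y, y_buf)            # read the result back to host memory
print(y)                                    # expect x doubled
```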
Zoolook - Sunday, June 18, 2023 - link
Porting CUDA code to ROCm via HIP works quite well, and products like these new Instinct cards should help accelerate the development of the APIs.
Any idea of the price for this thing? (Which might add to the comparability with the M2 Ultra, btw.)
jeromec - Wednesday, June 14, 2023 - link
Just checked the price of Nvidia's H100 AI accelerator with 80GB of memory: it retails for 28 kUSD on Amazon. A Mac Studio M2 Ultra with 76 GPU cores and 192GB of RAM starts at 6.6 kUSD, about a quarter of that.
So we can reasonably expect the MI300X with 192GB of RAM to be several times the price of the Mac Studio with 192GB of RAM.
Which probably makes the Mac Studio an interesting choice to run large language models in RAM, even if it is several times slower than the MI300X.
(sorry, I initially replied to the wrong message)
scientist223311 - Wednesday, June 21, 2023 - link
Joining the previously announced 128GB MI300 APU, which is now being called the MI300A, AMD is also producing a pure GPU part using the same design. This chip, dubbed the MI300X, uses just CDNA 3 GPU tiles rather than a mix of CPU and GPU tiles in the MI300A, making it a pure, high-performance GPU that gets paired with 192GB of HBM3 memory. Aimed squarely at the large language model market, the MI300X is designed for customers who need all the memory capacity they can get to run the largest of models.