24 Comments

  • Threska - Tuesday, March 21, 2023 - link

    Buy one of these and have ChatGPT at home.
  • satai - Tuesday, March 21, 2023 - link

    To run it? Yes (but actually no, you probably wouldn't get the weights; the closest thing you can get your hands on is probably a torrent of LLaMA).

    To infer it? No.
  • brucethemoose - Tuesday, March 21, 2023 - link

    I'm sure the community will cook up some low-memory finetuning schemes for LLaMA or whatever else catches on. Alpaca was already tuned with relatively modest hardware.
  • satai - Tuesday, March 21, 2023 - link

    LLaMA is now down to 4-bit weights, and I guess that's as low as it goes for now.
    So now you can get by with something like a 20GB or 24GB card for medium-size models.

    So - running such a model is quite possible. (The issue is that you probably won't get hold of such a model in a licence-friendly way.) To infer such a model... oh, that's a bit of a different story for now and probably for years (decades?) to come.
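
    A minimal sketch of what satai describes above: loading a LLaMA checkpoint with 4-bit weights so it fits on a single 20-24GB card. The model path is a placeholder, and this assumes a recent transformers release with the bitsandbytes integration installed.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_DIR = "/models/llama-13b-hf"  # placeholder path to locally obtained weights

    quant_cfg = BitsAndBytesConfig(
        load_in_4bit=True,                     # pack weights into 4-bit blocks
        bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        quantization_config=quant_cfg,
        device_map="auto",                     # let accelerate place layers on the GPU
    )

    prompt = "The H100 NVL is"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    ```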
  • brucethemoose - Tuesday, March 21, 2023 - link

    It's much less when rounded down to 4 bits, probably even less with some clever swapping or frameworks like DeepSpeed, though the larger LLaMA models are better.

    Maybe I am misinterpreting what you mean by "infer," but finetuning Stable Diffusion with LoRA can comfortably squeeze onto 6GB cards (and less comfortably onto 4GB), whereas inference eats around 3.3GB for reference.
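
    A minimal sketch of why LoRA finetuning fits in so little VRAM: the big pretrained weight stays frozen, and only two small low-rank matrices get gradients and optimizer state. The class name, rank, and scaling below are illustrative, not taken from any particular repo.

    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)   # frozen pretrained weight
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # frozen full-rank path plus a trainable low-rank correction
            return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

    # Wrapping one 768x768 projection: ~590K frozen weights, only ~12K trainable.
    layer = LoRALinear(nn.Linear(768, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 2 * 8 * 768 = 12288
    ```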
  • satai - Tuesday, March 21, 2023 - link

    You can use such a model on a mainstream card, but you still can't construct it. So we are still dependent on whoever provides (leaks?) an already-trained model.
  • brucethemoose - Tuesday, March 21, 2023 - link

    Yep.

    But I think the probability of good LLMs releasing/leaking is high.

    And again, stable diffusion is a good example of what happens after that. SD 1.5 alone is antiquated compared to newer Midjourney, Dall-E and such, but with the endless community finetunes and extensions, it blows the cloud models away.
  • atomek - Wednesday, March 22, 2023 - link

    Inferencing is actually "running it". You train a network, and inferencing is executing inputs on the trained model.
  • cappie - Tuesday, May 16, 2023 - link

    I think you're mixing up inferencing and training. Running the model is what they mean by inferencing. Training usually takes way more memory, as you need the extra data for your gradients while performing any form of backprop.
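
    A rough sketch of the memory difference cappie describes: a no_grad() forward keeps only the weights and the current activations, while a training step also allocates gradients plus optimizer state (Adam holds two extra tensors per parameter). The toy model size is arbitrary.

    ```python
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
    params = sum(p.numel() for p in model.parameters())
    print(f"weights: {params * 4 / 2**20:.1f} MiB (fp32)")

    # Inference: no gradient buffers are created at all.
    with torch.no_grad():
        _ = model(torch.randn(1, 4096))
    print("grads allocated:", any(p.grad is not None for p in model.parameters()))

    # One training step: gradients + Adam moments roughly triple the parameter memory.
    opt = torch.optim.Adam(model.parameters())
    model(torch.randn(1, 4096)).sum().backward()
    opt.step()
    grad_bytes = sum(p.grad.numel() * 4 for p in model.parameters())
    state_bytes = sum(t.numel() * 4 for s in opt.state.values()
                      for t in s.values() if torch.is_tensor(t))
    print(f"extra for training: {(grad_bytes + state_bytes) / 2**20:.1f} MiB")
    ```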
  • p1esk - Tuesday, March 21, 2023 - link

    The big question is whether the memory is exposed as a unified 188GB or as 2x94GB. I mean, how will it show up in PyTorch?
  • Ryan Smith - Tuesday, March 21, 2023 - link

    The answer to that is however PyTorch would treat a dual H100 setup today. That part is unchanged.
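
    In other words, the NVL pair should enumerate as two CUDA devices with their own ~94GB pools, not as one 188GB device. A minimal sketch of how that looks from PyTorch (the printed output is an assumption, not a measured log):

    ```python
    import torch

    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} -> {props.name}, {props.total_memory / 2**30:.0f} GiB")
    # Expected on an H100 NVL pair:
    #   cuda:0 -> NVIDIA H100 NVL, 94 GiB
    #   cuda:1 -> NVIDIA H100 NVL, 94 GiB
    ```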
  • p1esk - Tuesday, March 21, 2023 - link

    That is disappointing.
  • mode_13h - Wednesday, March 22, 2023 - link

    It's not surprising. The 600 GB/s link between cards is a mere 15% of the onboard bandwidth. If software naively treated it as a single GPU, performance would be garbage.

    Conversely, I'm sure it's now well-supported for software to divide up big networks across multiple GPUs and align the division with a layer boundary. If you do it that way, the NVLink is probably no bottleneck at all.
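
    A minimal sketch of that layer-boundary split, assuming two CUDA devices are visible: the first half of the blocks lives on cuda:0, the second half on cuda:1, and the only cross-GPU traffic is one hidden-state tensor at the boundary.

    ```python
    import torch
    import torch.nn as nn

    class TwoGPUStack(nn.Module):
        def __init__(self, dim: int = 4096, depth: int = 8):
            super().__init__()
            blocks = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
            self.first = nn.Sequential(*blocks[: depth // 2]).to("cuda:0")
            self.second = nn.Sequential(*blocks[depth // 2 :]).to("cuda:1")

        def forward(self, x):
            x = self.first(x.to("cuda:0"))
            # the only inter-GPU transfer: the activations at the layer boundary
            return self.second(x.to("cuda:1"))

    model = TwoGPUStack()
    out = model(torch.randn(2, 4096))
    print(out.device)  # cuda:1
    ```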
  • abufrejoval - Wednesday, April 19, 2023 - link

    While I understand your sentiment, it's physics.

    But playing around with Llama on two V100s (which can't do all the nice low-precision weight tricks), I noticed very little slowdown from the 2nd V100, which in my case only shared the PCIe bus, too.

    I guess the explanation is that Llama, like the other LLMs, is *already* split into many graphs to manage the weight updates during training, which happens on thousands of GPUs after all, connected via InfiniBand at best, because there is nothing else readily available to OpenAI.

    So the penalty and pain typically associated with models that outgrow a single GPU's memory space have already been worked around as much as possible in the current breed of LLMs, and that is why they don't deteriorate as much as you'd think when they are spread across devices for inference, too.

    I was even surprised to see how well Llama tolerated having some graphs moved to the CPU as well, where unfortunately my hardware (and the current PyTorch release) doesn't yet support the lower-precision weights (and you can't seem to mix weight precisions across graphs due to software constraints). But that is about to change going forward, as CPU vendors don't want to lose out on the opportunity and would like to play their RAM-size card.

    I think I could load the 13B (or 30B?) Llama into 768GB of RAM on a 28-core Skylake, but it ran at one letter per second, if that, which is rather impractical.

    30B is much more fun on my RTX3090 with 24GB VRAM and 4-bit weights, and I believe I have seen code for 3-bit weights, too.

    I did not get to play around with that yet.
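
    A sketch of the kind of split described above, assuming a locally available checkpoint and the accelerate-backed device_map support in transformers: fill the two V100s first, then spill the remaining layers to CPU RAM. The path and memory caps are placeholders.

    ```python
    import torch
    from transformers import AutoModelForCausalLM

    MODEL_DIR = "/models/llama-30b-hf"  # placeholder path

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        torch_dtype=torch.float16,      # V100s: fp16, no 4-bit kernels
        device_map="auto",              # fill cuda:0, then cuda:1, then the CPU
        max_memory={0: "30GiB", 1: "30GiB", "cpu": "200GiB"},
    )
    print(model.hf_device_map)          # shows which layers landed on which device
    ```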
  • brucethemoose - Tuesday, March 21, 2023 - link

    "12x the GPT3-175B inference throughput"

    That is a very interesting claim, as GPT-3 is a closed source model with precisely 1 user: OpenAI.

    The open source models are kinda being cobbled together into usable repos as I type this.
  • p1esk - Tuesday, March 21, 2023 - link

    The model architecture is well known; it was described in the paper. The weights have not been released, but they are not needed to measure the hardware performance.
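
    A sketch of that point: throughput depends on the architecture, sequence length, and data types, not on the weight values, so a randomly initialized block with GPT-3's published width (d_model 12288, 96 heads) exercises the hardware the same way the real checkpoint would. Timing a single block here is illustrative; a full benchmark would shard all 96 layers across GPUs.

    ```python
    import time
    import torch
    import torch.nn as nn

    # One transformer block with GPT-3 175B's width, random weights, fp16.
    layer = nn.TransformerEncoderLayer(
        d_model=12288, nhead=96, dim_feedforward=4 * 12288,
        batch_first=True, dtype=torch.float16, device="cuda",
    ).eval()
    x = torch.randn(1, 1024, 12288, dtype=torch.float16, device="cuda")

    with torch.no_grad():
        for _ in range(3):              # warm-up
            layer(x)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(10):
            layer(x)
        torch.cuda.synchronize()
    print(f"{(time.time() - t0) / 10 * 1000:.1f} ms per layer forward")
    ```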
  • mode_13h - Wednesday, March 22, 2023 - link

    And who do you think supplies hardware to OpenAI?
  • p1esk - Wednesday, March 22, 2023 - link

    Microsoft
  • mode_13h - Thursday, March 23, 2023 - link

    For one thing, they only bought a controlling stake a couple of months ago.

    I'm talking about the hardware GPT-3 was developed on, last year.

    BTW, I'm sure Microsoft is just buying Nvidia GPUs. I've read they're doing stuff with FPGAs, but AFAIK MS has no comparable hardware solution for such large models.
  • abufrejoval - Wednesday, April 19, 2023 - link

    OpenAI is hosted on Azure. Have a look here:
    https://www.nextplatform.com/2023/03/21/inside-the...
  • brucethemoose - Tuesday, March 21, 2023 - link

    Also, AMD's and Intel's upcoming XPUs are a surprisingly good fit for this, depending on how much RAM they can take. Just being able to *fit* a mega model in a single memory pool is huge, even if they don't have the raw throughput of an HBM H100.
  • DigitalFreak - Wednesday, March 22, 2023 - link

    "customers aren’t getting access to quite all 96GB per card."

    I see Nvidia is up to their old tricks again. LOL
  • puttersonsale - Monday, March 27, 2023 - link

    They do this product, but they axe SLI over NVLink?

    They could totally support SLI; they're just trying to save $$$ by not having to update drivers, etc.
  • Morawka - Sunday, April 9, 2023 - link

    AFAIK Nvidia killing off SLI had more to do with how modern game engines render 3d graphics than any desire to cut costs for their driver development team.
