
  • ThereSheGoes - Tuesday, June 29, 2021 - link

    Sounds like someone is drinking the marketing Kool-Aid. Late is late is late.
  • at_clucks - Tuesday, June 29, 2021 - link

    No, they're launching the totally marketable tech that boosts performance right up until they finish selling the generation, discover that it was a massive security hole, deactivate it (wiping out years' worth of performance increases), and then launch the new-generation Crapfire Rapids CPU, which will bring some totally marketable tech that boosts performance right up until they finish selling the generation...
  • lilo777 - Wednesday, June 30, 2021 - link

    Too many folks here recently have been working hard trying to turn AnandTech into wccftech.
  • at_clucks - Wednesday, June 30, 2021 - link

    Is it working?
  • at_clucks - Wednesday, June 30, 2021 - link

    Clearly not... as disgusting as the Disqus commenting system is, at least it has an edit button.

    On a different topic, Intel has a propensity for touting various performance improvements that will make everything great again, only to later realize they were half-assed and sacrificed security for a short-lived benchmarking gain (the stuff that looks good on those marketing slides everyone eats up). The customers are left holding the castrated chips they paid top dollar for, and are encouraged to go for the new generation, which has the next generation of various performance improvements that will make everything great again, only to later realize... But I'm sure AMX and DSA on SPR won't have the same fate. ;)

    https://www.anandtech.com/show/6355/intels-haswell...

    https://www.phoronix.com/scan.php?page=news_item&a...
  • mode_13h - Thursday, July 1, 2021 - link

    > Crapfire Rapids

    Hilarious.

    I'll probably start referring to Sapphire Rapids as "Tire Fire Rapids" or maybe "Sapphire Tar Pits", if it starts to go the same way as Ice Lake SP.
  • Gondalf - Wednesday, June 30, 2021 - link

    Being better than AMD's offering by a wide margin, it is not late.
    AMD has no answer, since Zen 4 will only come out at the end of next year.
  • mode_13h - Thursday, July 1, 2021 - link

    AMD could surprise everybody with a Zen3 EPYC that has stacked SRAM cache and new IO die. Not saying they will, but there are options besides Zen4.
  • Qasar - Thursday, July 1, 2021 - link

    " Being better than AMD offering by a wide margin, it is not late. " until its released, its just opinion and speculation, specially if gondalf posts anything about it, or intel, is usually just anti amd BS from him anyway.
  • arashi - Tuesday, July 20, 2021 - link

    Gondaft is just Dylan Patel's smurf.
  • AshlayW - Friday, July 9, 2021 - link

    That is exactly what is coming: Zen3+ EPYC with over 1 GB of stacked L3 cache per socket. Sapphire Rapids is more or less DOA except for a few niche AI cases that are better off with NVIDIA GPUs anyway.
  • mode_13h - Saturday, July 10, 2021 - link

    Yeah, though AMX will certainly be an interesting feature. Intel has some other goodies in there, like DSA (Data Streaming Accelerators). Depending on how they use the HBM, Sapphire Rapids will still have some niches and benchmarks where it excels.
  • Unashamed_unoriginal_username_x86 - Tuesday, June 29, 2021 - link

    Incredible! Intel really is clawing its way back to the top with this aggressive cadence, launching in H3 2021!
  • Arsenica - Tuesday, June 29, 2021 - link

    Don't worry about dates, worry about the new exploits soon to be enabled by DSA!
  • JayNor - Tuesday, June 29, 2021 - link

    "enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512"

    Ice Lake doesn't have bfloat16 support. How does AMX performance compare to Cooper Lake?
  • SystemsBuilder - Tuesday, June 29, 2021 - link

    Agree. The "enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512" statement makes no sense. AMX is BF16 floats (plus byte + dword accumulate) and Ice Lake does not support BF16 (the closest is FP32). So yes, if you compare BF16 vs FP32 you probably get 2x throughput, but at half the precision (though the exponent range is the same), just by changing format and adjusting the FPU units in AMX -> that would mean no additional improvement beyond the format. Cooper Lake does support BF16, though, and that is the interesting like-for-like comparison.
  • mode_13h - Thursday, July 1, 2021 - link

    > How does AMX performance compare to Cooper Lake?

    AMX is probably going to be one of those 10x improvements we see, every now and then. The question will be how relevant it is, by the time SPR actually launches.

    If the hyperscalers and cloud operators are already mid-transition to HW AI accelerators by then, it could be worth little more than a few big benchmark wins to help shift the geomean.
  • SystemsBuilder - Thursday, July 1, 2021 - link

    AMX could potentially deliver something like a ~10x BF16 float improvement over Cooper Lake in VERY special cases. In commercial AI the matrices and data sets are big (>> larger than cache) -> the bottleneck is not the FPU capability of the CPU (AVX-512 or AMX) but the memory bandwidth/latency. AVX-512 is already quite powerful, so it can fully saturate the DRAM bandwidth doing large-scale FP32 matrices. The BF16 format alone allows a 2x improvement over FP32 (16 bits vs 32 bits, so 2 elements instead of 1 with the same data bandwidth). Unless Intel does something dramatic with the cache and this DSA "thing" (like near-L1-cache-level bandwidth for GBs of data), AMX will sit mainly idle waiting for data to feed it. The area where AMX could potentially pull out a ~10x FP throughput improvement is small matrices that can be kept entirely in the 8 AMX T registers (8KB of configurable matrix registers) and L1 cache. If the matrices are significantly bigger than that, then you end up with memory bandwidth being the bottleneck again... there is only so much that caches and register files can hide.
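
    To put rough numbers on that (a back-of-the-envelope sketch; the 8-channel DDR4-3200, 40-core, 2.5 GHz figures are illustrative assumptions, not Sapphire Rapids specs):

        \begin{aligned}
        B_{\mathrm{DRAM}} &\approx 8 \times 3200\,\mathrm{MT/s} \times 8\,\mathrm{B} \approx 205\ \mathrm{GB/s}\\
        P_{\mathrm{AVX512}} &\approx 40\ \mathrm{cores} \times 2\ \mathrm{FMA} \times 16\ \mathrm{FP32} \times 2\ \mathrm{FLOP} \times 2.5\,\mathrm{GHz} \approx 6.4\ \mathrm{TFLOPS}\\
        P_{\mathrm{AVX512}} / B_{\mathrm{DRAM}} &\approx 31\ \mathrm{FLOP/byte}
        \end{aligned}

    So even plain AVX-512 already needs roughly 31 FLOPs of reuse per byte streamed from DRAM to stay busy; any AMX multiplier on top of that only raises the required reuse, which is exactly why the data has to live in the T registers/L1 for the 10x to show up.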
  • mode_13h - Friday, July 2, 2021 - link

    > In commercial AI the matrices and data sets are
    > big (>>larger than cache) -> the bottleneck is not the FPU capabilities
    > of the CPU (AVX-512 or AMX) but the memory bandwidth/latency.

    A close reading of AMX seems to suggest it's as much a data-movement optimization as a compute optimization. It seems ideally-suited to accelerate small, non-separable 2D convolutions. The data movement involved makes that somewhat painful to optimize with instructions like AVX-512.

    From what I've seen, about the only things it currently does are load tiles and compute dot-products (at 8-bit and BFloat16 precision).

    > The BF16 format alone allows a 2x improvement over FP32 (16 bits vs 32 bits so
    > 2 elements instead of 1 with the same data bandwidth).

    Since Ivy Bridge, they've had instructions to do fp16 <-> fp32 conversion. So, although I know fp16 isn't as good as BF16, you could already do a memory bandwidth optimization by storing your data in fp16 and only converting it to fp32 after loading it.
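
    A minimal sketch of that trick with the F16C conversion intrinsics (illustrative only: the function and array names are made up, it assumes n is a multiple of 8, and the FMA itself needs Haswell or later; compile with -mf16c -mfma):

        #include <immintrin.h>
        #include <stdint.h>
        #include <stddef.h>

        /* Dot product over fp16-stored inputs: stream half-precision data from
           memory, widen to fp32 in registers, accumulate in fp32. Halves the
           DRAM traffic versus keeping the inputs in fp32. */
        float dot_fp16(const uint16_t *a, const uint16_t *b, size_t n) {
            __m256 acc = _mm256_setzero_ps();
            for (size_t i = 0; i < n; i += 8) {
                __m256 va = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(a + i)));
                __m256 vb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(b + i)));
                acc = _mm256_fmadd_ps(va, vb, acc);  /* fp32 math on fp16-stored data */
            }
            float t[8];
            _mm256_storeu_ps(t, acc);
            return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
        }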

    > If the matrices are significantly bigger than that, then you end up with memory
    > bandwidth being the bottle neck again... there is only so much cache and
    > register files can only hide.

    Yes, but... a number of the AI accelerators I'm seeing coming to market don't have a whole lot more on-die SRAM than modern server CPUs. And some don't even have more memory bandwidth, either. That tells me they're probably doing some clever data movement optimizations around batching, where they leave parts of the model on chip and run the input data through, before swapping in the next parts of the model.
  • SystemsBuilder - Saturday, July 3, 2021 - link

    Some points on AMX/AVX-512 and Neural Nets – and sorry for the long post...
    1. BF16 is not FP16 (as you said) and at least in my experience FP16 just isn't good enough in training scenarios. BF16 is a good compromise between precision and bandwidth and will do just fine - BF16 was created for AI because FP16 is not good enough (in training)... so streaming FP16 and then up-converting to FP32 is not a good enough path... I think we are saying the same thing here.
    2. AMX will make BF16 matrix multiplications on Sapphire Rapids much easier compared to Cooper Lake, since we would only need one instruction to multiply 2 matrix tiles (plus of course the initial tile configuration, tile loads and final store). So yes, it will be easier to program and faster when executed compared to AVX-512 for BF16 matrix multiplications for those reasons alone (assuming AMX has competitive cycle counts relative to AVX-512). Having said that, AVX-512 tile x tile multiplications are not super hard either (although in FP32, so half speed). It basically involves repetitive use of the read-ahead (prefetcht0) and vfmadd231ps instructions (apologies for the low-level technical detail, but it's key) - a minimal sketch of such a kernel is at the end of this post. AVX-512 dgemm implementations use the same approach and can get close to fully saturating the memory bandwidth and close to the theoretical max FP32 FLOPS (for CPUs with 2 AVX-512 units per core). So how much can AMX improve on this (beyond the BF16 compression advantage)?
    3. Matrix multiplication requires at least O(i*j) + O(j*k) reads, O(i*k) writes and O(i*j*k) FMAs (multiply-adds), where O is big-O notation:
    So if i, j, k (the dimensions of the matrices) are large enough (>> the cache dimensions), it mathematically does not matter how clever you are with caching or how fast you can compute, etc. Mathematically, you still need to read every single matrix element from RAM at least once, and that (the RAM bandwidth) caps the throughput you can get. Best case: all data is in cache and the FPU units are never idle (= all matrices are in cache when needed), and that's not happening anytime soon. My point is that RAM bandwidth is the fundamental bottleneck, and AMX's matrix FLOPS advantage over Cooper Lake needs a corresponding bump in streaming bandwidth - maybe the DSA "thing" will do just that… if not, Intel will just have upgraded the matrix "engine" without being able to feed it fast enough…
    4. Back to AI: without getting too technical (and I'm sure I have overstepped many times already): neural nets (whatever flavor you are using) require tons of matrix multiplications of sometimes very large matrices (>> cache) during training, which, additionally, are algorithmically serially dependent on each other (e.g. in the feed-forward and backward-prop case you need to finish the calculation of one layer before you can move on to the next, saving and reading the intermediates in between). Therefore it is fundamentally a data-flow and bandwidth constrained problem, not an FPU constrained problem. Batching/mini-batching of the training data just brings down the size of each iteration, but you still need to work through the entire training set and then do it all over again and again, hundreds of times, until convergence. Net net: the nature of the AI algorithms makes it inherently difficult to break through the RAM speed limit.
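
    For reference, a minimal sketch of the AVX-512 FP32 inner kernel mentioned in point 2 (illustrative only: one 16-wide strip of C updated with vfmadd231ps via intrinsics; blocking, edge handling and prefetch-distance tuning are omitted, and the function name is made up):

        #include <immintrin.h>
        #include <stddef.h>

        /* C(i, 0..15) += sum_k A(i, k) * B(k, 0..15) for one row i of C.
           _mm512_fmadd_ps compiles to vfmadd231ps; _mm_prefetch issues
           prefetcht0 for rows of B a few iterations ahead. */
        void avx512_row_kernel(float *Crow, const float *Arow, const float *B,
                               size_t K, size_t ldb) {
            __m512 c = _mm512_loadu_ps(Crow);              /* 16 FP32 accumulators */
            for (size_t k = 0; k < K; ++k) {
                _mm_prefetch((const char *)(B + (k + 4) * ldb), _MM_HINT_T0);
                __m512 b = _mm512_loadu_ps(B + k * ldb);   /* B(k, 0..15)          */
                __m512 a = _mm512_set1_ps(Arow[k]);        /* broadcast A(i, k)    */
                c = _mm512_fmadd_ps(a, b, c);
            }
            _mm512_storeu_ps(Crow, c);
        }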
  • mode_13h - Sunday, July 4, 2021 - link

    > in my experience FP16 just isn't good enough in training scenarios.

    People say BF16 converges faster, but Nvidia did pretty well with FP16 on the P100, V100, and Turing. They didn't get around to adding BF16 until Ampere.

    > 2. AMX will make BF16 matrix multiplications on Sapphirelake much easier compared
    > to Cooperlake since we would only need one instruction to multiply 2 matrix tiles

    From what I've seen, it supports only dot-product, not proper matrix-multiply. They also haven't specified the throughput or latency. Don't assume you can issue one of those every cycle.

    > Mathematically, you still need to read every single matrix element from RAM at least once

    But you can hold one of the matrices on-chip (i.e. the model, which doesn't change) and just read the data that's being forward propagated. Because models are too big to fit entirely on-chip, you break them into pieces and run a batch of data through, before loading and applying the next piece.

    The reason this helps is that models are typically much bigger than the data being propagated through them.

    And as for AMX and convolutions, it can just load the new elements for each window position. Therefore, it potentially has a much lighter impact on the cache hierarchy.

    > Therefore it is fundamentally a data flow and bandwidth constrained problem not a
    > FPU constrained problem

    Nvidia's P100 had 720 GB/s of memory bandwidth to support 21.2 fp16 TFLOPS. The V100 increased memory bandwidth to 900 GB/s to support 125 tensor fp16 TFLOPS, and you claim that was a waste???
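
    Put as ratios (just restating those numbers; the rounding is mine):

        \frac{21.2\ \mathrm{TFLOPS}}{720\ \mathrm{GB/s}} \approx 29\ \mathrm{FLOP/byte}
        \qquad
        \frac{125\ \mathrm{TFLOPS}}{900\ \mathrm{GB/s}} \approx 139\ \mathrm{FLOP/byte}

    In other words, Nvidia sizes its memory system expecting well-blocked GEMMs to reuse each byte fetched from HBM dozens to hundreds of times.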

    Given these numbers, an Intel Xeon should be nowhere close to bandwidth-constrained if you're smart about your data movement.

    As I said, a lot of AI chips have just DDR4 memory, because they're smarter about using their on-die SRAM to avoid excessive data movement.
  • SystemsBuilder - Tuesday, July 6, 2021 - link

    I’m probably not very clear so I’ll try again, being even more technical and hopefully more precise this time – sorry for the long post.
    1. Intel manual (https://software.intel.com/content/www/us/en/devel... page 3-17): the TMUL multiplication instruction TDPBF16PS is named a dot product, BUT in the context of matrix tiles that means a tile x tile matrix multiplication, a tile being a sub-matrix. In other words, TDPBF16PS multiplies 2 tile matrices of BF16 floats together and accumulates the result into a third tile - see the detailed pseudo code on page 3-18, and note in particular the 3-level nested for loop. Like the standard dgemm algorithm for multiplying 2 matrices, the full matrix multiplication is a series of tile x tile multiplications (in the literature the tiles are sometimes called block matrices), going over rows and columns of tiles in the standard way until all tiles have been multiplied. The difference from AVX-512 FP32 (besides format) is that TDPBF16PS does a full tile x tile multiplication in 1 instruction, while we need about 16 * 16 = 256 vfmadd231ps instructions to achieve a tile x tile multiplication producing a 16x16 FP32 tile (ignoring the obvious size advantage BF16 has over FP32). As both you and I pointed out, what remains to be seen is the TDPBF16PS clock-cycle throughput and latency. vfmadd231ps is about 4-7 cycles (depending on port 0 or 5 etc.) and you can cut that in half on a CPU with 2 FMAs (high-end Xeons have 2 AVX-512 FMA units). We'll just have to wait for the TDPBF16PS throughput and latency numbers to do a final comparison. (A minimal intrinsics sketch of a single tile multiply is at the end of this post.)

    But this is just one matrix multiplication in isolation… a neural net requires tons of serially dependent ones, and that is the crux.

    2. The mathematical nature of neural net training (not just inference, which is much simpler) - example: a basic 3-layer neural net, applying standard forward and backward propagation. Restraining myself a bit here (leaving out the activation functions, activation-function derivatives, Grad, matrix Hadamard products, adds, subs, scalar mults and other stuff, and focusing on the core of the compute complexity: the matrix multiplications themselves).
    a) A forward pass of one sample batch matrix X0 requires W1*X0 -> W2*X1 -> W3*X2 -> X3: sequential and completely serially dependent matrix multiplications (X are the signal matrices, W are the weight matrices, the number is the layer the matrix belongs to, and -> is the mathematical dependency between two ops). In addition to this you need to calculate and store the activation-function and activation-derivative matrices at each layer. This means you need to stream X0 (a mini batch) through the compute engine and calculate and store approximately 11-13 matrices for each forward-propagation pass, and, as mentioned before, the L1$/L2$ cache is not big enough to keep all of them in cache permanently.
    b) Further, when you are finally done with forward propagation you need to take the final layer output (ignoring Grad etc. again for simplicity) and run the result backwards to calculate the errors (D): D3W3^T -> D2 -> D2W2^T -> D1, and lastly update the weights (W): D3X3^T -> W3, D2X2^T -> W2, D1X1^T -> W1. These are also serially dependent and updated every backward-propagation iteration (one layer's result is mathematically dependent on the previous layer's result); in total about another 10-12 matrices are computed and stored.
    c) Lastly, you now have to repeat this for the next X0 mini batch until you have exhausted all samples in the training data set, AND then you have to run that whole loop again hundreds of times (i.e. all the epochs).
    So the so-called "training model" (activations + their derivatives, W, errors, and their deltas) is about 20-24+ matrices for a 3-layer network (depending on the exact flavor of algorithm and optimization used), freshly calculated and updated every iteration based on the next mini batch X0, and can in total take up tens of MB up to GBs+ - just for the model. And you need them all in each iteration, and no one matrix is less important than another, because they are serially dependent on each other.
    My point with this is that math is math. It doesn't matter how smart you are: unless you change the forward/backward propagation and stochastic gradient descent algorithm (or some variation thereof - there are many…) to something fundamentally different, you are stuck with these serial matrix dependencies AND the memory bandwidth bottleneck that comes with them. (The core update chain is written out at the end of this post.)
    At the macro level the best you can hope for is to keep all the activation X, W, error etc. matrices (plus potentially/ideally? "helper" matrices, e.g. delta W/error) in cache, since they are updated and read so frequently, but X0 changes with every mini batch in the inner loop so… Also, the W and delta-W matrices tend to be the largest in the entire net of matrices, so there is that going on too…

    3. We started this as a comparison between Intel generations: Ice Lake/Cooper Lake vs Sapphire Rapids. Comparing Nvidia's highly parallel and specialized SIMD architecture (with altogether different caching) against Intel's general-purpose CPUs is a very different discussion - and a very long one - so I'll stick to the original scope.

    4. There is some news, though, about Sapphire Rapids that may help explain how Intel has addressed the memory bandwidth challenge to try to keep up with the TMUL: https://videocardz.com/newz/intel-sapphire-rapids-... Sapphire Rapids L1$: 48 KB/core, L2$: 2 MB/core, shared L3$: 3.75 MB/core (75 MB for 20 cores - this is a big upgrade). That plus the HBM upgrade will help, but what are the latency and bandwidth going to be?
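
    The single tile multiply from point 1, sketched with the documented AMX intrinsics (a minimal sketch only: it assumes a compiler with -mamx-tile -mamx-bf16 and an OS that has enabled the AMX tile state, the struct layout follows the tile-config format in the same manual, and the B tile is assumed to already be in the pair-interleaved layout TDPBF16PS expects):

        #include <immintrin.h>
        #include <stdint.h>
        #include <string.h>

        /* 64-byte tile configuration block, per the AMX palette-1 format. */
        typedef struct {
            uint8_t  palette_id;
            uint8_t  start_row;
            uint8_t  reserved[14];
            uint16_t colsb[16];   /* bytes per row for each tile register */
            uint8_t  rows[16];    /* rows for each tile register          */
        } tilecfg_t;

        /* C (16x16 FP32) += A (16x32 BF16) * B (pair-interleaved BF16). */
        void amx_tile_madd(float *C, const uint16_t *A, const uint16_t *B) {
            tilecfg_t cfg;
            memset(&cfg, 0, sizeof(cfg));
            cfg.palette_id = 1;
            cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: C accumulator */
            cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: A tile        */
            cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: B tile        */
            _tile_loadconfig(&cfg);

            _tile_loadd(0, C, 64);
            _tile_loadd(1, A, 64);
            _tile_loadd(2, B, 64);
            _tile_dpbf16ps(0, 1, 2);              /* TDPBF16PS: tmm0 += tmm1 * tmm2 */
            _tile_stored(0, C, 64);
            _tile_release();
        }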
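
    And the update chain from point 2, written out in standard notation (just the textbook 3-layer SGD step for a squared-error loss; η is the learning rate, a the activation, ⊙ the Hadamard product, and the indexing convention is mine, so it may differ slightly from the post above):

        \begin{aligned}
        \text{forward:}\quad  & Z_\ell = W_\ell X_{\ell-1},\qquad X_\ell = a(Z_\ell), \qquad \ell = 1,2,3\\
        \text{backward:}\quad & D_3 = (X_3 - Y)\odot a'(Z_3),\qquad D_{\ell-1} = (W_\ell^{T} D_\ell)\odot a'(Z_{\ell-1})\\
        \text{update:}\quad   & W_\ell \leftarrow W_\ell - \eta\, D_\ell X_{\ell-1}^{T}
        \end{aligned}

    Every product on the right-hand side is one of the large, serially dependent matrix multiplications described above.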
  • mode_13h - Wednesday, July 7, 2021 - link

    Thanks for the link to the Intel docs, but it doesn't work for me. I hope you're right that it can do more than just dot-product.

    As for your example, you focused only on training and skipped convolution. When inferencing with convolution layers, the output is much smaller than the weights. So, it's possible to apply a chunk of the weights to a portion of a batch of input data, save the intermediates, and repeat. This saves you from having to reload all the weights for every inference sample. GPUs continue to gain benefits from extremely large batch sizes, up to 512 images, in some cases.
  • SystemsBuilder - Wednesday, July 7, 2021 - link

    Here is the link to the "Intel Architecture Instruction Set Extensions Programming Reference", which covers tiles and tile multiplication in chapter 3; my references to TDPBF16PS above are pages 3-17 and 3-18: https://software.intel.com/content/www/us/en/devel... If it does not work, just search for "Intel Architecture Instruction Set Extensions Programming Reference" or use the link posted in the main article above.

    Yes, as I wrote earlier, inference is much easier, since you only need the fully trained W matrices, which should be stable and should be possible to keep in cache - BUT that is just inference after training is done. Mathematically, inference is basically a subset of training with just one single forward-propagation pass, a(W1*X0) -> a(W2*X1) -> a(W3*X2) -> X3, per mini batch (if you infer in mini-batch format), just calculating the activation functions (a) at each layer, without calculating activation derivatives, Grads, etc., without the entire backward pass, and with no iterations. The hard, time-consuming work is training! Orders of magnitude more complex than inference.
    In inference you can even go down in resolution from BF16 to byte and change the activation functions to something much simpler, almost binary 1/0 across the board. And AMX has a byte tile multiplication, TDPBUUD, for exactly that! You can define tiles with a signed/unsigned byte data type, so now each signal is just a byte long, compressing the format even further from 16 bits to 8 bits. Inference with a byte data type should get a 2x speed-up from the data format compression alone.

    Convolutional neural networks (CNNs) are a subclass of the general deep neural nets I talked about in my 3-layer example, so what I wrote above about training deep neural nets broadly applies to them too. CNNs have very interesting properties that make them very attractive in many specific AI applications - image recognition etc.
  • mode_13h - Thursday, July 8, 2021 - link

    Thanks for the updated link. That works. I'll take a little time to update my knowledge of AMX, before posting more about it.

    > The hard time consuming work is training!

    Yeah, but once you've trained a model, you can inference with it billions of times. Many accelerators specialize on either training or inference. A device needn't be equally competent at both, and inference is the bigger market (also the easier problem).
  • JayNor - Tuesday, June 29, 2021 - link

    "Intel is keen to point out that some customers will use DSA, some will use Intel’s new Infrastructure Processing Unit, while some will use both"

    The IPU might contain its own DSA, right? My understanding is that the current generation of IPUs would be expected to have Ethernet I/O and PCIe I/O, plus some ASIC or FPGA to accelerate encode, decode, encryption...

    Will DSA on SPR be doing the CXL coherency DMA transfers?

    Will DSA DMA be used for all the GPU to GPU transfers on the Aurora nodes?
  • mode_13h - Friday, July 2, 2021 - link

    > Will DSA DMA be used for all the GPU to GPU transfers on the Aurora nodes?

    It seems that all of the GPUs are directly connected. So, probably not, unless the GPUs have their own built-in DSAs.
  • JayNor - Friday, July 2, 2021 - link

    The GPU coherency is maintained by the host processor's home agent, so I suspect all the GPU-to-GPU data movements on a single node will need to go through the host.
  • zamroni - Wednesday, June 30, 2021 - link

    Will it be epycly thread-ripped again by Ryzen AMD?

    Just like AVX-512, DL Boost is effectively useless for servers, because corporate customers who are serious about AI will use GPUs or TPUs, which are many times more powerful.
    Intel should simply use the transistor budget for caches.
  • EthiaW - Wednesday, June 30, 2021 - link

    Will Intel launch an independent DPU as Nvidia did? They already have the DSA.
  • JayNor - Friday, July 2, 2021 - link

    Yes, they announced some new IPUs.
    https://www.intel.com/content/www/us/en/newsroom/n...
  • JayNor - Friday, July 2, 2021 - link

    Jeff Watters' ISC 2021 presentation contains some news about the Crow Pass Optane that is supported by Sapphire Rapids. The news is that it will run at DDR5 speeds. I saw someone noting that Intel didn't mention Crow Pass in some slide presentation, but now it's been mentioned.
  • mode_13h - Saturday, July 3, 2021 - link

    Sounds intriguing. I hope Intel keeps enthusiasts in mind and doesn't make Optane a server-only technology/product.
  • dsplover - Tuesday, July 13, 2021 - link

    Nobody cares about 2022 when 2020/2021 was so weak. AMD will steal the APU market in a few weeks; Tiger Lake and Rocket Lake were hardly the sign of a sleeping giant that got stomped on.

    I wish they had an answer to Zen 3. Zen 4 will be the knockout punch for Intel.

    I think they’ll be back in the game in 2025.
