Original Link: https://www.anandtech.com/show/15992/hot-chips-2020-live-blog-nvidia-a100-performance-500pm-pt
Hot Chips 2020 Live Blog: NVIDIA A100 Performance (5:00pm PT)
by Dr. Ian Cutress on August 17, 2020 7:50 PM EST07:58PM EDT - First talk of the GPU session is from NVIDIA, on the A100 performance and the Ampere architecture
08:00PM EDT - Open question if they'll talk about Ampere for environments other than HPC, but this session is also about 'Gaming', so you never know
08:02PM EDT - Intel's John Sell, ex-Microsoft, is the chair for the session
08:03PM EDT - Jack Choquette from NV
08:03PM EDT - A100: 54-56B transistors
08:03PM EDT - 6912 CUDA Cores
08:04PM EDT - 1.6 TB/sec HBM2 bandwidth
08:04PM EDT - Elastic GPU, scale out with 3rd Gen NVLink
08:04PM EDT - 2x-7x improvements over V100 overall
08:05PM EDT - Next-Gen Tensor Core
08:05PM EDT - More efficient, improves perf with sparsity
08:05PM EDT - Increased L1, async data movement
08:05PM EDT - Performance uplift against V100
08:06PM EDT - IEEE for FP64 MatMul
08:06PM EDT - Records on MLPerf with A100 Pods
08:06PM EDT - A100 dominates in per-chip performance as well
08:06PM EDT - Even wins against unreleased chips
08:07PM EDT - Strong scaling
08:07PM EDT - DL strong scaling
08:08PM EDT - each layer is parallelised - A100 is 2.5x for dense FP16
08:08PM EDT - A100 targeted strong scaling
08:08PM EDT - fixed size networks
08:09PM EDT - Tensor core supports more data types
08:09PM EDT - FP32 now uses TF32 OPs, supports 20x improvement for sparse data
08:10PM EDT - Increase A100 data bandwidth increases based on algorithm requirements
08:10PM EDT - 6K bytes per clock per SM for sparse
08:10PM EDT - Improved speeds and feeds, and efficiency
08:11PM EDT - A100 uses 32-thread tensor cores to reduce instructions required
08:11PM EDT - New load-global-store-shared copy bypassing the register file
08:11PM EDT - 3x in L1 BW, 2x inflight cap
08:11PM EDT - 2x efficiency
08:12PM EDT - Continually stream data improving utilization
08:13PM EDT - A100 couldn't just scale up V100 - L2 memory bandwidth wouldn't keep up
08:13PM EDT - New split L2 with crossbar - 2.3x increase in BW over V100
08:14PM EDT - Keep data near to the SM, reducing latency to L2
08:14PM EDT - Data-reuse is vital to keep utilization high
08:14PM EDT - 11x FP16 over V100 due to increased buffer support
08:14PM EDT - Larger and smarter L2
08:14PM EDT - L2 supports persistent data
08:15PM EDT - Global Memory atomics
08:15PM EDT - buffer partitioning
08:15PM EDT - DRAM BW improvements, 1.7x over V100
08:15PM EDT - ECC SECDED
08:15PM EDT - 1.25x capacity over V100 at 40 GB HBM2
08:16PM EDT - Compute data compression within L2 cache - gives capacity savings. Saves BW, even if overall perf isn't increased
08:17PM EDT - 600 GB/s NVLink, 12x 25GB links per GPU
08:17PM EDT - 2-3x in raw AI improvements in perf
08:17PM EDT - Now elastic GPU for scale-up and scale-out
08:18PM EDT - This is the DGX100 - 8 A100x and 2 Rome CPUs
08:18PM EDT - NVSwitch and PEX Switches with 200G NICs
08:18PM EDT - Scalable across multiple DGX systems
08:19PM EDT - Superpod is leaf-and-spine, for 1K GPU Pod (20 DGX100s)
08:19PM EDT - 140* DGX 100 nodes
08:19PM EDT - Separate network for storage
08:19PM EDT - modularity design allows for power distribution
08:19PM EDT - New Multi-instance GPU
08:20PM EDT - Allows 7 GPU instances on a single A100 for better utilization
08:20PM EDT - Ian: 'Why use 25 T4 GPUs when one A100 will do?'
08:20PM EDT - Allows multiple clients per GPU in training or inference
08:21PM EDT - Full QoS support in mutli-user environment
08:21PM EDT - Async programming
08:23PM EDT - Programming models help remove complexity from developers and increase ease of use as well as performance
08:23PM EDT - You want to always be copying data and using data simultaneously
08:23PM EDT - But programming is inherently linear and synchronous
08:24PM EDT - New Async algorithms
08:24PM EDT - ISO C++20 barrier
08:24PM EDT - non-blocking barrier
08:25PM EDT - async data movement
08:26PM EDT - split and arrive
08:26PM EDT - difficulty is sustaining the throughput
08:27PM EDT - creating double buffer data and compute pipeline
08:27PM EDT - Vast library of kernels already optimized for A100
08:29PM EDT - Q&A time
08:30PM EDT - Q: What voltage does A100 use? A: Not sure what shipping voltage is, might be in the whitepaper
08:31PM EDT - Q: Split L2 with crossbar - does this introduce latency variation? A: It doesn't, but it does equalizes of all the split areas due to data locality. L2 hits ends up with less latency variation
08:32PM EDT - Q: What improvements does the cache bring? A: Application dependent - some large, some small. The BW helps as it is a lot higher on a L2 hit, and latency is also decreased due to the design. We have benchmarks that shows both, A100 vs V100
08:32PM EDT - A: module is 48V from earlier
08:33PM EDT - Q: Would a large L3 help? A: We have, it's a trade off between capacity and the BW from DRAM. We have found it more beneficial to have a larger L2 and then the remaining requests to go to DRAM. L3 adds another layer of hierarchy - if it didn't get enough hits it would consume abnormally more power
08:34PM EDT - Q: Where does the compression happen? What type? A: It happens at the L2 - the data gets written to L2, and as evictions happen, we recompress back into the L2. We have multiple types of compression for different types of data, such as dense vs sparse techniques.
08:35PM EDT - Q: NVLink CPU to GPU? A: The systems we have announced are what they are. Not going to comment on anything that might be possible in future.
08:36PM EDT - Q: Advantage of TF32 vs BF16? A: TF32 operates on FP32 and take FP32 inputs processing with almost no accuracy loss. Better than just taking FP32 and trucating to FP16. It assists TF32 to FP16 conversion due to matissa lengths.
08:36PM EDT - End of talk - next talk is Intel Xe!