Hot Chips 2018: AMD APU Optimization Live Blog (Noon PT, 7pm UTC)

Original Link: https://www.anandtech.com/show/13243/hot-chips-2018-amd-on-raven-ridge-optimizations

by Ian Cutress on August 20, 2018 2:55 PM EST

03:01PM EDT - AMD is also at Hot Chips, speaking about Raven Ridge and its APUs. The key elements to this talk will be the optimizations made for Raven Ridge, specifically around power and data management. We'll be live blogging the talk for everyone to follow.

03:02PM EDT - Working on APUs for several years

03:02PM EDT - Raven Ridge allowed the users to experience an uplift in performance

03:03PM EDT - Focusing on power and power efficiency

03:03PM EDT - GPUs have an appetite for memory - getting efficiency is important, as is power

03:03PM EDT - Battery life is a key consideration too

03:03PM EDT - 4C/8T, 11 Vega CUs

03:04PM EDT - Almost everything is new - upgraded display engine, new audio coprocessor, new IO subsystem, USB 3.1 and USB-C native

03:05PM EDT - Went big on the GPU

03:05PM EDT - Up 60% transistors from Bristol

03:05PM EDT - But 16% smaller based on process node
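
Taken together, the two figures above imply a sizeable density gain. A minimal sketch of the arithmetic, assuming both numbers are relative to Bristol Ridge and that "smaller" refers to die area (the function name is ours, not AMD's):

```python
# Assumption: "up 60% transistors" and "16% smaller" are both relative to
# Bristol Ridge, and "smaller" refers to die area.
def density_gain(transistor_ratio: float, area_ratio: float) -> float:
    """Return the relative change in transistors per unit area."""
    return transistor_ratio / area_ratio

# 1.60x the transistors in 0.84x the area
gain = density_gain(transistor_ratio=1.60, area_ratio=0.84)
print(f"Implied density gain: {gain:.2f}x")
```

Under those assumptions, Raven Ridge packs roughly 1.9x as many transistors per unit area as Bristol Ridge.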

03:05PM EDT - BGA package

03:06PM EDT - Also in desktop

03:06PM EDT - Dedicated L2 cache in the GPU

03:06PM EDT - Flexible geometry engine

03:06PM EDT - 16 pixel units

03:06PM EDT - two render back-ends

03:06PM EDT - 1200 MTri/sec at 1200 MHz

03:08PM EDT - Smaller L3 cache than desktop

03:08PM EDT - Precision Boost 2 for dynamic core frequency

03:09PM EDT - GPU workloads tend to go through render and physics phases

03:09PM EDT - Can adjust power based on where it is needed

03:09PM EDT - Fine grained p-states

03:09PM EDT - On die power regulation

03:09PM EDT - Can exploit all the power

03:10PM EDT - Heart of the chip is the fabric

03:10PM EDT - One coherent protocol - Infinity Fabric

03:10PM EDT - Manages the full SoC

03:10PM EDT - Overall power budget and management

03:10PM EDT - Enhanced flow for powering on and off components

03:11PM EDT - Knew in 2013/2014 that AMD was going to do a major reset of the CPU portfolio

03:11PM EDT - IF was designed to scale from Server to high-end graphics into smaller mobile SoCs and desktop

03:11PM EDT - IF is a Transport Layer

03:12PM EDT - Scalable Data Ports and SDP interface modules to the transport layer

03:12PM EDT - SDP hides complexities of coherence protocol from connected IP

03:12PM EDT - CPU, GPU, Memory, IO, all use an SDPIM into the transport layer

03:13PM EDT - Transport Layer Switches are crossbars

03:13PM EDT - Coherent ports in region A/B

03:13PM EDT - Structured for multi-region power gating

03:13PM EDT - Up to 5 transfers per clock per switch

03:14PM EDT - Turn off bits of the fabric that are not needed to save power

03:14PM EDT - Transport request queues at each entry point into the fabric

03:14PM EDT - Hard Real Time, Soft Real Time, and Non-Real Time

03:15PM EDT - Within this, multiple virtual queues and channels

03:15PM EDT - Also priority classes end-to-end

03:15PM EDT - Can escalate through the entire fabric
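
The queueing scheme described above (three traffic classes per fabric entry point, with priority escalation) can be sketched as a strict-priority arbiter. This is an illustration only: the class names, the `escalate` method, and the strict-priority policy are assumptions, not AMD's actual arbitration logic.

```python
from collections import deque
from enum import IntEnum

class TrafficClass(IntEnum):
    """Traffic classes named in the talk; lower value = higher priority."""
    HARD_REAL_TIME = 0
    SOFT_REAL_TIME = 1
    NON_REAL_TIME = 2

class EntryPointQueues:
    """Hypothetical model of the request queues at one fabric entry point."""

    def __init__(self):
        # One FIFO per traffic class.
        self.queues = {cls: deque() for cls in TrafficClass}

    def push(self, request, cls):
        self.queues[cls].append(request)

    def escalate(self, request, new_cls):
        """Move a pending request to another class; return True if found."""
        for q in self.queues.values():
            if request in q:
                q.remove(request)
                self.queues[new_cls].append(request)
                return True
        return False

    def pop(self):
        """Serve the oldest request from the highest-priority non-empty queue."""
        for cls in sorted(TrafficClass):
            if self.queues[cls]:
                return self.queues[cls].popleft()
        return None
```

In this model a display fetch tagged `HARD_REAL_TIME` is always served before a CPU read tagged `NON_REAL_TIME`, unless the CPU read is escalated first.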

03:15PM EDT - Need to improve mem bandwidth for GPU

03:15PM EDT - Bigger CPU and GPU caches helped a lot

03:16PM EDT - Caching algorithms and lossless compression (DCC) help

03:16PM EDT - Direct reads of compressed memory

03:16PM EDT - Shadow of Mordor seems more memory active

03:17PM EDT - Direct comparison of Bristol Ridge and Raven Ridge

03:17PM EDT - Draw Stream Binning Rasterizer improves bandwidth

03:18PM EDT - Increase display engines - reset the engine and set a target to do a 4K display at Vmin (lowest voltage of process)

03:18PM EDT - Four pipes and flexibility to combine pipes or act independently

03:18PM EDT - 4x throughput over Bristol

03:18PM EDT - Keeping up with codec improvements

03:19PM EDT - Rather than 3 rails (CPU, GPU, IO), now have one

03:19PM EDT - 3 rail methods meant overprovisioning required - very inefficient. Now can be very efficient
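
To see why separate rails force overprovisioning, compare the capacity each approach needs. The wattages below are made-up illustrative numbers, not AMD figures, and the two helper functions are hypothetical:

```python
def separate_rails_capacity(phases):
    """Each rail must be sized for its own worst case across all phases."""
    domains = phases[0].keys()
    return sum(max(phase[d] for phase in phases) for d in domains)

def shared_rail_capacity(phases):
    """One rail only needs to cover the worst combined draw in any phase."""
    return max(sum(phase.values()) for phase in phases)

# Hypothetical per-domain power draw (watts) in two workload phases.
phases = [
    {"cpu": 30, "gpu": 10, "io": 5},  # CPU-heavy phase
    {"cpu": 10, "gpu": 35, "io": 5},  # GPU-heavy phase
]

separate = separate_rails_capacity(phases)
shared = shared_rail_capacity(phases)
print(f"separate rails: {separate} W, shared rail: {shared} W")
```

Because the CPU and GPU rarely peak at the same moment, the shared rail in this toy example needs less total capacity than the three separate rails combined.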

03:19PM EDT - In-chip, rather than do FIVR, do individual LDO in each CPU core and graphics

03:19PM EDT - Allows more fine grained power delivery

03:20PM EDT - Allows for better control

03:20PM EDT - From Bristol Ridge, higher boost current but smaller regulator, overall lower ICCMax for CPU+GPU

03:21PM EDT - Each core has a regulator, so can control power states and power down each core when idle

03:21PM EDT - Fast graphics off and power down when needed

03:21PM EDT - New power modes

03:22PM EDT - Managing a lot of skin temps through STAPM

03:22PM EDT - Skin Temperature Aware Power Management

03:22PM EDT - Previous gen didn't exploit thermal headroom

03:22PM EDT - Now can go into thermal budgets and maximise performance
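
The idea of spending thermal headroom can be sketched with a toy controller. Everything here is an assumption for illustration: it uses a moving average of package power as a stand-in for skin temperature, and the limits, time constant, and function names are hypothetical rather than AMD's STAPM implementation.

```python
def update_average(avg_power_w, sample_w, dt_s, time_constant_s):
    """Exponential moving average of power, used as a skin-temperature proxy."""
    alpha = dt_s / (time_constant_s + dt_s)
    return avg_power_w + alpha * (sample_w - avg_power_w)

def power_limit(avg_power_w, sustained_limit_w, boost_limit_w):
    """Allow boost while the averaged power is below the sustained limit."""
    if avg_power_w < sustained_limit_w:
        return boost_limit_w
    return sustained_limit_w

def simulate(steps, start_avg_w=5.0, sustained_limit_w=15.0,
             boost_limit_w=25.0, dt_s=1.0, time_constant_s=100.0):
    """Run a fully loaded chip and record the power limit applied each step."""
    avg = start_avg_w
    limits = []
    for _ in range(steps):
        limit = power_limit(avg, sustained_limit_w, boost_limit_w)
        limits.append(limit)
        # Assume the workload draws exactly the allowed limit.
        avg = update_average(avg, limit, dt_s, time_constant_s)
    return limits

limits = simulate(steps=200)
print(f"first limit: {limits[0]} W, last limit: {limits[-1]} W")
```

In this sketch the chip starts cool, runs at the boost limit until the averaged power reaches the sustained limit, then settles at the sustained limit.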

03:23PM EDT - Performance goes up

03:23PM EDT - Now 25x20 goal

03:24PM EDT - Above progress goal

03:24PM EDT - New systems will come out in the future to meet the target
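
For context, 25x20 is AMD's goal of a 25x improvement in mobile APU energy efficiency by 2020, measured against a 2014 baseline. A quick sketch of the compound annual gain that target implies (the function name is ours):

```python
def required_annual_gain(total_gain: float, years: int) -> float:
    """Return the constant yearly multiplier that compounds to total_gain."""
    return total_gain ** (1.0 / years)

# 25x over the six years from 2014 to 2020
rate = required_annual_gain(25.0, 2020 - 2014)
print(f"Required gain: {rate:.2f}x per year")
```

That works out to roughly 1.7x per year, sustained for six years.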

03:25PM EDT - Q&A Time

03:27PM EDT - Q: Talk coherency and master/slave. A: The masters are participating in the protocol, slaves are the memory controllers. Manage all coherent / non coherent traffic

03:28PM EDT - Q: Are the display controllers coherent? A: The frame buffer is considered non-coherent memory

03:29PM EDT - Q: Moore's Law is painful, but can you stay on course for 25x20? A: We are seeing improvements, looking forward to 7nm. Looking beyond that. Challenges today are more like IO scaling. If we start looking at IO bottlenecks, such as memory and memory throughput, those are our bigger challenges. Also thermal management, thermal density, trying to tune every last bit in the device.

03:29PM EDT - That's a wrap. Next Live Blog at 4pm PT.
