Original Link: https://www.anandtech.com/show/14775/hot-chips-31-live-blogs-microsoft-hololens-20-silicon
Hot Chips 31 Live Blogs: Microsoft Hololens 2.0 Silicon
by Dr. Ian Cutress on August 20, 2019 9:10 PM EST- Posted in
- CPUs
- Microsoft
- GPUs
- SoCs
- VR
- Live Blog
- AR
- Hololens 2
- Hot Chips 31
09:03PM EDT - The final presentation of Hot Chips 31 is from Microsoft, who will be lifting the lid of the silicon behind its HoloLens 2.0 product.
09:17PM EDT - Here we go
09:17PM EDT - HPU 2.0
09:18PM EDT - Holographic processor
09:18PM EDT - Custom silicon, obviously
09:19PM EDT - This speaker has been trained. There are purposeful pauses when she lists stuff
09:20PM EDT - Application processor runs the app, and the HPU modifies the rendered image and sends to the display
09:20PM EDT - HPU works on specific workloads
09:21PM EDT - Takes the visual cues and allows the HPU to track where the hands are at all times
09:21PM EDT - 79mm2 on TSMC 16FF+
09:21PM EDT - 123M gates, 2B transistors
09:22PM EDT - 2016 Tapeout
09:22PM EDT - 125 Mb of SRAM
09:22PM EDT - First prototype
09:22PM EDT - First prototype headset*
09:23PM EDT - HPU 2 is dedicated to only Microsoft workloads
09:23PM EDT - Targets a single Microsoft RTOS
09:23PM EDT - No MMUs, simple interrupts
09:23PM EDT - Frees up the hardware
09:23PM EDT - Works with the software team to configure caches and memory
09:24PM EDT - Balance between dedicated HW compute and flexibility / programmability
09:24PM EDT - SIMD Fixed Point at top
09:24PM EDT - Does 2D processing
09:24PM EDT - FVP, Floating Vector Processor on bottom, does 3D
09:24PM EDT - 2 Tensiilica processors per node
09:24PM EDT - Trade off area for latency - low latency was key
09:24PM EDT - DMA channel per core
09:25PM EDT - New depth based algorithms
09:25PM EDT - 13 statically assigned compute cores
09:25PM EDT - >1 TOP of programmable compute
09:25PM EDT - 100s of customized instructions
09:26PM EDT - Algorithm profiling to turn 10s of ops into a single instruction
09:27PM EDT - Example, boxavg_2x16x8 is a single cycle instruction
09:27PM EDT - instruction is applied to every pixel, saving 10k+ cycles per frame
09:27PM EDT - Hardened compute on ToF sensor
09:27PM EDT - JBL filter
09:28PM EDT - Uses 3 sensors and applies filter
09:28PM EDT - But didn't fit on the node. But adjusted a C model into RTL, for hardware. Reduces power to 1/3, and 1/30th latency
09:29PM EDT - Now thermals
09:30PM EDT - Power gating, clock gating, removing ULV cells
09:31PM EDT - Most digital logic at 250 MHz, compute at 500 MHz
09:31PM EDT - Reduced voltage, Vmin
09:31PM EDT - DVFS per chip
09:32PM EDT - Could take the guard bands off
09:32PM EDT - Can reduce the power by 20%
09:32PM EDT - at Vmin
09:32PM EDT - Now system integration
09:33PM EDT - HPU in front, App processor in back
09:33PM EDT - PCIe 2.0 x1 at 100 MB/s comms between front and back
09:34PM EDT - Rendered images sent back via MIPI to HPU
09:34PM EDT - MIPI QoS rates
09:35PM EDT - 6.8 GB/s needed to sync into two lanes of LPDDR4
09:35PM EDT - Custom DRAM scheduler on HPU 2.0
09:36PM EDT - Hologram stability
09:37PM EDT - Multiple pose updates per frame
09:37PM EDT - Hardened block on HPU decouples the render resolution to display resolution
09:37PM EDT - Gives more thermal headroom to GPU
09:38PM EDT - HPU Timestamps the sensor data as it comes in
09:38PM EDT - Hardened neural network
09:39PM EDT - Q&A
09:40PM EDT - Q: Comment on depth camera? A: Custom ToF, there's a lot of literature out.
09:40PM EDT - Q: Scheduler? A: Statically assigned algorithms to the compute units
09:41PM EDT - Q: Transfer data between different VFPs? A: Small amount of bandwidth between VFPs, but mostly between memory.
09:43PM EDT - That's a wrap! Thank you for staying with us through all the Hot Chips coverage!