Original Link: https://www.anandtech.com/show/16007/hot-chips-2020-live-blog-manticore-4096core-riscv-330pm-pt



06:35PM EDT - Who wants all the RISC-V cores?!?

06:36PM EDT - Ever growing demand for compute

06:36PM EDT - Energy efficiency is critical

06:37PM EDT - Lots of CPUs burn power on the superfluous elements of out-of-order execution

06:38PM EDT - Maximise compute datapath with respect to control

06:38PM EDT - Now for Manticore

06:38PM EDT - 220mm2 per chip

06:38PM EDT - (estimated in 22FDX GloFo)

06:38PM EDT - Four chiplets

06:39PM EDT - die-to-die serial link to each other die

06:39PM EDT - 8 GB HBM2 per die private to that die

06:40PM EDT - Four quadrants of 32 clusters per chiplet

06:40PM EDT - Clusters can do 64 TB/s with each other

06:41PM EDT - 4x L1 quadrants share an L1 cache

06:41PM EDT - Bandwidth thinning scheme to optimize bandwidth to HBM without affecting floorplan

06:41PM EDT - Support a lot of cluster-to-cluster traffic

06:42PM EDT - Each compute cluster has 8 RV32G Snitch cores
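
As a quick sanity check on the headline core count, here is a minimal sketch that multiplies out the hierarchy from the entries above. It assumes "four quadrants of 32 clusters" means 32 clusters in each quadrant; the constant and function names are illustrative, not from the talk.

```python
# Topology figures as reported in the live blog entries above.
CHIPLETS = 4
QUADRANTS_PER_CHIPLET = 4
CLUSTERS_PER_QUADRANT = 32   # assumption: 32 clusters in each quadrant
CORES_PER_CLUSTER = 8
HBM2_GB_PER_CHIPLET = 8


def manticore_totals():
    """Multiply out the hierarchy: chiplet -> quadrant -> cluster -> core."""
    clusters_per_chiplet = QUADRANTS_PER_CHIPLET * CLUSTERS_PER_QUADRANT
    cores_per_chiplet = clusters_per_chiplet * CORES_PER_CLUSTER
    return {
        "clusters_per_chiplet": clusters_per_chiplet,
        "cores_per_chiplet": cores_per_chiplet,
        "total_cores": CHIPLETS * cores_per_chiplet,
        "total_hbm2_gb": CHIPLETS * HBM2_GB_PER_CHIPLET,
    }


if __name__ == "__main__":
    for name, value in manticore_totals().items():
        print(f"{name}: {value}")
```

Under that reading the product comes out to the 4096 cores in the session title.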

06:42PM EDT - Each core has a multi-format SIMD compute unit

06:42PM EDT - supports half-precision bfloat, FP8

06:42PM EDT - Custom ISA extensions

06:44PM EDT - Goal was to maximize compute/control die area ratio

06:44PM EDT - Async with DMA Engine

06:44PM EDT - XSSR - Stream semantic registers

06:44PM EDT - Turn register read/writes into implicit memory load/stores
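
To make the stream semantic register idea concrete, here is a toy Python model, not the actual hardware or ISA encoding. It assumes a simple base-plus-stride address generator; the class name, the `ft0`/`ft1` variable names, and the flat-list memory are all illustrative. The point it shows is that each register read implicitly performs the load, so the loop body holds only the arithmetic.

```python
class StreamRegister:
    """Toy model of a read-only stream semantic register.

    Every read() returns the next element of a strided memory stream,
    standing in for the explicit load a normal core would have to issue.
    """

    def __init__(self, memory, base, stride, length):
        self.memory = memory
        self.addr = base          # next address to fetch (hypothetical address generator)
        self.stride = stride
        self.remaining = length

    def read(self):
        if self.remaining == 0:
            raise IndexError("stream exhausted")
        value = self.memory[self.addr]
        self.addr += self.stride
        self.remaining -= 1
        return value


def dot_with_streams(memory, a_base, b_base, n):
    """Dot product where both operands arrive through stream registers."""
    ft0 = StreamRegister(memory, a_base, 1, n)
    ft1 = StreamRegister(memory, b_base, 1, n)
    acc = 0.0
    for _ in range(n):
        # One multiply-accumulate per iteration; no explicit load instructions.
        acc += ft0.read() * ft1.read()
    return acc


if __name__ == "__main__":
    mem = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(dot_with_streams(mem, a_base=0, b_base=3, n=3))
```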

06:45PM EDT - increases FPU/ALU utilization by 3x-5x

06:46PM EDT - Extension in the core register file

06:47PM EDT - Latency tolerant approach

06:47PM EDT - XFREP - Floating Point Repetition Buffer (programmable micro-loop buffer)

06:48PM EDT - custom instruction indicates start of hardware loop block

06:48PM EDT - 'Pseudo-dual issue' as integer core can work at the same time

06:49PM EDT - SSRs only work on float-only hardware loops

06:49PM EDT - FREP marks the loop

06:50PM EDT - For example, reduction!

06:52PM EDT - single-issue core can saturate an FPU

06:52PM EDT - IPC > 1

06:52PM EDT - FREP acts as instruction amplifier
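
A rough way to see the "instruction amplifier" effect is to count instructions. The sketch below is a back-of-the-envelope model under stated assumptions: the loop marker counts as one issued instruction, the loop body is fetched once, and the repetition buffer replays it for every iteration. The function names are hypothetical.

```python
def frep_instruction_counts(body_len, iterations):
    """Return (issued_by_core, executed_by_fpu) for a replayed loop body.

    Assumption: the integer core issues one loop-marker instruction plus the
    body once; the repetition buffer then replays the body each iteration.
    """
    issued = 1 + body_len
    executed = body_len * iterations
    return issued, executed


def amplification(body_len, iterations):
    """FPU instructions executed per instruction issued by the integer core."""
    issued, executed = frep_instruction_counts(body_len, iterations)
    return executed / issued


if __name__ == "__main__":
    # Example: a one-instruction body (a single fused multiply-add) repeated 256 times.
    print(frep_instruction_counts(1, 256))
    print(amplification(1, 256))
```

In this model the amplification grows with the trip count, which is one way to read the claim that a single-issue core can keep an FPU saturated.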

06:53PM EDT - increased utilization for matmul and dotproduct that might be memory bound

06:54PM EDT - Up to 80 DP GFLOPs/W per cluster

06:55PM EDT - Close tracking of roofline model
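
For readers unfamiliar with the roofline model, here is a minimal sketch of the standard formula: attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity. The numbers in the example are made up for illustration and are not Manticore figures.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Attainable performance under the roofline model.

    peak_flops:    peak compute throughput (e.g. GFLOP/s)
    mem_bandwidth: memory bandwidth (e.g. GB/s)
    intensity:     arithmetic intensity of the kernel (FLOP per byte)
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Intensity at which a kernel stops being memory bound."""
    return peak_flops / mem_bandwidth


if __name__ == "__main__":
    # Hypothetical machine: 16 GFLOP/s peak, 8 GB/s memory bandwidth.
    peak, bandwidth = 16.0, 8.0
    for intensity in (0.5, 2.0, 4.0):
        print(intensity, roofline(peak, bandwidth, intensity))
```

"Close tracking" then means measured kernel performance sits near this bound across intensities.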

06:56PM EDT - 9mm2 prototype made

06:56PM EDT - 22nm FDX

06:56PM EDT - Forward Body Biasing

06:56PM EDT - This is only a small prototype of the chiplet's compute core

06:57PM EDT - Snitch cores used for DVFS and IO management

06:58PM EDT - Full 4096 core system expected 27 DP Flops/sec

07:00PM EDT - In max perf mode, competitive vs A100 FP64

07:02PM EDT - snitch inside

07:05PM EDT - Q&A time

07:06PM EDT - Q: How does the compiler target the new instructions? A: Loop detection to promote loops that have the required characteristics. Might not always hit all cases - so go down QDNN, offer optimized low level kernels that frameworks would support

07:07PM EDT - Q: Productization? A: Concept so far to explore the key components. Wanted lean and mean RISC-V cores. Still missing the key components at SoC level, such as interconnects, which are hard to come by as a university. Looking into generating and taping out a later system as a research concept in the future.

07:08PM EDT - That's a wrap. Short break until the next session, at half-past. Baidu + Alibaba NPUs
