Hot Chips 2018: Fujitsu's A64FX Arm Core Live Blog - Print View

Original Link: https://www.anandtech.com/show/13258/hot-chips-2018-fujitsu-afx64-arm-core-live-blog

Hot Chips 2018: Fujitsu's A64FX Arm Core Live Blog

by Dr. Ian Cutress on August 21, 2018 8:25 PM EST

9 Comments

08:30PM EDT - Remember back when Arm announced Scalable Vector Extensions? Well Fujitsu has made an Arm CPU that uses it with a 512-bit width. The presentation looks super interesting, so follow along with our live blog. The talk is set to start at 5:30pm PT / 12:30am UTC.

08:32PM EDT - Last time we were here, had a 3-min presentation about Post-K

08:32PM EDT - Called A64FX

08:32PM EDT - First chip to use Arm SVE

08:32PM EDT - Scalable Vector Extensions

08:33PM EDT - New microarch maximises SVE perf

08:33PM EDT - Fujitsu has been making processors for 60 years

08:34PM EDT - SPARC? Remember that?

08:34PM EDT - UNIX, HPC, Mainframe, now HPC + AI

08:34PM EDT - New CPU inherits DNA from Fujitsu

08:34PM EDT - Reliability, speed, flexibility, high perf/watt

08:34PM EDT - end up with CPU w/ extremely high throughput

08:35PM EDT - low power

08:35PM EDT - (A64FX doesn't mean Athlon 64, FX)

08:35PM EDT - Optimized for massively parallel

08:35PM EDT - Four features

08:35PM EDT - Perf: FP64 through to INT8

08:36PM EDT - Throughput: 512-bit SIMD x 2 pipes/core, HBM2, 48-cores, Tofu interconnect

08:36PM EDT - Efficiency: GEMM and Triad perf

08:37PM EDT - Standards: Arm v8.2 + SVE + SBSA level 3 (Server Base System Architecture)

08:37PM EDT - AArch64 only, no 32

08:37PM EDT - 48 computing cores and 4 identical assistant cores

08:37PM EDT - 32GiB HBM2

08:38PM EDT - 6D Mesh - 28 Gbps x 2 lanes x 10 ports

08:38PM EDT - PCIe 3.0 x16

08:38PM EDT - Built on 7nm FinFET

08:38PM EDT - 8.786B transistors, but only 594 pin

08:38PM EDT - 2.7 TFLOPS

08:38PM EDT - 1TB/s memory bandwidth

08:39PM EDT - ISA feature support

08:39PM EDT - Optimized SVE for wide range of applications

08:39PM EDT - INT8 Dot Product

08:39PM EDT - Enhanced compression

08:39PM EDT - AI applications

08:40PM EDT - HW Barrier and Sector cache - implementation defined system registers from AArch64

08:40PM EDT - Enahnced blocks in chip

08:40PM EDT - Predicated operations dedicated pipe

08:41PM EDT - SVE has limitation on operands - FMA equivalent requires destructive 3-operand FMA3

08:41PM EDT - MOVPRFX instruction

08:41PM EDT - hides overhead of main pipelin

08:42PM EDT - 21.6 TOPS for INT8 dot product

08:42PM EDT - 90% execution efficiency

08:42PM EDT - Still 2x in 64-bit DGEMM over SPARC64 PrimeHPC FX100

08:43PM EDT - Almost 20x the K comp in DGEMM

08:43PM EDT - L1 cache is key to design for 512-bit SIMD

08:43PM EDT - Combined Gather mechanism to increase throughput

08:44PM EDT - Combined Gather enables return up to two consecutive elements in a 128-byte aligned block

08:44PM EDT - Throughput per core is 32 bytes/cycle

08:44PM EDT - Full chip is Divided into four memory groups

08:45PM EDT - One CMG is 13 cores, an L2 cache, and a memory controller

08:45PM EDT - One core handles Daemon/IO

08:45PM EDT - Cache coherency by ccNUMA with on-chip directory

08:45PM EDT - X-bar connection for L2 cache efficiency

08:45PM EDT - Process binding ensures scaling

08:45PM EDT - Wide Ring Bus for IO across whole chip

08:46PM EDT - Bandwidth in cache and memory is key

08:46PM EDT - Out-of-order mechanisms in cores, caches, and IMCs

08:46PM EDT - L1 cache at 11.0 TB/s

08:46PM EDT - L2 cache is 3.6 TB/s

08:47PM EDT - Normalized compared to previous processor, perf is 2x across wide range of workloads

08:48PM EDT - For AI, convolution low precision is 9.4x using INT8 dot product

08:48PM EDT - Each chip has energy monitor in msec

08:49PM EDT - Each core has energy analyzer in nanosec

08:49PM EDT - Fine grained power analysis of a core, an L2 cache and memory

08:49PM EDT - Power Knob for optimization

08:49PM EDT - Can change hardware config for power

08:50PM EDT - Change decode width, floating point pipeline, and general frequency reduction

08:50PM EDT - Extensive RAS

08:50PM EDT - ECC on all caches

08:50PM EDT - Parity cehc on execution units

08:50PM EDT - 128400 error checkers

08:50PM EDT - Parity Check* on execution units

08:51PM EDT - Hardware instruction retry

08:52PM EDT - Software stacks developed by RIKEN and Fujitsu

08:52PM EDT - Will continue to use Arm in the future

08:52PM EDT - Work with partners

08:53PM EDT - Q&A time

08:54PM EDT - Q: When can you reach exascale? A: The Post-K system will be available in 2021. 100x perf from K-comp. But exa-scale not answerable

08:55PM EDT - Q: nanosecond level power monitoring - what techniques do you use? A: Activity based on coefficient based on operations

08:56PM EDT - Q: Support 64-bit FP, not 128-bit? A: No.

08:59PM EDT - That's a wrap. Next talk is on the NEC Vector processor: https://www.anandtech.com/show/13259/hot-chips-2018-nec-vector-processor-live-blog

Facebook

Twitter

RSS

About

Advertising

Privacy Policy

Contact Us

Terms of Use

Show Full Site

Privacy Settings

Copyright © 2025. All rights reserved.

Log in

Don't have an account? Sign up now

Lost your password?

Weâ€™ve updated our terms. By continuing to use the site and/or by logging into your account, you agree to the Siteâ€™s updated Terms of Use and Privacy Policy.