A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference

Bruce Fleischer; Sunil Shukla; Matt Ziegler; Joel Silberman; Jinwook Oh; Vijayalakshmi Srinivasan; Jungwook Choi; Silvia Müller; Ankur Agrawal; Tina Babinsky; Nianzheng Cao; Chia-Yu Chen; Pierce Chuang; Thomas Fox; George Gristede; Michael Guillorn; Howard Haynie; Michael Klaiber; Dongsoo Lee; Shih Hsien Lo; Gary Maier; Michael Scheuermann; Swagath Venkataramani; Christos Vezyrtzis; Naigang Wang; Fanchieh Yee; Ching Zhou; Pong-Fei Lu; Brian Curran; Lel Chang; Kailash Gopalakrishnan

doi:10.1109/VLSIC.2018.8502276

VLSI Circuits 2018

Conference paper

22 Oct 2018

Abstract

A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.
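The peak figures in the abstract can be related to the clock rate with simple arithmetic. The sketch below is illustrative only: it assumes peak throughput equals operations per cycle times clock frequency, and the names `CLOCK_HZ`, `PEAK_OPS`, and `ops_per_cycle` are hypothetical, not taken from the paper. How those per-cycle operations map onto hardware units (for example, whether a fused multiply-add counts as two operations) is not stated in the abstract and is not assumed here.

```python
# Figures quoted in the abstract: 1.5 GHz clock and peak throughput per precision.
CLOCK_HZ = 1_500_000_000
PEAK_OPS = {
    "fp16": 1_500_000_000_000,      # 1.5 TFLOPS
    "ternary": 12_000_000_000_000,  # 12 TOPS
    "binary": 24_000_000_000_000,   # 24 TOPS
}


def ops_per_cycle(peak_ops_per_second, clock_hz=CLOCK_HZ):
    """Operations per clock cycle, assuming peak = ops/cycle * frequency."""
    return peak_ops_per_second / clock_hz


# Implied operations per cycle for each precision.
per_cycle = {name: ops_per_cycle(peak) for name, peak in PEAK_OPS.items()}

# Peak throughput of each precision relative to fp16.
relative_to_fp16 = {name: peak / PEAK_OPS["fp16"] for name, peak in PEAK_OPS.items()}

for name in PEAK_OPS:
    print(f"{name}: {per_cycle[name]:.0f} ops/cycle, "
          f"{relative_to_fp16[name]:.0f}x fp16 peak")
```

Under that assumption, the quoted numbers imply 1,000 fp16 operations per cycle, with ternary and binary modes offering 8x and 16x the fp16 peak respectively.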

Conference paper

POWER8™: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth

Eric J. Fluhr, Joshua Friedrich, et al.

ISSCC 2014

Paper

A 7-nm Four-Core Mixed-Precision AI Chip with 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Sae Kyu Lee, Ankur Agrawal, et al.

IEEE JSSC

Conference paper

Scalable Auto-Tuning of Synthesis Parameters for Optimizing High-Performance Processors

Matt Ziegler, Hung-Yi Liu, et al.

ISLPED 2016

Conference paper

INVITED: Accelerator Design for Deep Learning Training: Extended Abstract: Invited

Ankur Agrawal, Chia-Yu Chen, et al.

DAC 2017
