To satisfy machine learning (ML) practitioners’ insatiable demand for higher processing power, computer architects have been at the forefront of developing “accelerated” computing solutions for ML (e.g., Google TPUs or NVIDIA Tensor Cores) that have changed the landscape of the computing industry. The market has been in an arms race to design the fastest and most energy-efficient ML accelerator, with progress often quantified in TOPS (tera-operations per second) and TOPS/Watt. As a result, the past five years have seen a remarkable improvement in the raw compute throughput and energy efficiency delivered by the latest ML accelerators.
Ironically, because computer architects have done such an amazing job addressing the computation bottlenecks of ML, the compute primitives accelerated by conventional, GEMM-optimized (general matrix multiplication) ML accelerators are becoming relatively less of a concern in several emerging ML applications. In particular, personalized recommendation models for consumer-facing products (e.g., e-commerce, ads) employ “sparse” embedding layers, which stand out for their high memory capacity and bandwidth demands, rendering conventional dense-optimized TPUs/GPUs suboptimal for training and deploying recommendation models. In this talk, I will share our memory-centric approach to designing system architectures for recommendation models, which overcomes several key challenges of prior, compute-centric AI systems.
Minsoo Rhu is an associate professor in the School of Electrical Engineering at KAIST. Prior to joining KAIST, he was a Senior Research Scientist at NVIDIA Research. His current research interests include computer system architecture, machine learning accelerators, and ASIC/FPGA prototyping. Recognitions of his work include the Facebook Faculty Research Award, the Samsung Humantech Paper Award (Gold), and an IEEE Micro Top Picks Honorable Mention, among others.