The contributed talks are split into two sessions.

Contributed Talk Session 1

  1. (10:30am – 10:45am) MERCI: Efficient Embedding Reduction on Commodity Hardware via Sub-Query Memoization
    • Author list:  Yejin Lee, Seong Hoon Seo, Hyunji Choi, Hyoung Uk Sul, Soosung Kim, Jae W. Lee, Tae Jun Ham
    • Abstract: Deep neural networks (DNNs) with embedding layers are widely adopted to capture complex relationships among entities within a dataset. Embedding layers aggregate multiple embeddings (dense vectors that represent the complicated nature of a data feature) into a single embedding; this operation is called embedding reduction. Embedding reduction spends a significant portion of its runtime on reading embeddings from memory and is therefore heavily memory bandwidth-bound. Recent works attempt to accelerate this critical operation, but they often require either hardware modifications or emerging memory technologies, which makes them hard to deploy on commodity hardware. Thus, we propose MERCI, memoization for embedding reduction with clustering, a novel memoization framework for efficient embedding reduction. MERCI provides a mechanism for memoizing the partial aggregation of correlated embeddings and retrieving the memoized partial result at a low cost. MERCI substantially reduces the number of memory accesses by 44% (29%), leading to a 102% (74%) throughput improvement on real machines and 40.2% (28.6%) energy savings at the expense of 8× (1×) additional memory usage. (A short illustrative sketch of the memoization idea follows this entry.)
    • Speaker bio: Yejin Lee is a Ph.D. candidate in the Department of Computer Science and Engineering at Seoul National University, working with Professor Jae W. Lee. Her current research interests include hardware-software co-design for big data analytics and data mining. Specifically, her research aims to accelerate these domains by identifying the relative criticality of data and supporting critical data with specialized hardware architectures. Her research has been recognized with an IEEE Micro Top Picks distinction and has appeared in top computer architecture and systems venues such as ASPLOS and ISCA.
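    • Illustrative sketch: the core idea, memoizing partial aggregations of items that frequently co-occur, can be sketched in a few lines of Python. The table sizes, the memoized cluster, and the function below are illustrative assumptions, not MERCI's actual implementation.

```python
import numpy as np

# Toy embedding table: one row per item (sizes are illustrative).
DIM = 16
table = np.random.rand(1000, DIM)

# Suppose items {3, 7, 42} frequently co-occur in queries; memoize their
# partial aggregation once instead of re-reading three rows on every query.
memo_key = frozenset({3, 7, 42})
memo = {memo_key: table[list(memo_key)].sum(axis=0)}

def reduce_embeddings(item_ids):
    """Sum-pool the embeddings of item_ids, reusing memoized partial sums."""
    remaining = set(item_ids)
    result = np.zeros(DIM)
    for key, partial in memo.items():
        if key <= remaining:          # memoized subset appears in this query
            result += partial         # one memory read instead of len(key) reads
            remaining -= key
    for i in remaining:               # leftover items: regular table lookups
        result += table[i]
    return result

print(reduce_embeddings([3, 7, 42, 99]))
```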
  2. (10:45am – 11:00am) Erasure Coding Based Fault Tolerance for Recommendation Model Training
    • Author list: Kaige Liu, Jack Kosaian, Rashmi Vinayak
    • Abstract: Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content to users. DLRMs are large in size due to their use of large embedding tables, and are trained by distributing the model across the memory of tens or hundreds of servers. Server failures are common in such large distributed systems and must be mitigated to enable training to progress. Checkpointing is the primary approach used for fault tolerance in these systems, but it incurs significant training-time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger burden for future DLRMs, which are expected to grow in size. This calls for rethinking fault tolerance in DLRM training. We present ECRM, a DLRM training system that achieves efficient fault tolerance using erasure coding. ECRM chooses which DLRM parameters to encode, correctly and efficiently updates parities, and enables training to proceed without any pauses, while maintaining consistency of the recovered parameters. We implement ECRM atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRM reduces training-time overhead for large DLRMs by up to 88%, recovers from failures up to 10.3x faster, and allows training to proceed during recovery. These results show the promise of erasure coding in imparting efficient fault tolerance to training current and future DLRMs. (A short illustrative sketch of the parity-update idea follows this entry.)
    • Speaker bio: Kaige Liu is a software engineer at Facebook. He received his B.S. and M.S. in Computer Science from Carnegie Mellon University.
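    • Illustrative sketch: the parity-update idea behind erasure-coded training state can be sketched as follows. The sum-parity code, shard layout, and update rule below are illustrative assumptions, not the ECRM/XDL implementation.

```python
import numpy as np

# One embedding row per "server" plus a sum parity held elsewhere (toy sizes).
k, dim = 4, 8
shards = [np.random.rand(dim) for _ in range(k)]
parity = np.sum(shards, axis=0)

def apply_update(i, grad, lr=0.1):
    """Update one row and keep the parity consistent by applying the same delta."""
    global parity
    delta = -lr * grad
    shards[i] += delta
    parity += delta

def recover(failed):
    """Reconstruct a lost row from the parity and the surviving rows."""
    return parity - sum(s for j, s in enumerate(shards) if j != failed)

apply_update(2, np.random.rand(dim))
print(np.allclose(recover(2), shards[2]))   # True: row 2 is recoverable
```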
  3. (11:00am – 11:15am) Elliot: A Comprehensive and Rigorous Framework For Reproducible Recommender Systems Evaluation
    • Author list: Vito Walter Anelli, Alejandro Bellogín, Antonio Ferrara, Daniele Malitesta, Felice Antonio Merra, Claudio Pomo, Francesco Maria Donini, Tommaso Di Noia
    • Abstract: The impressive number of recommendation algorithms, splitting strategies, evaluation protocols, metrics, and tasks has made rigorous experimental evaluation challenging, especially when neural architectures are involved. Puzzled and frustrated by the endless recreation of appropriate evaluation benchmarks, we have developed a framework to address these needs. Elliot is a comprehensive recommendation framework that aims to run and reproduce an entire experimental pipeline by processing a simple configuration file. With Elliot it is possible to reproduce, in a rigorous way, the entire evaluation pipeline across 51 recommendation models (25 of them DNN-based), optimize model hyperparameters, and explore several dimensions of evaluation (36 metrics), including 13 state-of-the-art fairness and bias metrics.
    • Speaker bio: Vito Walter Anelli is an Assistant Professor at Polytechnic University of Bari, affiliated with the Information Systems Laboratory (SisInf Lab). His current research interests fall in the areas of Recommender Systems, Knowledge Representation, and User Modeling. He received the best research student paper award at ISWC 2019. He has recently published a book chapter on interpretable Recommender Systems, and has published his work in international conferences, e.g., ECIR, RecSys, UMAP, SAC, ISWC, and ESWC, as well as in international journals, e.g., UMUAI, TKDE, and SWJ. Furthermore, he has served as chair of international workshops on knowledge-aware recommender systems, the RecSys 2021 challenge track, and the RecSys 2020 challenge track.
    • Speaker bio: Claudio Pomo is a PhD student at the Polytechnic University of Bari supervised by Prof. Tommaso Di Noia and Prof. Francesco Donini. His research interests include interpretable and explainable Recommender Systems.
  4. (11:15am – 11:30am) Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures
    • Author list: Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, and Alexander Heinecke
    • Abstract: During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus, we focus on recommender systems, which account for most of the AI cycles in cloud computing centers. More specifically, we focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware and software tailored for HPC, we are able to achieve up to a two-orders-of-magnitude improvement in performance on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets that cannot be held in a single node's memory. Therefore, this paper discusses and analyzes novel optimization and parallelization techniques for the various operators in DLRM. Several optimizations (e.g., tensor-contraction-accelerated MLPs, framework MPI progression, BFLOAT16 training with up to 1.8x speed-up) are general and transferable to many other deep learning topologies.
    • Speaker bio: Dhiraj is a research scientist in Intel's Parallel Computing Lab in Bangalore. His research interests include parallel computer architecture, GPGPU architectures, and hardware-specific single-node and distributed performance-scaling optimizations. Recently, he has been working on analyzing and optimizing deep learning workloads, frameworks, and libraries for Intel Xeon and GPU architectures. Dhiraj led the early efforts to demonstrate the superiority of BFloat16 over INT16 for training DL workloads, helping set the direction of the low-precision DL roadmap for future Xeon generations. In general, he has a proven track record of demonstrating how to realize the best performance on Xeon processors.
  5. (11:30am – 11:45am) Main-Memory Acceleration for Bandwidth-Bound Deep Learning Inference
    • Author list: Benjamin Cho, Jeageun Jeung, Mattan Erez
    • Abstract: Deep learning (DL) inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles are spent processing them. Specifically, matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference tasks are memory bandwidth bound, contrary to common assumptions: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large colocated models require reading the large weight matrices from main memory, again requiring high bandwidth without offering reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory. We develop StepStone PIM, a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU that would otherwise destroy locality. (A short back-of-the-envelope sketch of why small-batch GEMMs are bandwidth-bound follows this entry.)
    • Speaker bio:  Benjamin Cho is a Ph.D. student advised by Prof. Mattan Erez at the University of Texas at Austin. His research focuses on enabling processing in/near memory devices in conventional CPU systems and leveraging computing resources of the CPU and PIMs in parallel. His current interest is in applying this approach to DL inference tasks in warehouse-scale servers and database management systems.
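    • Illustrative sketch: a quick roofline-style estimate makes the bandwidth-bound claim concrete; the layer sizes below are illustrative, not taken from the paper.

```python
# Arithmetic intensity (FLOPs per byte of memory traffic) of a fully-connected layer.
def arithmetic_intensity(batch, in_dim, out_dim, bytes_per_elem=4):
    flops = 2 * batch * in_dim * out_dim                    # multiply-adds
    traffic = bytes_per_elem * (batch * in_dim              # input activations
                                + in_dim * out_dim          # weights
                                + batch * out_dim)          # output activations
    return flops / traffic

for batch in (1, 4, 256):
    ai = arithmetic_intensity(batch, in_dim=1024, out_dim=1024)
    print(f"batch={batch:4d}  ~{ai:.1f} FLOP/byte")
# At batch sizes 1-4 the intensity is roughly 0.5-2 FLOP/byte, far below what is
# needed to reach peak compute, so the GEMM is limited by memory bandwidth.
```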
  6. (11:45am – 12:00pm) DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
    • Author list: Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, Carole-Jean Wu
    • Abstract: Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant portion of the compute demand on cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account the characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines. (A short illustrative sketch of query-size-aware scheduling follows this entry.)
    • Speaker bio: Udit Gupta is a 5th-year PhD student in CS at Harvard University; he received his B.S. in ECE from Cornell University in 2016. His research interests focus on improving the performance and energy efficiency of emerging applications in computer systems and architecture by co-designing solutions across the computing stack. His recent work explores the characterization and optimization of at-scale deployment of deep-learning-based personalized recommendation systems.
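    • Illustrative sketch: a minimal version of a query-size-aware scheduler, in the spirit of DeepRecSched, is shown below; the threshold and queue setup are illustrative assumptions, not the paper's implementation.

```python
from queue import SimpleQueue

cpu_queue, gpu_queue = SimpleQueue(), SimpleQueue()
QUERY_SIZE_THRESHOLD = 64     # tuned per model and platform in the real system

def schedule(query_id, num_candidate_items):
    """Route a recommendation query by how many items it must score."""
    if num_candidate_items <= QUERY_SIZE_THRESHOLD:
        cpu_queue.put((query_id, num_candidate_items))   # latency-critical path
    else:
        gpu_queue.put((query_id, num_candidate_items))   # throughput (accelerator) path

for qid, size in [(0, 12), (1, 300), (2, 48)]:
    schedule(qid, size)
print(cpu_queue.qsize(), gpu_queue.qsize())              # 2 1
```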

Contributed Talk Session 2

  1. (4:00pm – 4:15pm) Cross-Stack Workload Characterization of Deep Recommendation Systems
    • Author list: Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, David Brooks
    • Abstract: Deep learning based recommendation systems form the backbone of most personalized cloud services. Though the computer architecture community has recently started to take notice of deep recommendation inference, the resulting solutions have taken wildly different approaches, ranging from near-memory processing to at-scale optimizations. To better design future hardware systems for deep recommendation inference, we must first systematically examine and characterize the underlying systems-level impact of design decisions across the different levels of the execution stack. In this paper, we characterize eight industry-representative deep recommendation models at three different levels of the execution stack: algorithms and software, systems platforms, and hardware microarchitectures. Through this cross-stack characterization, we first show that system deployment choices (i.e., CPUs or GPUs, batch size granularity) can yield up to a 15x speedup. To better understand the bottlenecks for further optimization, we look at both software operator usage breakdown and CPU frontend and backend microarchitectural inefficiencies. Finally, we model the correlation between key algorithmic model architecture features and hardware bottlenecks, revealing the absence of a single dominant algorithmic component behind each hardware bottleneck.
    • Speaker bio: Samuel Hsia is a Computer Science PhD student at Harvard University, supported by an NSF Graduate Research Fellowship. Samuel is a member of the Harvard Architecture, Circuits, and Compilers group, advised by Professors Brooks and Wei. Samuel’s research interests include computer architecture and systems for machine learning. Recently, his research has focused on the systems-level implications of deep recommendation systems. This work includes both at-scale optimizations and cross-stack (algorithms to microarchitecture) characterizations of deep recommendation inference. Prior to Harvard, Samuel graduated from Princeton University with a bachelor’s degree in Electrical Engineering and minors in Computer Science and Statistics & Machine Learning.
  2. (4:15pm – 4:30pm) Accelerated Learning by Exploiting Popular Choices
    • Author list: Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant Nair
    • Abstract: Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications. Current recommendation models include deep-learning-based (DLRM) and time-based sequence (TBSM) models. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory bound) while also using neural networks to generate outputs (compute bound). Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU for embedding and neural network executions, respectively. Such a training process naively assigns the same level of importance to each embedding entry. This paper observes that some training inputs and their accesses into the embedding tables are heavily skewed, with certain entries being accessed up to 10000× more. This paper leverages skewed embedding table accesses to efficiently use GPU resources during training. To this end, this paper proposes a Frequently Accessed Embeddings (FAE) framework that provides three key features. First, it exposes a dynamic knob to the software, based on the GPU memory capacity and the input popularity index, to efficiently estimate and vary the size of the hot portions of the embedding tables. These hot embedding tables can then be stored locally on each GPU. FAE uses statistical techniques to determine the knob, which is a threshold on embedding accesses, without profiling the entire input dataset. Second, FAE pre-processes the inputs to segregate hot inputs (which only access hot embedding entries) and cold inputs into a collection of hot and cold mini-batches. This ensures that a training mini-batch is either entirely hot or entirely cold to obtain most of the benefits. Third, at runtime FAE generates a dynamic schedule for the hot and cold training mini-batches that minimizes data transfer latency between CPU and GPU executions while maintaining the model accuracy. The framework execution then uses the GPU(s) for hot input mini-batches and a baseline CPU-GPU mode for cold input mini-batches. Overall, our framework speeds up the training of the recommendation models on Kaggle, Terabyte, and Alibaba datasets by 2.34× as compared to a baseline that uses Intel Xeon CPUs and Nvidia Tesla V100 GPUs, while maintaining accuracy. (A short illustrative sketch of the hot/cold split follows this entry.)
    • Speaker bio: Muhammad Adnan is a second-year M.A.Sc. student at the University of British Columbia, where he is transitioning into the Ph.D. program. He is advised by Prof. Prashant Nair. His research focuses on hardware-software co-design to optimize deep neural networks. To this end, he has investigated techniques for finding optimal memory hierarchies, reducing the search space for energy-efficient machine learning accelerator designs. He is also broadly interested in recommendation models and is currently exploring system- and architecture-level techniques to accelerate their training. In his free time, he likes to play tennis and hike the majestic mountains near Vancouver.
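    • Illustrative sketch: the hot/cold split at the heart of FAE can be sketched as below; the sampling, threshold, and classification rule are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
# Profile only a sample of the training inputs instead of the whole dataset.
sampled_lookups = rng.zipf(a=2.0, size=100_000) % 1_000   # skewed item ids

counts = Counter(sampled_lookups.tolist())
hot_threshold = 50               # derived from GPU capacity and popularity in FAE
hot_items = {item for item, c in counts.items() if c >= hot_threshold}

def classify_minibatch(batch_item_ids):
    """A mini-batch is 'hot' only if every lookup hits a hot embedding row."""
    return "hot" if all(i in hot_items for i in batch_item_ids) else "cold"

print(classify_minibatch([1, 2]), classify_minibatch([1, 997]))  # typically: hot cold
```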
  3. (4:30pm – 4:45pm) Towards Disaggregated Memory Recommenders
    • Author list: Talha Imran, Nadav Amit, Irina Calciu
    • Abstract: Recommendation systems have high memory capacity and bandwidth requirements. Disaggregated memory is an upcoming technology that can improve memory utilization, increase memory capacity and bandwidth, and allow independent compute and memory scaling. We investigate the use of disaggregated memory for recommenders, and we posit that this new technology can provide significant benefits for recommenders' training and inference.
    • Speaker bio: Talha Imran is a M.Sc. student at Penn State CSE, working on datacenter scale systems. He is exploring how disaggregated systems may be adopted for modern service-oriented cloud deployments such as recommender services at scale and serverless computing. He has worked as a summer intern at VMware Research and Facebook. He also has prior industrial experience developing software tools for embedded systems. He is currently on the job market looking for full-time opportunities in the broad area of datacenter scale systems.
  4. (4:45pm – 5:00pm) Scalability, Latency, Flexibility: The Case for Similarity Search as a Service
    • Author list: Amir Sadoughi, Edo Liberty, Lior Ehrenfeld, Ron Begleiter, Fei Yu, Mark Chew, Jack Pertschuk, Roei Mutay, Greg Kogan, Beni Ran
    • Abstract: Modern deep learning models can represent arbitrary objects as vectors, also known as embeddings. Software applications can use these deep learning models and their respective embeddings to power a variety of use cases, including personalization, recommendation systems, image search, anomaly detection, and more. To date, software engineers could build these systems by integrating open-source k-nearest-neighbor libraries with an off-the-shelf web server. However, such a solution presents serious challenges in terms of scalability, latency, and flexibility. To address these challenges, we built Pinecone, providing similarity search as a service. (A short illustrative sketch of the underlying nearest-neighbor lookup follows this entry.)
    • Speaker bio: Amir Sadoughi, currently Head of Engineering at Pinecone, is a technical leader with over a decade of experience leading efforts in distributed systems. His career spans multiple industries, including machine learning (Amazon SageMaker), cloud computing (AWS, Rackspace), and high-frequency trading (RGM Advisors). Amir holds a Bachelor of Science degree from UC Berkeley.
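    • Illustrative sketch: the core operation such a service wraps is k-nearest-neighbor search over embeddings; a brute-force cosine-similarity index is sketched below (illustrative only, not Pinecone's API or internals).

```python
import numpy as np

class TinyIndex:
    """Brute-force cosine-similarity index over unit-normalized vectors."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids = []

    def upsert(self, item_id, vector):
        v = np.asarray(vector, dtype=np.float32).reshape(1, self.dim)
        self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])
        self.ids.append(item_id)

    def query(self, vector, top_k=3):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q               # cosine similarity on unit vectors
        best = np.argsort(-scores)[:top_k]
        return [(self.ids[i], float(scores[i])) for i in best]

index = TinyIndex(dim=4)
for i in range(10):
    index.upsert(f"item-{i}", np.random.rand(4))
print(index.query(np.random.rand(4), top_k=3))
```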
  5. (5:00pm – 5:15pm) Capacity-Driven Scale-Out Neural Recommendation: Enabling the Growing Scale of Recommendation
    • Author list: Mike Lui, Yavuz Yetim, Oz Ozkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead
    • Abstract: Deep learning recommendation models have grown to the terabyte scale. Traditional, simple serving schemes that load an entire model onto a single server are unable to support this scale. One approach to support this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. Thus, we perform capacity-driven, scale-out neural recommendation. This workshop talk will 1) introduce this problem, 2) discuss a first-step distributed approach, and 3) discuss the design space for the systems research community to develop novel model-serving solutions. This approach, along with parallel research into SSD-based serving of latency-constrained models and near-memory computing solutions, generates a large design space for serving such large models efficiently and at the scale of billions of users. (A short illustrative sketch of sharded embedding lookups follows this entry.)
    • Speaker bio: Mike Lui’s research at Drexel University focuses on new profiling and instrumentation methodologies to help bridge the enduring software-hardware co-design gap for architects. Initially focused on the microarchitecture and system uncore, he is now applying his knowledge and experience to large datacenter-scale systems at Facebook.
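    • Illustrative sketch: the basic distributed-inference step, partitioning embedding rows across servers and batching remote fetches per shard, is sketched below; the hash placement and lookup API are illustrative assumptions, not the system described in the talk.

```python
NUM_SERVERS = 4

def shard_for(table_id, row_id):
    """Map an embedding row to the server that holds it (simple hash placement)."""
    return hash((table_id, row_id)) % NUM_SERVERS

def distributed_lookup(requests):
    """Group embedding-row requests by owning server; each group becomes one remote fetch."""
    per_server = {s: [] for s in range(NUM_SERVERS)}
    for table_id, row_id in requests:
        per_server[shard_for(table_id, row_id)].append((table_id, row_id))
    return per_server

print(distributed_lookup([("user_table", 7), ("item_table", 42), ("item_table", 7)]))
```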
  6. (5:15pm – 5:30pm) Training with Multi-Layer Embeddings for Model Reduction
    • Author list: Benjamin Ghaemmaghami, Zihao Deng, Benjamin Cho, Leo Orshansky, Ashish Kumar Singh, Mattan Erez, Michael Orshansky
    • Abstract: Modern recommendation systems rely on real-valued embeddings of categorical features. Increasing the dimension of embedding vectors improves model accuracy but comes at a high cost in model size. We introduce a multi-layer embedding training (MLET) scheme that trains embeddings via a sequence of linear layers to derive a superior model accuracy vs. size trade-off. Our approach is fundamentally based on the ability of factorized linear layers to produce superior embeddings to those of a single linear layer. Harnessing recent results on the dynamics of backpropagation in linear neural networks, we explain the superior performance obtained by multi-layer embeddings by their tendency to have lower effective rank. We show that substantial advantages are obtained in the regime where the width of the hidden layer is much larger than the final embedding vector dimension. Crucially, at the conclusion of training, we convert the two-layer solution into a single-layer one: as a result, the inference-time model size is unaffected by MLET. We prototype MLET across seven different open-source recommendation models. We show that it allows a reduction in vector dimension of up to 16x, and 5.8x on average, across the models. This reduction correspondingly improves inference memory footprint while preserving model accuracy. (A short illustrative sketch of the factorize-then-collapse idea follows this entry.)
    • Speaker bio: Benjamin Ghaemmaghami is a 3rd-year PhD student in ECE at UT Austin.
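    • Illustrative sketch: the factorize-then-collapse idea is easy to see in a few lines; the shapes and random data below are illustrative assumptions, not the paper's training setup.

```python
import numpy as np

num_items, hidden, dim = 10_000, 256, 16        # hidden >> dim is the key regime
A = np.random.randn(num_items, hidden) * 0.01   # both factors are trained jointly
B = np.random.randn(hidden, dim) * 0.01

def train_time_lookup(item_id):
    return A[item_id] @ B                       # two linear layers during training

E = A @ B                                       # collapse once training is done
def inference_time_lookup(item_id):
    return E[item_id]                           # single table: inference size unchanged

print(np.allclose(train_time_lookup(42), inference_time_lookup(42)))  # True
```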
  7. (5:30pm – 5:45pm) Towards Automated Neural Interaction Discovery for Click-Through Rate Prediction
    • Author list: Qingquan Song, Dehua Cheng, Hanning Zhou, Jiyan Yang, Yuandong Tian, Xia Hu
    • Abstract: Click-Through Rate (CTR) prediction is one of the most important machine learning tasks in recommender systems, driving the personalized experience for billions of consumers. Neural architecture search (NAS), as an emerging field, has demonstrated its capability to discover powerful neural network architectures, which motivates us to explore its potential for CTR prediction. Due to 1) diverse unstructured feature interactions, 2) heterogeneous feature space, and 3) high data volume and intrinsic data randomness, it is challenging to construct, search, and compare different architectures effectively for recommendation models. To address these challenges, we propose an automated interaction architecture discovery framework for CTR prediction named AutoCTR. By modularizing simple yet representative interactions as virtual building blocks and wiring them into a space of directed acyclic graphs, AutoCTR performs evolutionary architecture exploration with learning-to-rank guidance at the architecture level and achieves acceleration using low-fidelity models. Empirical analysis demonstrates the effectiveness of AutoCTR on different datasets compared to human-crafted architectures. The discovered architectures also enjoy generalizability and transferability among different datasets. (A short illustrative sketch of the evolutionary search loop follows this entry.)
    • Speaker bio: Qingquan Song received his Ph.D. from the Department of Computer Science and Engineering at Texas A&M University, advised by Dr. Xia (Ben) Hu. His research focuses on automated machine learning, dynamic data analysis, tensor analysis, and their applications in recommender systems and social networks. He is also the author of the book Automated Machine Learning in Action from Manning Publications, and a co-author of the AutoKeras toolkit.
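    • Illustrative sketch: a minimal evolutionary search over interaction blocks, in the spirit of AutoCTR; the block set, mutation rule, and fitness stub are illustrative assumptions (the real system trains and ranks candidate CTR models rather than using a toy score).

```python
import random

BLOCKS = ["mlp", "dot_product", "fm"]           # simple interaction building blocks

def random_arch(n_nodes=4):
    return [random.choice(BLOCKS) for _ in range(n_nodes)]

def mutate(arch):
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(BLOCKS)
    return child

def fitness(arch):
    # Stand-in for low-fidelity training plus learning-to-rank guidance.
    return arch.count("fm") + 0.5 * arch.count("dot_product") + random.random()

population = [random_arch() for _ in range(8)]
for _ in range(20):
    parent = max(random.sample(population, 3), key=fitness)    # tournament selection
    population.append(mutate(parent))
    population.remove(min(population, key=fitness))             # keep population size fixed
print(max(population, key=fitness))
```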