Low Power Autonomous & Smart Systems I – Invited Special Session

Session Type: Lecture
Session Code: A4L-C
Location: Room 3
Date & Time: Wednesday March 22, 2023 (14:00 - 15:00)
Chair: Tinoosh Mohsenin
Track: 12
Papers:

Paper ID: 3099
Paper Name: Neuromorphic Computing for Intelligent Autonomous Control Applications
Authors: Catherine Schuman
Abstract: Neuromorphic computing offers the opportunity for low-power, intelligent autonomous systems. However, effectively leveraging neuromorphic computers requires co-design of hardware, algorithms, and applications. In this talk, I will review our recent work on hardware-application co-design in neuromorphic computing. In particular, I will showcase our uses of neuromorphic computing for real-time, autonomous control tasks, including robotics, autonomous vehicles, and internal combustion engine control.

Paper ID: 3214
Paper Name: A Regression-Focused Fine-Tuning Technique for Low Energy Consumption and Latency
Authors: Arnab Neelim Mazumder, Tinoosh Mohsenin
Abstract: Fine-tuning deep neural networks (DNNs) is pivotal for creating inference modules that can be suitably imported to edge or FPGA (Field-Programmable Gate Array) platforms. Traditionally, the exploration of different parameters throughout the layers of DNNs has been done using grid search and other brute-force techniques. Though these methods lead to the optimal choice of network parameters, the search process can be very time-consuming and may not consider the deployment constraints of the target platform. This work addresses this problem by proposing a regression-focused fine-tuning approach that incorporates well-reasoned hardware-aware regression polynomials for a combination of DNN parameters. Next, we use the generated polynomials to perform fast profiling of different objectives and pinpoint a suitable configuration for deployment. Additionally, we provide details of the intuition behind developing hardware-aware regression polynomials irrespective of the device platform. Our deployments based on the fine-tuning method for physical activity recognition on FPGA indicate at least 5.7x better energy efficiency than recent implementations without any compromise in accuracy. Also, with this process, we observe a 74.3% drop in latency for semantic segmentation of aerial images on the Jetson Nano edge device compared to the baseline implementation.
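
To make the idea of hardware-aware regression surrogates more concrete, below is a minimal sketch, assuming hypothetical profiled measurements and a single tunable parameter (a layer-width multiplier); it is not the authors' code or data. It fits low-degree polynomials to a few on-device latency and energy samples, then uses the cheap polynomial models to sweep candidate configurations against deployment budgets.

```python
# Minimal sketch of regression-based profiling (hypothetical data, not the
# paper's implementation): fit low-degree polynomials to a few measured
# (configuration, latency, energy) samples, then rank candidate DNN
# configurations with the cheap polynomial models instead of re-profiling.
import numpy as np

# Hypothetical on-device measurements for a layer-width multiplier.
widths  = np.array([0.25, 0.50, 0.75, 1.00])
latency = np.array([3.1, 5.8, 9.2, 13.5])   # ms, made-up values
energy  = np.array([0.9, 1.7, 2.8, 4.1])    # mJ, made-up values

# Hardware-aware regression polynomials (degree-2 surrogates).
lat_poly = np.polynomial.Polynomial.fit(widths, latency, deg=2)
eng_poly = np.polynomial.Polynomial.fit(widths, energy, deg=2)

# Fast profiling: sweep candidates and keep the largest width that meets
# the deployment budgets of the target platform.
lat_budget_ms, eng_budget_mj = 8.0, 2.5
candidates = np.linspace(0.25, 1.0, 76)
feasible = [w for w in candidates
            if lat_poly(w) <= lat_budget_ms and eng_poly(w) <= eng_budget_mj]
print("selected width multiplier:", max(feasible) if feasible else None)
```

In the paper's setting the polynomials would cover a combination of DNN parameters and be refit per target device; the single-parameter sweep above is only to keep the sketch short.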

Paper ID: 3094
Paper Name: Memory-Efficient Multi-Task Mixture-of-Experts Transformer: Algorithm and Acceleration
Authors: Rishov Sarkar, Cong Hao
Abstract: The computer vision community is embracing two promising learning paradigms: the Vision Transformer (ViT) and Multi-task Learning (MTL). ViT models show extraordinary performance compared with traditional convolutional networks but are commonly recognized as computation-intensive, especially the self-attention with quadratic complexity. MTL uses one model to infer multiple tasks and has demonstrated considerable success in computer vision applications such as autonomous driving and indoor robotics. MTL usually achieves better performance by enforcing a shared representation among tasks, but a huge drawback is that most MTL regimes require activation of the entire model even when only one or a few tasks are needed, causing significant computing waste. From the algorithm perspective, we first discuss our latest multi-task ViT model that introduces mixture-of-experts (MoE), called M3ViT, where only a small portion of subnetworks ("experts") are sparsely and dynamically activated based on the current task. M3ViT achieves better accuracy with over 80% computation reduction and paves the way for efficient real-time MTL using ViT. From the hardware perspective, despite the algorithmic advantages of MTL, ViT, and even M3ViT, there are still many challenges for efficient deployment on FPGA. For instance, in general Transformer/ViT models, self-attention is known to be computationally intensive and requires high bandwidth. In addition, softmax operations and the GELU activation function are extensively used, which unfortunately can consume more than half of the FPGA's resources (LUTs). In the M3ViT model, the promising MoE mechanism for multi-tasking introduces new challenges in memory access overhead and also increases resource usage because of the additional layer types. To address these challenges in both general Transformer/ViT models and the state-of-the-art multi-task M3ViT with MoE, we propose the first end-to-end FPGA accelerator for multi-task ViT with a rich collection of architectural innovations. We deliver an on-board implementation and measurement on the Xilinx ZCU102 FPGA, with verified functionality and an open-sourced hardware design, achieving 1.56x and 3.40x better energy efficiency compared with a GPU and a CPU, respectively.
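
As a rough illustration of the task-conditioned sparse expert activation described above, here is a minimal PyTorch sketch, not the released M3ViT design: a mixture-of-experts feed-forward block whose per-task gate selects the top-k experts, so only those experts are evaluated. The module name, sizes, and gating scheme are assumptions made for illustration.

```python
# Minimal sketch (hypothetical, not the released M3ViT code): a task-conditioned
# mixture-of-experts feed-forward block where only the top-k experts selected
# by a per-task gate are evaluated, so the remaining experts stay idle.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoEFFN(nn.Module):
    def __init__(self, dim=192, hidden=384, num_experts=8, num_tasks=3, top_k=2):
        super().__init__()
        # Each expert is a small MLP; only a few run per forward pass.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        # Router conditioned on the active task: one gating vector per task.
        self.task_gate = nn.Embedding(num_tasks, num_experts)
        self.top_k = top_k

    def forward(self, x, task_id):
        # x: (tokens, dim); task_id: index of the task being inferred.
        logits = self.task_gate(torch.tensor(task_id))            # (num_experts,)
        weights, idx = torch.topk(F.softmax(logits, dim=-1), self.top_k)
        # Only the selected experts are evaluated; the rest cost no compute.
        out = torch.zeros_like(x)
        for w, i in zip(weights, idx):
            out = out + w * self.experts[int(i)](x)
        return out

tokens = torch.randn(197, 192)      # e.g. ViT patch tokens (placeholder shape)
moe = TaskMoEFFN()
y = moe(tokens, task_id=1)          # activates 2 of 8 experts for this task
print(y.shape)
```

Because only top_k of the num_experts expert MLPs run per forward pass, compute grows with the number of activated experts rather than with the full model, which is the property that creates the memory-access and resource-usage trade-offs the abstract discusses for FPGA deployment.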