A curated collection of academic papers related to foundation models. Papers are grouped by conference or journal and ordered by year of publication. Developers and researchers are welcome to contribute by adding more published papers to this list.
Key Words: foundation model, large-scale models, model training, model inference, pipeline parallelism, model parallelism, tensor parallelism, data parallelism, pre-training, fine-tuning, zero-shot, model compression, data compression, gradient compression, memory footprint reduction, batching, heterogeneous system, distributed system, network architecture
Table of Listed Conferences
- ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP)
- Conference on Neural Information Processing Systems (NeurIPS)
- International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
- North American Chapter of the Association for Computational Linguistics (NAACL)
- ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
- USENIX Symposium on Operating Systems Design and Implementation (OSDI)
- Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Annual Meeting of the Association for Computational Linguistics (ACL)
- Association for the Advancement of Artificial Intelligence (AAAI)
- IEEE International Conference on Computer Communications (INFOCOM)
- IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Table of Listed Journals
- IEEE Transactions on Parallel and Distributed Systems (TPDS)
- ACM Computing Surveys
- Journal of Machine Learning Research
- Transactions on Machine Learning Research (TMLR)
Conferences
PPoPP
[dynamic GPU memory scheduling] Superneurons: dynamic GPU memory management for training deep neural networks. Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska. PPoPP'18
[training on supercomputer] BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores. Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, Junyang Lin, Guanyu Feng, Zeqiang Huang, Jie Gao, Aohan Zeng, Jianwei Zhang, Runxin Zhong, Tianhui Shi, Sha Liu, Weimin Zheng, Jie Tang, Hongxia Yang, Xin Liu, Jidong Zhai, Wenguang Chen. PPoPP'22
[distributed MoE model training] FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models. Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, Qin Li. PPoPP'22
[pipeline parallelism] Elastic Averaging for Efficient Pipelined DNN Training. Zihao Chen, Chen Xu, Weining Qian, Aoying Zhou. PPoPP'23
[sparse attention] Dynamic N:M Fine-grained Structured Sparse Attention Mechanism. Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. PPoPP'23
[failure recovery] POSTER: Swift: Expedited Failure Recovery for Large-scale DNN Training. Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, and Chuan Wu. PPoPP'23
[computation and communication overlap] Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference. Jiangsu Du, Jinhui Wei, Jiazhi Jiang, Shenggan Cheng, Dan Huang, Zhiguang Chen, Yutong Lu. PPoPP'24
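
For readers new to the pipeline-parallelism entries in this list, the snippet below sketches the basic GPipe-style idea of splitting a mini-batch into micro-batches so that pipeline stages overlap in time. It is a minimal illustration under assumed shapes (3 stages, 4 micro-batches, forward passes only), not any paper's scheduler.

```python
# Minimal sketch of a GPipe-style pipeline schedule: a mini-batch is split
# into micro-batches that flow through the stages in a staggered order so
# that different stages work on different micro-batches at the same tick.
def pipeline_schedule(num_stages, num_microbatches):
    # At "clock" tick t, stage s works on micro-batch t - s (forward only).
    schedule = []
    for t in range(num_stages + num_microbatches - 1):
        step = []
        for s in range(num_stages):
            m = t - s
            if 0 <= m < num_microbatches:
                step.append((s, m))          # (stage, micro-batch) active this tick
        schedule.append(step)
    return schedule

for tick, work in enumerate(pipeline_schedule(num_stages=3, num_microbatches=4)):
    print(tick, work)
```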
NIPS
[network architecture] Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin. NIPS'17
[parallel decoding] Blockwise Parallel Decoding for Deep Autoregressive Models. Mitchell Stern, Noam Shazeer, Jakob Uszkoreit. NIPS'18
#pipeline_parallelism GPipe: efficient training of giant neural networks using pipeline parallelism. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen. NIPS'19
[pre-training] Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. NIPS'20
[fine-tuning] COMPACTER:Efficient Low-Rank Hypercomplex Adapter Layers. Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder. NIPS'21
[reinforcement learning] Decision Transformer: Reinforcement Learning via Sequence Modeling. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. NIPS'21
[moe routing] Hash Layers For Large Sparse Models. Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason E Weston. NIPS'21
[moe model] Scaling Vision with Sparse Mixture of Experts. Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby. NIPS'22
[length generalization] Exploring Length Generalization in Large Language Models. Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay V. Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur. NIPS'22
[model compression] XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. NIPS'22
[zero-shot] Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han. NIPS'22
[memory footprint reduction] Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko. NIPS'22
[model compression] ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. NIPS'22
[moe routing] Mixture-of-Experts with Expert Choice Routing. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon. NIPS'22
[moe model] Towards Understanding the Mixture-of-Experts Layer in Deep Learning. Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li. NIPS'22
[moe model] Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs. Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, Jifeng Dai. NIPS'22
[rlhf] Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi. NIPS'23
[rlhf] RRHF: Rank Responses to Align Language Models with Human Feedback without tears. Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang. NIPS'23
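
The Transformer entries above (starting from "Attention Is All You Need") all build on scaled dot-product attention. Below is a minimal PyTorch sketch of that computation; the tensor layout (batch, heads, sequence, head dimension) and function name are illustrative assumptions, not the paper's reference code.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., NIPS'17).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)             # attention distribution
    return weights @ v                              # weighted sum of value vectors

# Example: a single head over a toy sequence.
q = k = v = torch.randn(1, 1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 1, 4, 8])
```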
SC
[adaptive batching] BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching. Ahsan Ali, Riccardo Pinciroli, Feng Yan, Evgenia Smirni. SC'20
[memory optimizations] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. SC'20
#pipeline_parallelism #tensor_parallelism Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia. SC'21
#pipeline_parallelism Chimera: efficiently training large-scale neural networks with bidirectional pipelines. Shigang Li, Torsten Hoefler. SC'21
[heterogeneous system] ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. SC'21
[parallel matrix multiplication] CA3DMM: A New Algorithm Based on a Unified View of Parallel Matrix Multiplication. Hua Huang, Edmond Chow. SC'22
[GNN training] CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs. Qingxiao Sun, Yi Liu, Hailong Yang, Ruizhe Zhang, Ming Dun, Mingzhen Li, Xiaoyan Liu, Wencong Xiao, Yong Li, Zhongzhi Luan, Depei Qian. SC'22
[model inference] #expert_parallelism DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He, Microsoft Corporation. SC'22
[large-scale recommendation model training] EL-Rec: Efficient Large-Scale Recommendation Model Training via Tensor-Train Embedding Table. Zheng Wang, Yuke Wang, Boyuan Feng, Dheevatsa Mudigere, Bharath Muthiah, Yufei Ding. SC'22
[network topology] HammingMesh: A Network Topology for Large-Scale Deep Learning. Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott. SC'22
[accelerate training] LightSeq2: Accelerated Training for Transformer-Based Models on GPUs. Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li. SC'22
[variability in accelerator-rich systems] Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems. Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair and Shivaram Venkataraman. SC'22
[DNN model training] STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training. Xiaoyang Sun, Wei Wang, Shenghao Qiu, Renyu Yang, Songfang Huang, Jie Xu, Zheng Wang. SC'22
[GNN training] WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture. Dongxu Yang, Junhong Liu, Jiaxing Qi, Junjie Lai. SC'22
[pipeline parallelism] Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency. Ziming Liu, Shenggan Cheng, Haotian Zhou, Yang You. SC'23
[elastic training system] EasyScale: Accuracy-consistent Elastic Training for Deep Learning. Mingzhen Li, Wencong Xiao, Biao Sun, Hanyu Zhao, Hailong Yang, Shiru Ren, Zhongzhi Luan, Xianyan Jia, Yi Liu, Yong Li, Wei Lin, Depei Qian. SC'23
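
The ZeRO and ZeRO-Infinity entries above reduce per-GPU memory by partitioning training state across data-parallel ranks. The single-process sketch below illustrates only the ZeRO stage-1 idea (optimizer state sharded per rank); the simulated ranks, the momentum-only optimizer, and the sizes are illustrative assumptions, not the DeepSpeed implementation.

```python
# Minimal single-process sketch of ZeRO stage-1 style optimizer-state sharding:
# each data-parallel rank keeps optimizer state only for its own shard of the
# parameters and updates only that shard; a real run would all-gather the
# updated shards so every rank holds the full parameters again.
import numpy as np

world_size = 4
params = np.random.randn(16).astype(np.float32)   # flattened model parameters
grads = np.random.randn(16).astype(np.float32)    # gradients after all-reduce
shards = np.array_split(np.arange(params.size), world_size)

# Per-rank optimizer state (momentum) exists only for that rank's shard,
# so each rank stores roughly 1/world_size of the optimizer memory.
momentum = {r: np.zeros(len(shards[r]), dtype=np.float32) for r in range(world_size)}

lr, beta = 0.01, 0.9
for rank in range(world_size):                     # simulate the ranks in a loop
    idx = shards[rank]
    momentum[rank] = beta * momentum[rank] + grads[idx]   # update local state
    params[idx] -= lr * momentum[rank]                    # update local shard only

print(params.shape)
```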
ATC
[fine tuning] Cavs: An Efficient Runtime System for Dynamic Neural Networks. Shizhen Xu, Carnegie Mellon University, Tsinghua University; Hao Zhang, Graham Neubig, and Wei Dai, Carnegie Mellon University, Petuum Inc.; Jin Kyu Kim, Carnegie Mellon University; Zhijie Deng, Tsinghua University; Qirong Ho, Petuum Inc.; Guangwen Yang, Tsinghua University; Eric P. Xing, Petuum Inc. ATC'18
[cpu speedup] DeepCPU: Serving RNN-based Deep Learning Models 10x Faster. Minjia Zhang, Samyam Rajbhandari, Wenhan Wang Yuxiong He. ATC'18
[network resource share] DynaMix: Dynamic Mobile Device Integration for Efficient Cross-device Resource Sharing. Dongju Chae, POSTECH; Joonsung Kim and Gwangmu Lee, Seoul National University; Hanjun Kim, POSTECH; Kyung-Ah Chang and Hyogun Lee, Samsung Electronics; Jangwoo Kim, Seoul National University. ATC'18
[IO mitigating in vCPU+cpu resource share] Effectively Mitigating I/O Inactivity in vCPU Scheduling. Weiwei Jia, The University of Hong Kong, New Jersey Institute of Technology; Cheng Wang and Xusheng Chen, The University of Hong Kong; Jianchen Shan and Xiaowei Shang, New Jersey Institute of Technology; Heming Cui, The University of Hong Kong; Xiaoning Ding, New Jersey Institute of Technology; Luwei Cheng, Facebook; Francis C. M. Lau and Yuexuan Wang, The University of Hong Kong; Yuangang Wang, Huawei. ATC'18
[ML distributed inference] Litz: Elastic Framework for High-Performance Distributed Machine Learning. Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Abutalib Aghayev, Carnegie Mellon University; Weiren Yu, Petuum, Inc. and Beihang University; Haoyang Chen and Qirong Ho, Petuum, Inc.; Garth A. Gibson, Carnegie Mellon University and Vector Institute; Eric P. Xing, Petuum, Inc. and Carnegie Mellon University. ATC'18
[finetuning] Mainstream: Dynamic Stem-Sharing for Multi-Tenant Video Processing. Angela H. Jiang, Daniel L.K. Wong, Christopher Canel, Lilia Tang, and Ishan Misra, Carnegie Mellon University; Michael Kaminsky, Michael A. Kozuch, and Padmanabhan Pillai, Intel Labs; David G. Andersen and Gregory R. Ganger, Carnegie Mellon University. ATC'18
[instance placement] Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model. Justin Funston, Maxime Lorrillere, and Alexandra Fedorova, University of British Columbia; Baptiste Lepers, EPFL; David Vengerov and Jean-Pierre Lozi, Oracle Labs; Vivien Quéma, IMAG. ATC'18
[sparse matrix operation] Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs. Yanhao Chen and Ari B. Hayes, Rutgers University; Chi Zhang, University of Pittsburgh; Timothy Salmon and Eddy Z. Zhang, Rutgers University. ATC'18
[data compression] TerseCades: Efficient Data Compression in Stream Processing. Gennady Pekhimenko, University of Toronto; Chuanxiong Guo, Bytedance Inc.; Myeongjae Jeon, Microsoft Research; Peng Huang, Johns Hopkins University; Lidong Zhou, Microsoft Research. ATC'18
[spot resource usage] Tributary: spot-dancing for elastic services with latency SLOs. Aaron Harlap and Andrew Chung, Carnegie Mellon University; Alexey Tumanov, UC Berkeley; Gregory R. Ganger and Phillip B. Gibbons, Carnegie Mellon University. ATC'18
[load balancing] The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS. Justinien Bouron, Sebastien Chevalley, Baptiste Lepers, and Willy Zwaenepoel, EPFL; Redha Gouicem, Julia Lawall, Gilles Muller, and Julien Sopena, Sorbonne University/Inria/LIP6. ATC'18
[GPU analysis] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. Myeongjae Jeon, UNIST and Microsoft Research; Shivaram Venkataraman, University of Wisconsin and Microsoft Research; Amar Phanishayee and Junjie Qian, Microsoft Research; Wencong Xiao, Beihang University and Microsoft Research; Fan Yang, Microsoft Research. ATC'19
[model inference] Optimizing CNN Model Inference on CPUs. Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang, Amazon. ATC'19
[schedule in CPU-GPU framework] FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures. Feng Zhang and Lin Yang, Renmin University of China; Shuhao Zhang, Technische Universität Berlin and National University of Singapore; Bingsheng He, National University of Singapore; Wei Lu and Xiaoyong Du, Renmin University of China. ATC'20
[offload workload between CPU and GPU] Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads. Gina Yuan, Shoumik Palkar, Deepak Narayanan, and Matei Zaharia, Stanford University. ATC'20
[DNN deployment system] ALERT: Accurate Learning for Energy and Timeliness. Chengcheng Wan, Muhammad Santriaji, Eri Rogers, Henry Hoffmann, Michael Maire, and Shan Lu, University of Chicago. ATC'20
[DNN running time predict] Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training. Hongyu Zhu, University of Toronto & Vector Institute; Amar Phanishayee, Microsoft Research; Gennady Pekhimenko, University of Toronto & Vector Institute. ATC'20
[DNN traning in GPU] HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, and Seungmin Lee, UNIST; Jaesik Choi, KAIST; Sam H. Noh and Young-ri Choi, UNIST. ATC'20
[multi DNN deployment platform] NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems. Soroush Bateni and Cong Liu, University of Texas at Dallas. ATC'20
#pipeline_parallelism Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. Saar Eliad, Ido Hakimi, and Alon De Jagger, Department of Computer Science, Technion - Israel Institute of Technology; Mark Silberstein, Department of Computer Science and Department of Electrical Engineering, Technion - Israel Institute of Technology; Assaf Schuster, Department of Computer Science, Technion - Israel Institute of Technology. ATC'21
[model-less inference serving] INFaaS: Automated Model-less Inference Serving. Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis, Stanford University. ATC'21
[model running time predict] Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. Geoffrey X. Yu, University of Toronto/Vector Institute; Yubo Gao, University of Toronto; Pavel Golikov and Gennady Pekhimenko, University of Toronto/Vector Institute. ATC'21
[giant model training by offloading data and compute to cpu] ZeRO-Offload: Democratizing Billion-Scale Model Training. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. ATC'21
[ML training in GPU with GPU share] Zico: Efficient GPU Memory Sharing for Concurrent DNN Training. Gangmuk Lim, UNIST; Jeongseob Ahn, Ajou University; Wencong Xiao, Alibaba Group; Youngjin Kwon, KAIST; Myeongjae Jeon, UNIST. ATC'21
[gaint model training in GPU] Whale: Efficient Giant Model Training over Heterogeneous GPUs. Xianyan Jia, Le Jiang, Ang Wang, and Wencong Xiao, Alibaba Group; Ziji Shi, National University of Singapore & Alibaba Group; Jie Zhang, Xinyuan Li, Langshi Chen, Yong Li, Zhen Zheng, Xiaoyong Liu, and Wei Lin, Alibaba Group. ATC'22
[ML failure recover in GPU] Sibylla: To Retry or Not To Retry on Deep Learning Job Failure. Taeyoon Kim, Suyeon Jeong, Jongseop Lee, Soobee Lee, and Myeongjae Jeon, UNIST. ATC'22
[mixed precision training] Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training. Xin He, CSEE, Hunan University & Xidian University; Jianhua Sun and Hao Chen, CSEE, Hunan University; Dong Li, University of California, Merced. ATC'22
[DNN in GPU with batch] DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. Weihao Cui, Han Zhao, Quan Chen, Hao Wei, and Zirui Li, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Chao Li and Minyi Guo, Shanghai Jiao Tong University. ATC'22
[transformer in GPU] Faith: An Efficient Framework for Transformer Verification on GPUs. Boyuan Feng, Tianqi Tang, Yuke Wang, Zhaodong Chen, Zheng Wang, Shu Yang, Yuan Xie, and Yufei Ding, University of California, Santa Barbara. ATC'22
[transformer inference] PetS: A Unified Framework for Parameter-Efficient Transformers Serving. Zhe Zhou, Peking University; Xuechao Wei, Peking University and Alibaba Group; Jiejing Zhang, Alibaba Group; Guangyu Sun, Peking University. ATC'22
[GPU share] PilotFish: Harvesting Free Cycles of Cloud Gaming with Deep Learning Training. Wei Zhang and Binghao Chen, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research Asia; Quan Chen, Shanghai Jiao Tong University; Peng Cheng, Fan Yang, Ran Shu, and Yuqing Yang, Microsoft Research; Minyi Guo, Shanghai Jiao Tong University. ATC'22
[temporal sharing-GPU share with SLO] Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing. Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh, KAIST. ATC'22
[accelerate moe] Accelerating Distributed MoE Training and Inference with Lina. Jiamin Li, City University of Hong Kong; Yimin Jiang, ByteDance Inc.; Yibo Zhu, unaffiliated; Cong Wang, City University of Hong Kong; Hong Xu, The Chinese University of Hong Kong. ATC'23
[accelerating billion-scale GNN training] Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training. Jie Sun, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Li Su, Alibaba Group; Zuocheng Shi, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Wenting Shen, Alibaba Group; Zeke Wang, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Lei Wang, Alibaba Group; Jie Zhang, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Yong Li, Wenyuan Yu, and Jingren Zhou, Alibaba Group; Fei Wu, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China and Shanghai Institute for Advanced Study of Zhejiang University, China. ATC'23
[auto-parallelization in moe] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, Jidong Zhai. ATC'23
[GPU fragmentation] Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent. Qizhen Weng and Lingyun Yang, Hong Kong University of Science and Technology; Yinghao Yu, Alibaba Group and Hong Kong University of Science and Technology; Wei Wang, Hong Kong University of Science and Technology; Xiaochuan Tang, Guodong Yang, and Liping Zhang, Alibaba Group. ATC'23
[multiple DNN on device] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices. Hsin-Hsuan Sung, Jou-An Chen, Wei Niu, Jiexiong Guan, Bin Ren, Xipeng Shen. ATC'23
ASPLOS
[GPU memory management] Capuchin: Tensor-based GPU Memory Management for Deep Learning. Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, Xuehai Qian. ASPLOS'20
[swapping between GPU and CPU memory] SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping. Chien-Chin Huang, Gu Jin, Jinyang Li. ASPLOS'20
[distributed supernet training system] NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism. Shixiong Zhao, Fanxin Li, Xusheng Chen, Tianxiang Shen, Li Chen, Sen Wang, Nicholas Zhang, Cheng Li, Heming Cui. ASPLOS'22
[deep learning workload scheduler] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, Tianwei Zhang. ASPLOS'23
[distributed training framework] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. Jaeyong Song, Jinkyu Yim, Jaewon Jung, Hongsun Jang, Hyung-Jin Kim, Youngsok Kim, Jinho Lee. ASPLOS'23
[fine Tuning] Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers. Yangyang Feng, Minhui Xie, Zijie Tian, Shuo Wang, Youyou Lu, Jiwu Shu. ASPLOS'23
[GNN training] Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning. Shuangyan Yang, Minjia Zhang, Wenqian Dong, Dong Li. ASPLOS'23
[overlap communication with computation] Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. ASPLOS'23
[tensor management] DeepUM: Tensor Migration and Prefetching in Unified Memory. Jaehoon Jung, Jinpyo Kim, Jaejin Lee. ASPLOS'23
[tensor fusion] FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks. Sheng-Chun Kao, Suvinay Subramanian, Gaurav Agrawal, Amir Yazdanbakhsh, Tushar Krishna. ASPLOS'23
[communication overlap] T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair. ASPLOS'24
[speculative inference] SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia. ASPLOS'24
[inference system] ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, Jiwon Seo. ASPLOS'24
[inference system] Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling. Sohaib Ahmad, Hui Guan, Brian D. Friedman, Ramesh K. Sitaraman, Thomas Woo. ASPLOS'24
[preemptible inference system] SpotServe: Serving Generative Large Language Models on Preemptible Instances. Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia. ASPLOS'24
[pipeline parallelism] AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning. Zhenbo Sun, Huanqi Cao, Yuanwei Wang, Guanyu Feng, Shengqi Chen, Haojie Wang, Wenguang Chen. ASPLOS'24
[memory optimization] MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN. Renze Chen, Zijian Ding, Size Zheng, Chengrui Zhang, Jingwen Leng, Xuanzhe Liu, Yun Liang. ASPLOS'24
[tensor partitioning] PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training. Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yuming Li, Yinhe Han. ASPLOS'24
[layout transformation elimination] SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile. Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren. ASPLOS'24
[collective communication library] TCCL: Discovering Better Communication Paths for PCIe GPU Clusters. Heehoon Kim, Junyeol Ryu, Jaejin Lee. ASPLOS'24
NAACL-HLT
[language representation model] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. NAACL-HLT'19
ICML
[fine-tuning] BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. Asa Cooper Stickland, Iain Murray. ICML'19
[fine-tuning] Parameter-Efficient Transfer Learning for NLP. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly. ICML'19
[pipeline-parallel] Memory-Efficient Pipeline-Parallel DNN Training. Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia. ICML'21
[scale language models with MoE] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui. ICML'22
[MoE training and inference] #expert_parallelism DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi. ICML'22
[robustness to attack] Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations. Mohammad Mahmudul Alam, Edward Raff, Tim Oates, James Holt. ICML'22
[transfer learning] Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer. Lucas N. Alegre, Ana L. C. Bazzan, Bruno C. da Silva. ICML'22
[explainable ai] XAI for Transformers: Better Explanations through Conservative Propagation. Ameen Ali, Thomas Schnake, Oliver Eberle, Gregoire Montavon, Klaus Robert Muller, Lior Wolf. ICML'22
[LLM serving system] Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias. ICML'23
[offload in inference] FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang. ICML'23
[KV cache eviction policy] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen. ICML'23
[pipeline parallelism] BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models. Taebum Kim, Hyoungjoo Kim, Gyeong-In Yu, Byung-Gon Chun. ICML'23
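
The speculative-decoding entry above ("Fast Inference from Transformers via Speculative Decoding") relies on a draft-then-verify step: a cheap draft model proposes a token and the target model either accepts it or resamples from the residual distribution. The sketch below shows only that accept/reject rule with toy distributions; the distributions and naming are illustrative assumptions, not a real model pair.

```python
# Minimal sketch of the accept/reject step used in speculative decoding.
import numpy as np

rng = np.random.default_rng(0)

def verify_draft_token(p_target, q_draft, proposed):
    """Return the token to emit, given the token proposed by the draft model."""
    accept_prob = min(1.0, p_target[proposed] / q_draft[proposed])
    if rng.random() < accept_prob:
        return proposed                      # draft token accepted "for free"
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()               # renormalize the leftover probability mass
    return rng.choice(len(p_target), p=residual)

p = np.array([0.6, 0.3, 0.1])                # target model distribution (toy)
q = np.array([0.2, 0.7, 0.1])                # cheaper draft model distribution (toy)
proposed = rng.choice(3, p=q)
print(verify_draft_token(p, q, proposed))
```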
EUROSYS
[graph sampling on GPUs] Accelerating Graph Sampling for Graph Machine Learning using GPUs. Abhinav Jangda, Sandeep Polisetty, Arjun Guha, Marco Serafini. EUROSYS'21
[distributed, fair share scheduler] Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning. Shubham Chaudhary, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra. EUROSYS'21
[distributed framework for GNN training] FlexGraph: A Flexible and Efficient Distributed Framework for GNN Training. Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, Jingren Zhou. EUROSYS'21
#pipeline_parallelism [giant model training in cluster] Varuna: scalable, low-cost training of massive deep learning models. Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, Nipun Kwatra. EUROSYS'22
[GPU resource usage] GNNLab: A Factored System for Sample-based GNN Training over GPUs. Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, Jingren Zhou. EUROSYS'22
[GPU-resident cache] Fleche: An Efficient GPU Embedding Cache for Personalized Recommendations. Minhui Xie, Youyou Lu, Jiazhen Lin, Qing Wang, Jian Gao, Kai Ren, Jiwu Shu. EUROSYS'22
[training DNN in GPU] #pipeline_parallelism Out-Of-Order BackProp: An Effective Scheduling Technique for Deep Learning. Hyungjun Oh, Junyeol Lee, Hyeongju Kim, Jiwon Seo. EUROSYS'22
[inference system] Tabi: An Efficient Multi-Level Inference System for Large Language Models. Yiding Wang, Kai Chen, Haisheng Tan, Kun Guo. EUROSYS'23
[serving with direct-host-access] Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access. Jinwoo Jeong, Seungsu Baek, Jeongseob Ahn. EUROSYS'23
[gradient compression] Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies. Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng. EUROSYS'23
[memory-efficient training] Accordion: Memory-Efficient DNN Training Using Adaptive Local Learning. Dhananjay Saikumar, Blesson Varghese. EUROSYS'24
[multi-task training] DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines. Chenyu Jiang, Zhen Jia, Shuai Zheng, Yida Wang, Chuan Wu. EUROSYS'24
[auto parallelism] Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation. Guodong Liu, Youshan Miao, Zhiqi Lin, Xiaoxiang Shi, Saeed Maleki, Fan Yang, Yungang Bao, Sa Wang. EUROSYS'24
[moe training system] ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling. Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, Xiaowen Chu. EUROSYS'24
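
The gradient-compression entry above (Espresso) works on top of compressors such as top-k sparsification. The sketch below shows a generic top-k compress/decompress pair with error feedback; the function names and the plain top-k rule are illustrative assumptions, not Espresso's strategy selection.

```python
# Minimal sketch of top-k gradient sparsification: only the k largest-magnitude
# gradient entries are communicated, and the rest are kept locally as residual
# error to be added back before the next step.
import torch

def topk_compress(grad, k):
    flat = grad.flatten()
    _, idx = flat.abs().topk(k)              # indices of the k largest entries
    return idx, flat[idx]                    # what would be sent on the wire

def topk_decompress(idx, values, shape):
    flat = torch.zeros(shape).flatten()
    flat[idx] = values
    return flat.reshape(shape)

grad = torch.randn(8, 8)
idx, vals = topk_compress(grad, k=6)
approx = topk_decompress(idx, vals, grad.shape)
residual = grad - approx                     # error feedback for the next round
print(idx.shape, approx.count_nonzero())
```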
KDD
[large model training] DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, Yuxiong He. KDD'20
OSDI
[distributed DNN training] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters. Yimin Jiang, Tsinghua University and ByteDance; Yibo Zhu, ByteDance; Chang Lan, Google; Bairen Yi, ByteDance; Yong Cui, Tsinghua University; Chuanxiong Guo, ByteDance. osdi'20
[dynamic scaling on GPU clusters] AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia, Alibaba Group. osdi'20
[CPU scheduler] Caladan: Mitigating Interference at Microsecond Timescales. Joshua Fried and Zhenyuan Ruan, MIT CSAIL; Amy Ousterhout, UC Berkeley; Adam Belay, MIT CSAIL. osdi'20
[heterogeneity aware scheduler] Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. Deepak Narayanan and Keshav Santhanam, Stanford University and Microsoft Research; Fiodar Kazhamiaka, Stanford University; Amar Phanishayee, Microsoft Research; Matei Zaharia, Stanford University. osdi'20
[framework to share a GPU cluster safely] HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees. Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Fan Yang, Lidong Zhou, Mao Yang, Francis C.M. Lau, Yuqi Wang, Yifan Xiong, Bin Wang. osdi'20
[adaptive training] KungFu: Making Training in Distributed Machine Learning Adaptive. Luo Mai, Guo Li, Marcel Wagenländer, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch, Imperial College London. osdi'20
[rack-scale computer scheduler] RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers. Hang Zhu, Johns Hopkins University; Kostis Kaffes, Stanford University; Zixu Chen, Johns Hopkins University; Zhenming Liu, College of William and Mary; Christos Kozyrakis, Stanford University; Ion Stoica, UC Berkeley; Xin Jin, Johns Hopkins University. osdi'20
[distributed model serving system] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson. osdi'20
[distributed system for training GNNs] Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. John Thorpe, Yifan Qiao, Jonathan Eyolfson, and Shen Teng, UCLA; Guanzhou Hu, UCLA and University of Wisconsin, Madison; Zhihao Jia, CMU; Jinliang Wei, Google Brain; Keval Vora, Simon Fraser; Ravi Netravali, Princeton University; Miryung Kim and Guoqing Harry Xu, UCLA. osdi'21
[GNN acceleration] GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding, University of California, Santa Barbara. osdi'21
[embeddings of large-scale graphs] Marius: Learning Massive Graph Embeddings on a Single Machine. Jason Mohoney and Roger Waleffe, University of Wisconsin-Madison; Henry Xu, University of Maryland, College Park; Theodoros Rekatsinas and Shivaram Venkataraman, University of Wisconsin-Madison. osdi'21
[scheduling in deep learning clusters] Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning. Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Sang Keun Choe and Suhas Jayaram Subramanya, Carnegie Mellon University; Willie Neiswanger, Petuum, Inc. and Carnegie Mellon University; Qirong Ho, Petuum, Inc.; Hao Zhang, Petuum, Inc. and UC Berkeley; Gregory R. Ganger, Carnegie Mellon University; Eric P. Xing, MBZUAI, Petuum, Inc., and Carnegie Mellon University. osdi'21
[distributed serving system] Orca: A Distributed Serving System for Transformer-Based Generative Models. Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University. osdi'22
[recommender system] Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update. Chijun Sima, Tencent; Yao Fu and Man-Kit Sit, The University of Edinburgh; Liyi Guo, Xuri Gong, Feng Lin, Junyu Wu, Yongsheng Li, and Haidong Rong, Tencent; Pierre-Louis Aublin, IIJ research laboratory; Luo Mai, The University of Edinburgh. osdi'22
[resource sensitive scheduler for shared GPU clusters] Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters. Jayashree Mohan, Amar Phanishayee, and Janardhan Kulkarni, Microsoft Research; Vijay Chidambaram, The University of Texas at Austin and VMware Research. osdi'22
[distributed DNN training] Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization. Colin Unger, Stanford University; Zhihao Jia, Carnegie Mellon University and Meta; Wei Wu, Los Alamos National Laboratory and NVIDIA; Sina Lin, Microsoft; Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, Alex Aiken. osdi'22
[DNN inference] Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. Mingcong Han, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; Hanze Zhang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China. osdi'22
[device-cloud collaborative machine learning] Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning. Chengfei Lv, Zhejiang University and Alibaba Group; Chaoyue Niu, Shanghai Jiao Tong University and Alibaba Group; Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, and Guohuan Xu, Alibaba Group; Fei Wu, Zhejiang University; Shaojie Tang, University of Texas at Dallas; Fan Wu and Guihai Chen, Shanghai Jiao Tong University. osdi'22
#hybrid_parallelism Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. Lianmin Zheng, Zhuohan Li, and Hao Zhang, UC Berkeley; Yonghao Zhuang, Shanghai Jiao Tong University; Zhifeng Chen and Yanping Huang, Google; Yida Wang, Amazon Web Services; Yuanzhong Xu, Google; Danyang Zhuo, Duke University; Eric P. Xing, MBZUAI and Carnegie Mellon University; Joseph E. Gonzalez and Ion Stoica, UC Berkeley. osdi'22
[model parallelism] AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. Zhuohan Li and Lianmin Zheng, UC Berkeley; Yinmin Zhong, Peking University; Vincent Liu, University of Pennsylvania; Ying Sheng, Stanford University; Xin Jin, Peking University; Yanping Huang and Zhifeng Chen, Google; Hao Zhang, UC San Diego; Joseph E. Gonzalez and Ion Stoica, UC Berkeley. osdi'23
[deep learning compiler] Welder: Scheduling Deep Learning Memory Access via Tile-graph. Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, Lidong Zhou. osdi'23
[scheduling DNN computational graphs] Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators. Jie Zhao, Siyuan Feng, Xiaoqiang Dan, Fei Liu, Chengke Wang, Sheng Yuan, Wenyuan Lv, and Qikai Xie. osdi'23
[inference latency] ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models. Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai. osdi'24
[inference latency] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang. osdi'24
[inference latency] Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee. osdi'24
[fair inference] Fairness in Serving Large Language Models. Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica. osdi'24
SOSP
#pipeline_parallelism PipeDream: generalized pipeline parallelism for DNN training. Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, Matei Zaharia. SOSP'19
[gradient compression] Gradient Compression Supercharged High-Performance Data Parallel DNN Training. Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, Yinlong Xu. SOSP'21
[sharing of KV cache] Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica. SOSP'23
[checkpoint] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, Yida Wang. SOSP'23
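
The PagedAttention entry above manages the KV cache in fixed-size blocks with a per-sequence block table, so GPU memory is allocated on demand and freed per block. The sketch below illustrates only that bookkeeping idea; the class, block size, and data structures are illustrative assumptions, not vLLM's implementation.

```python
# Minimal sketch of a paged KV-cache block table: logical token positions of a
# sequence are mapped to physical blocks taken from a shared free pool.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.seq_lens = {}                           # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve space for one more KV entry of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or none yet)
            table.append(self.free_blocks.pop())     # grab a new physical block
        self.seq_lens[seq_id] = length + 1
        block = table[length // BLOCK_SIZE]
        return block, length % BLOCK_SIZE            # where the KV vectors would go

    def free_sequence(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                                  # 20 tokens -> 2 blocks in use
    slot = cache.append_token(seq_id=0)
print(cache.block_tables[0], slot)
```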
ICPP
#tensor_parallelism Tesseract: Parallelize the Tensor Parallelism Efficiently. Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You. ICPP'22
[multiple inference tasks sharing single GPU] SPLIT: QoS-Aware DNN Inference on Shared GPU via Evenly-Sized Model Splitting. Diaohan Luo, Tian Yu, Yuewen Wu, Heng Wu, Tao Wang, Wenbo Zhang. ICPP'23
[efficient all-reduce] Wrht: Efficient All-reduce for Distributed DNN Training in Optical Interconnect Systems. Fei Dai, Yawen Chen, Zhiyi Huang, Haibo Zhang. ICPP'23
[cpu offload] CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel. Zhenxing Li, Qiang Cao, Yajie Chen, Wenrui Yan. ICPP'23
[efficient communication in ddl] OSP: Boosting Distributed Model Training with 2-stage Synchronization. Zixuan Chen, Lei Shi, Xuandong Liu, Jiahui Li, Sen Liu, Yang Xu. ICPP'23
[automatic parallelization] Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You. ICPP'23
EMNLP
[fine-tuning] AdapterHub: A Framework for Adapting Transformers. Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, Iryna Gurevych. EMNLP'20
[fine-tuning] The Power of Scale for Parameter-Efficient Prompt Tuning. Brian Lester, Rami Al-Rfou, Noah Constant. EMNLP'21
ACL
[fine-tuning] Prefix-Tuning: Optimizing Continuous Prompts for Generation. Xiang Lisa Li, Percy Liang. ACL'21
[fine-tuning] Making Pre-trained Language Models Better Few-shot Learners. Ziyun Xu, Chengyu Wang, Minghui Qiu, Fuli Luo, Runxin Xu, Songfang Huang, Jun Huang. ACL'21
[fine-tuning] Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks. Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson. ACL'21
[fine-tuning] MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators. Zhixing Tan, Xiangwen Zhang, Shuo Wang, Yang Liu. ACL'22
[fine-tuning] Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification. Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Jingang Wang, Juanzi Li, Wei Wu, Maosong Sun. ACL'22
[routing strategy for MoE] StableMoE: Stable Routing Strategy for Mixture of Experts. Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, Furu Wei. ACL'22
[large language model] What Language Model to Train if You Have One Million GPU Hours?. Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, Iz Beltagy. ACL'22
ICLR
[moe model] Learning Factored Representations in a Deep Mixture of Experts. David Eigen, Marc'Aurelio Ranzato, Ilya Sutskever. ICLR'14
[moe model] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Noam Shazeer, *Azalia Mirhoseini, *Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. ICLR'17
[large mini-batches] Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes. Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh. ICLR'20
[scaling giant models] #expert_parallelism GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. ICLR'21
[transformer for image recognition] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. ICLR'21
[expert-based model] Taming Sparsely Activated Transformer with Stochastic Experts. Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Tuo Zhao, Jianfeng Gao. ICLR'22
[large language model] GLM-130B: An Open Bilingual Pre-trained Model. Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, Jie Tang. ICLR'23
[transformer block] Brainformers: Trading Simplicity for Efficiency. Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean. ICLR'23
[rlhf] Safe RLHF: Safe Reinforcement Learning from Human Feedback. Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. ICLR'24
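
Several MoE entries above (the sparsely-gated MoE layer, GShard, the stochastic-experts paper) use a gating network that routes each token to a few experts. The sketch below shows a plain top-k router with renormalized gate weights; the sizes, the loop-based dispatch, and the absence of load-balancing losses are illustrative assumptions, not any of those papers' implementations.

```python
# Minimal sketch of top-k expert routing: a gate scores experts per token,
# only the top-k experts run, and their outputs are combined with the
# renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=32, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.gate(x)                  # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch tokens to their experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 32)
print(TopKMoE()(tokens).shape)                 # torch.Size([10, 32])
```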
VLDB
#data_parallelism PyTorch distributed: experiences on accelerating data parallel training. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala. VLDB'20
#data_parallelism PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. VLDB'23
#hybrid_parallelism Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui. VLDB'22
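
The PyTorch DDP entry above describes the standard data-parallel pattern: every process holds a model replica and gradients are averaged across processes during backward. The sketch below shows minimal real DDP usage; the launch details (torchrun, rank environment variables) and the toy model are assumed and elided.

```python
# Minimal sketch of data-parallel training with PyTorch DistributedDataParallel:
# DDP hooks gradient all-reduce into backward, so each replica sees averaged
# gradients before the optimizer step.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")          # torchrun sets rank/world size
    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)                           # wraps the local replica
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                                  # gradients averaged across ranks
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # e.g. launched with `torchrun --nproc_per_node=2 this_file.py`
```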
NSDI
[workload analysis and scheduling] MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters. Qizhen Weng, Hong Kong University of Science and Technology and Alibaba Group; Wencong Xiao, Alibaba Group; Yinghao Yu, Alibaba Group and Hong Kong University of Science and Technology; Wei Wang, Hong Kong University of Science and Technology; Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding, Alibaba Group. nsdi'22
[checkpoint] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, and Misha Smelyanskiy, Facebook; Murali Annavaram, Facebook and USC. nsdi'22
[checkpoint] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA. nsdi'23
AAAI
[model framework] Go Wider Instead of Deeper. Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You. AAAI'22
ICCV
[vision transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. ICCV'21
FAST
[offloading data to SSD] FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. Jonghyun Bae, Seoul National University; Jongsung Lee, Seoul National University and Samsung Electronics; Yunho Jin and Sam Son, Seoul National University; Shine Kim, Seoul National University and Samsung Electronics; Hakbeom Jang, Samsung Electronics; Tae Jun Ham and Jae W. Lee, Seoul National University. FAST'21
[checkpoint] CheckFreq: Frequent, Fine-Grained DNN Checkpointing. Jayashree Mohan, UT Austin; Amar Phanishayee, Microsoft Research; Vijay Chidambaram, UT Austin and VMware research. FAST'21
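
The checkpointing entries above (CheckFreq and the checkpoint papers in other sections) all revolve around periodically snapshotting model and optimizer state so training can resume after a failure. The sketch below shows only that basic pattern with a fixed frequency; the function name, frequency, and file naming are illustrative assumptions, not CheckFreq's adaptive tuning or pipelined snapshotting.

```python
# Minimal sketch of periodic, iteration-level checkpointing: every `freq`
# steps the model and optimizer state are written to disk so training can
# resume from the most recent checkpoint after a failure.
import torch

def maybe_checkpoint(step, model, optimizer, freq=100, path_fmt="ckpt_{:06d}.pt"):
    if step % freq != 0:
        return
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path_fmt.format(step),
    )

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
maybe_checkpoint(step=100, model=model, optimizer=opt)  # writes ckpt_000100.pt
```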
ISCA
[GPU memory expansion] Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs. Esha Choukse, Michael B. Sullivan, Mike O’Connor, Mattan Erez, Jeff Pool, David Nellans, Stephen W. Keckler. ISCA'20
HPCA
[tensor management] Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning. Jie Ren, Jiaolin Luo, Kai Wu, Minjia Zhang, Hyeran Jeon, Dong Li. HPCA'21
[memory-saving] MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism. Quan Zhou, Haiquan Wang, Xiaoyan Yu, Cheng Li, Youhui Bai, Feng Yan, Yinlong Xu. HPCA'23
[alleviate PCIe channel contention] Tensor Movement Orchestration in Multi-GPU Training Systems. Shao-Fu Lin, Yi-Jung Chen, Hsiang-Yun Cheng, Chia-Lin Yang. HPCA'23
[operators fusion] Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion. Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, Yun Liang. HPCA'23
[reducing computation complexity in attention] CTA: Hardware-Software Co-design for Compressed Token Attention Mechanism. Haoran Wang, Haobo Xu, Ying Wang, Yinhe Han. HPCA'23
[unified virtual memory] Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. Bingyao Li, Jieming Yin, Anup Holey, Youtao Zhang, Jun Yang, Xulong Tang. HPCA'23
[boosting the inference efficiency of ViTs] ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention. Jyotikrishna Dass, Shang Wu, Huihong Shi, Chaojian Li, Zhifan Ye, Zhongfeng Wang, Yingyan Lin. HPCA'23
INFOCOM
[hide the communication] PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining. Shaohuai Shi, Xinglin Pan, Xiaowen Chu, Bo Li. INFOCOM'23
IPDPS
[accelerate MoE training] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism. Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng. IPDPS'23
An Efficient 2D Method for Training Super-Large Deep Learning Models. Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You. IPDPS'23
MLSys
[memory reuse] Safe Optimized Static Memory Allocation for Parallel Deep Learning. Ioannis Lamprou, Zhen Zhang, Javier de Juan, Hang Yang, Yongqiang Lai, Etienne Filhol, Cedric Bastoul. MLSys'23
[moe kernel] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia. MLSys'23
[gpu-cpu memory swap] μ-TWO: 3× Faster Multi-Model Training with Orchestration and Memory Optimization. Sanket Purandare, Abdul Wasay, Stratos Idreos. MLSys'23
[communication system] On Optimizing the Communication of Model Parallelism. Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang. MLSys'23
[parameter partitioning] Efficiently Scaling Transformer Inference. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean. MLSys'23
[operator fusion] Transcending runtime-memory tradeoffs in checkpointing by being fusion aware. Horace He, Shangdi Yu. MLSys'23
[pipeline parallelism] Breadth-First Pipeline Parallelism. Joel Lamy-Poirier. MLSys'23
[adaptive parallelism and pipelining] Tutel: Adaptive Mixture-of-Experts at Scale. Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong. MLSys'23
SIGCOMM
[moe all2all] Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models. Juncai Liu, Jessie Hui Wang, Yimin Jiang. SIGCOMM'23
Journals
TPDS
[parallel training] PatrickStar: Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management. Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, Yang You. TPDS'23
[3D parallel] Merak: An Efficient Distributed DNN Training Framework With Automated 3D Parallelism for Giant Foundation Models. Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo Duan, Linbo Qiao, Dongsheng Li. TPDS'23
ACM Computing Surveys
[fine-tuning] Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. ACM Computing Surveys'23
Journal of Machine Learning Research
[training with lower precision] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. William Fedus, Barret Zoph, Noam Shazeer. Journal of Machine Learning Research'22
TMLR
[rlhf] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang. TMLR'23
arxiv
[inference on edge device] Once-for-All: Train One Network and Specialize it for Efficient Deployment. Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han. 19
[checkpoint] On Efficient Constructions of Checkpoints. Yu Chen, Zhenming Liu, Bin Ren, Xin Jin. 20
[checkpoint] A Study of Checkpointing in Large Scale Training of Deep Neural Networks. Elvis Rojas, Albert Njoroge Kahira, Esteban Meneses, Leonardo Bautista Gomez, Rosa M Badia. 20
[checkpoint] ECRM: Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding. Kaige Liu, Jack Kosaian, K. V. Rashmi. 21
[moe model] BASE Layers: Simplifying Training of Large, Sparse Models. Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer. 21
[moe model] Learning Large-scale Universal User Representation with Sparse Mixture of Experts. Caigao Jiang, Siqiao Xue, James Zhang, Lingyue Liu, Zhibo Zhu, Hongyan Hao. 22
[moe training and inference] SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System. Liang Shen, Zhihua Wu, WeiBao Gong, Hongxiang Hao, Yangfan Bai, HuaChao Wu, Xinxuan Wu, Jiang Bian, Haoyi Xiong, Dianhai Yu, Yanjun Ma. 22
[diffusion transformer] Scalable Diffusion Models with Transformers. William Peebles, Saining Xie. 22
[long context] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. 22
[moe] Mixture-of-Experts with Expert Choice Routing. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon. 22
[kv cache management] Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica. 23
[automatic KV cache reuse] Efficiently Programming Large Language Models using SGLang. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng. 23
[moe] Mixture of Experts with Uncertainty Voting for Imbalanced Deep Regression Problems. Yuchang Jiang, Vivien Sainte Fare Garnot, Konrad Schindler, Jan Dirk Wegner. 23
[batching LoRA inference] S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica. 23
[reducing GPU memory] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen. 23
[reducing communication volume] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. 23
[efficient generative llm inference] Splitwise: Efficient generative LLM inference using phase splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini. 23
[inference in flash] LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. 23
[efficient generative llm inference] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee. 23
[efficient generative llm inference] SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference. Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee. 23
[efficient generative llm inference] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan. 23
[model pruning] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song. 23
[kv cache] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. 23
[model pruning] LLM-Pruner: On the Structural Pruning of Large Language Models. Xinyin Ma, Gongfan Fang, Xinchao Wang. 23
[model pruning] ZipLM: Inference-Aware Structured Pruning of Language Models. Eldar Kurtic, Elias Frantar, Dan Alistarh. 23
[model pruning] Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes. Lucio Dery, Steven Kolawole, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar. 23
[contextual sparsity] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen. 23
[checkpoint] SWIFT: Expedited Failure Recovery for Large-scale DNN Training. Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu. 23
[checkpoint] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury. 23
[expert placement] FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, Bin Cui. 23
[moe model] LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang. 23
[moe fine-tuning] MOELoRA: An MOE-based Parameter Efficient Fine-Tuning Method for Multi-task Medical Applications. Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, Yefeng Zheng. 23
[moe inference] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference. Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee. 23
[multi-modality] ImageBind-LLM: Multi-modality Instruction Tuning. Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao. 23
[moe] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy. Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen. 23
[moe on edge] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu. 23
[diffusion transformer] Photorealistic Video Generation with Diffusion Models. Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama. 23
[long context] Blockwise Parallel Transformer for Large Context Models. Hao Liu, Pieter Abbeel. 23
[inference serving system] SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads. Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov. 23
[kv cache] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava. 23
[inference serving system] Fast Distributed Inference Serving for Large Language Models. Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, Xin Jin. 23
[sparsity] LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, Tuo Zhao. 23
[sparsity] CHAI: Clustered Head Attention for Efficient LLM Inference. Saurabh Agarwal, Bilge Acun, Basil Hosmer, Mostafa Elhoushi, Yejin Lee, Shivaram Venkataraman, Dimitris Papailiopoulos, Carole-Jean Wu. 24
[kv cache] DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving. Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic. 24
[moe] OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models. Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You. 24
[speculative decoding] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao. 24
[speculative decoding] SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia. 24
[moe model] Mixtral of Experts. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 24
[distributed kv cache] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache. Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, Wei Lin. 24
[fast moe inference with cpu] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci. 24
[fast moe inference with cpu] MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving. Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina. 24
[kv cache] FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines. Jiaao He, Jidong Zhai. 24
[kv cache] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, Beidi Chen. 24
[lora serving system] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference. Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang. 24
[llm on mobile device] LLM as a System Service on Mobile Devices. Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu. 24
[inference serving system] MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving. Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang. 24
[kv cache] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference. Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath. 24
[tensor parallelism] Maximizing Parallelism in Distributed Training for Huge Neural Networks. Zhengda Bian, Qifan Xu, Boxiang Wang, Yang You. arXiv'21