Articles

Highlight Talks @NeurIPS 2023

Booths are under construction in the hall.

Here are the highlights of the talks from Day 1.

Optimizing LLM Inference: the topic comes up in 4 talks.

Databricks shows how to optimize Transformer inference from first principles. Linden Li explains the two steps of inference, "prefill" and "decode", and shows simple formulas to determine whether an inference job is memory-bound or compute-bound. He also shares three ideas for inference optimization: (1) reduce memory usage with Grouped Query Attention or PagedAttention; (2) increase batch size with the Orca scheduler; and (3) decode more tokens in parallel using block-wise parallel decoding or speculative decoding.
Prof. Michael Schulte presents quantization with advanced data formats, specifically the MX datatypes provided in the Brevitas PyTorch library. Brevitas supports both post-training quantization (PTQ) and quantization-aware training (QAT). Their experiments show that MXINT8 achieves results comparable to FP32 on GPT-175B/LLaMA-7B inference in the PTQ setting, whereas MXFP6 offers more savings and good performance in the QAT setting.
Ant Group touches on Lookahead decoding, which is similar to speculative decoding. Their paper, Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, is under review. Dr. James Zhang also introduces their work on efficient inference via pruning and distillation, and on fast chain-of-thought.
AWS discusses quantization in the context of small models. In addition, they review the prominent families of < 13B models: Mistral, Orca, and Phi, as well as efficient training techniques like MultiLoRA.
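To make the memory-bound versus compute-bound question from the Databricks talk concrete, here is a minimal roofline-style sketch. It is not the formula shown in the talk; it assumes a dense model where each parameter costs about 2 FLOPs per token and the weights are read from memory once per forward pass, and it ignores KV-cache and activation traffic. The function names and the hardware numbers are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, and every
    weight is read from memory exactly once per pass.
    For decode, tokens_per_pass is the batch size (one new token per sequence).
    For prefill, it is batch size times prompt length.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_memory_bound(tokens_per_pass: int,
                    peak_flops: float,
                    mem_bandwidth: float,
                    bytes_per_param: float = 2.0) -> bool:
    """True if the pass is limited by memory bandwidth rather than compute.

    The hardware "ridge point" is peak_flops / mem_bandwidth (FLOPs per byte).
    Work with lower arithmetic intensity than the ridge point cannot keep the
    compute units busy, so it is memory-bound.
    """
    ridge_point = peak_flops / mem_bandwidth
    return arithmetic_intensity(tokens_per_pass, bytes_per_param) < ridge_point


if __name__ == "__main__":
    # Hypothetical accelerator: 300 TFLOP/s of FP16 compute, 1.5 TB/s of memory bandwidth.
    PEAK_FLOPS = 300e12
    MEM_BANDWIDTH = 1.5e12
    for tokens in (1, 16, 256, 4096):
        bound = "memory" if is_memory_bound(tokens, PEAK_FLOPS, MEM_BANDWIDTH) else "compute"
        print(f"{tokens:>5} tokens per pass -> {bound}-bound")
```

Under these assumptions, decode at small batch sizes sits far below the ridge point, which is why ideas (2) and (3) in the talk, larger batches and more tokens decoded per pass, both help: they raise the number of tokens processed per read of the weights.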

Bayesian optimization for hyper-parameter tuning. In this workshop, Meta presents their Adaptive Experimentation framework, Ax. Ax searches for optimal hyper-parameters efficiently using Gaussian Processes, finding applications in architecture search and in optimizing information retrieval in recommender systems.
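For readers new to the idea, here is a toy sketch of Gaussian-Process-based Bayesian optimization in plain Python. It does not use or imitate the Ax API. It assumes a single hyper-parameter scaled to [0, 1], a zero-mean GP with an RBF kernel, and an upper-confidence-bound acquisition evaluated on a fixed grid. All names, including the `toy_score` objective, are hypothetical.

```python
import math


def rbf(a: float, b: float, length_scale: float = 0.2) -> float:
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def solve_lower_transposed(L, z):
    """Solve L^T x = z by back substitution."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, ys))  # K^{-1} y
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, var


def bayes_opt(objective, n_iter=10, beta=2.0):
    """Maximize objective on [0, 1] using a GP surrogate and UCB acquisition."""
    grid = [i / 100 for i in range(101)]
    xs = [0.0, 1.0]                      # two initial evaluations at the ends
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        def ucb(x):
            mean, var = gp_posterior(xs, ys, x)
            return mean + beta * math.sqrt(var)
        candidates = [x for x in grid if x not in xs]
        x_next = max(candidates, key=ucb)  # most promising untried point
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def toy_score(x: float) -> float:
    """Hypothetical validation score, peaked at x = 0.3."""
    return -(x - 0.3) ** 2


if __name__ == "__main__":
    best_x, best_y = bayes_opt(toy_score)
    print(f"best x = {best_x:.2f}, score = {best_y:.4f}")
```

The appeal for hyper-parameter tuning is that each call to the objective (a full training run, in practice) is expensive, while refitting the surrogate and scanning the acquisition function is cheap, so the search spends its budget on points the model considers either promising or uncertain.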

Foundational translation model. SeamlessM4T from Meta demonstrates impressive real-time streaming translation capability.

Software packages being advertised: AutoGen for multi-agent conversation, FiftyOne for image data visualization, and MATLAB for certifying ML systems in aviation.

The length of the abstract is a strong indicator of presentation quality. Tip from me: avoid talks with oversized abstracts. If the abstract cannot fit on your phone screen, you are better off chilling in the canteen.

One more tip: don't order Chinese takeaways in New Orleans. Learned it the hard way :<

Deriving Speculative Sampling Intuitively

[538 words, 2-minute read] A family of lossless LLM inference acceleration techniques has been developed based on speculative sampling (review here). Proposed by Google and DeepMind, speculative sampling is the following three-step procedure: 1. Draft: a small model (draft model, \(p(\cdot|\text{context})\)) quickly generates a K-token draft. 2.

Retrieval-Augmented Image Synthesis: A Researcher's Guide

[1029 words, 4-minute read] Retrieval-Augmented Image Synthesis (RA-IS) first retrieves relevant images from a database and then generates an image grounded in the retrieved images. RA-IS has been shown to (1) enhance image quality; (2) guide image styles; and (3) help faithfully generate specific objects (e.g., The Oriental Pearl

Catch up on Speculative Decoding in 5 minutes: a survey for researchers as of December 2023

Speculative decoding speeds up LLM inference without any loss of generation quality. As of December 2023, researchers have reported ~2x speed-up from applying speculative decoding to 3B to 1T models. This survey explains the latest speculative decoding methods that enable lossless speed-up, examines reported experimental results, and suggests future research
