Highlight Talks @NeurIPS 2023

Booths are under construction in the hall.

Here are the highlights of the talks from Day 1.

Optimizing LLM Inference: the topic comes up in 4 talks.

  • Databricks presents how to use first principles to optimize Transformer inference. Linden Li explains the two steps of inference, "prefill" and "decode", and shows simple formulas to determine whether an inference job is memory-bound or compute-bound. He also shares three ideas for optimizing inference: (1) reduce memory usage with Grouped Query Attention or PagedAttention; (2) increase batch size with the Orca scheduler; and (3) decode more tokens in parallel using block-wise parallel decoding or speculative decoding.
  • Prof. Michael Schulte presents quantization with advanced data formats, specifically the MX datatypes provided in the Brevitas PyTorch library. Brevitas supports both post-training quantization (PTQ) and quantization-aware training (QAT). Their experiments show that MXINT8 achieves results comparable to FP32 on GPT-175B/LLaMA-7B inference in the PTQ setting, whereas MXFP6 offers more savings and good performance in the QAT setting.
  • Ant Group touches upon Lookahead decoding, which is similar to speculative decoding. Their paper, "Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy", is under review. Dr. James Zhang also introduces their work on efficient inference via pruning and distillation, and on fast chain-of-thought.
  • AWS discusses quantization in the context of small models. In addition, they review the prominent families of < 13B models: Mistral, Orca, and Phi, as well as efficient training techniques like MultiLoRA.
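
To make the memory-bound vs. compute-bound check from the Databricks talk concrete, here is a rough roofline-style sketch. These are not the formulas from the talk. It uses a common approximation: about 2 FLOPs per parameter per token, with the weights read from memory once per forward pass, and it ignores KV-cache and activation traffic. The hardware numbers and all names (`Hardware`, `arithmetic_intensity`, `regime`) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # peak compute, FLOP/s (hypothetical value below)
    mem_bandwidth: float  # memory bandwidth, bytes/s (hypothetical value below)


def arithmetic_intensity(n_params, batch_tokens, bytes_per_param=2):
    """FLOPs per byte of weights moved for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is read
    once per pass. KV-cache and activation traffic are ignored.
    """
    flops = 2 * n_params * batch_tokens
    bytes_moved = n_params * bytes_per_param
    return flops / bytes_moved


def regime(hw, n_params, batch_tokens, bytes_per_param=2):
    """Compare arithmetic intensity with the hardware's ridge point."""
    ridge = hw.peak_flops / hw.mem_bandwidth  # FLOPs the chip can do per byte
    ai = arithmetic_intensity(n_params, batch_tokens, bytes_per_param)
    return "compute-bound" if ai >= ridge else "memory-bound"


# Hypothetical accelerator: 300 TFLOP/s, 2 TB/s -> ridge point of 150 FLOP/byte.
hw = Hardware(peak_flops=300e12, mem_bandwidth=2e12)

# Decode: one new token per sequence, batch of 8 sequences.
decode = regime(hw, n_params=7e9, batch_tokens=8)

# Prefill: 8 prompts of 512 tokens are processed in one pass.
prefill = regime(hw, n_params=7e9, batch_tokens=8 * 512)

print("decode:", decode)
print("prefill:", prefill)
```

Under these assumptions the intensity depends only on the number of tokens per pass, which is why decode tends to sit on the memory-bound side and why larger batches or decoding several tokens at once help.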

Bayesian optimization for hyper-parameter tuning. In this workshop, Meta presents their Adaptive Experimentation platform, Ax. Ax searches for optimal hyper-parameters efficiently using Gaussian Processes, with applications in architecture search and in optimizing information retrieval in recommender systems.
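
For readers who have not seen Gaussian-process-based search before, here is a toy sketch of the general idea. It does not use Ax or reproduce its API. It is a stdlib-only, one-dimensional illustration with an RBF kernel and the expected-improvement acquisition function, and the objective `toy_loss` is a made-up stand-in for a validation loss over a normalized hyper-parameter in [0, 1].

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(xs, ys, noise=1e-4):
    """Fit a zero-mean GP and return predict(x) -> (mean, std)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, ys))

    def predict(x):
        k = [rbf(a, x) for a in xs]
        mean = sum(ki * ai for ki, ai in zip(k, alpha))
        v = forward_sub(L, k)
        var = max(1.0 - sum(vi * vi for vi in v), 0.0)
        return mean, math.sqrt(var)

    return predict


def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing."""
    if std < 1e-12:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best - mean) * cdf + std * pdf


def bayes_opt(objective, n_iter=8, init=(0.1, 0.5, 0.9), grid_size=101):
    """Minimize `objective` on [0, 1] by maximizing EI over a fixed grid."""
    xs = list(init)
    ys = [objective(x) for x in xs]
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    for _ in range(n_iter):
        mu = sum(ys) / len(ys)                 # center targets for the zero-mean GP
        predict = fit_gp(xs, [y - mu for y in ys])
        best = min(ys) - mu
        scores = [expected_improvement(*predict(g), best) for g in grid]
        x_next = grid[max(range(grid_size), key=scores.__getitem__)]
        xs.append(x_next)
        ys.append(objective(x_next))
    i = min(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]


def toy_loss(x):
    """Hypothetical validation loss with its minimum at x = 0.3."""
    return (x - 0.3) ** 2


best_x, best_y = bayes_opt(toy_loss)
print(best_x, best_y)
```

The point of the surrogate is that each new trial is chosen where the model expects the most improvement, so far fewer evaluations are needed than with a blind grid or random search. A production system would also handle multiple dimensions, noisy observations, and kernel hyper-parameter fitting, none of which this sketch attempts.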

Foundational translation model. SeamlessM4T from Meta demonstrates impressive real-time streaming translation capability.

Advertising software packages. AutoGen for multi-agent conversation. FiftyOne for image data visualization. MATLAB for certifying ML systems in aviation.

The length of the abstract is a strong indicator of presentation quality. Tip from me: avoid talks with oversized abstracts. If the abstract cannot fit on your phone screen, you are better off chilling in the canteen.

One more tip: don't order Chinese takeaways in New Orleans. Learned it the hard way :<