Efficient Learning Orals @NeurIPS 2023
We will walk through 3 oral presentations on Efficient Learning at NeurIPS 2023 in 2 minutes.
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Presenter: Dan Fu@Stanford. Paper.
Attention and MLP layers scale quadratically in sequence length and model dimension, respectively. This work proposes Monarch Mixer, an alternative architecture with sub-quadratic scaling that replaces the Transformer's attention and MLP layers. M2-BERT, a BERT model with its attention and MLP layers replaced by Monarch Mixer, matches the original BERT's GLUE performance with 27% fewer parameters while supporting a longer (8K) context length. They observe similar gains in M2-ViT and M2-GPT experiments.
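To make the core idea concrete, here is a minimal PyTorch sketch of a Monarch-style multiply: two block-diagonal GEMMs separated by a permutation, which costs roughly O(n^1.5) instead of O(n^2) for an n-dimensional input. The function name, shapes, and initialization below are my own illustration of the idea, not the M2 implementation.

```python
import torch

def monarch_matmul(x, blkdiag1, blkdiag2):
    """Multiply x (batch, n) by a Monarch-style matrix given as two
    block-diagonal factors, each stored as (nblocks, block, block).
    This simple square sketch assumes n = m * m with nblocks == block == m."""
    batch, n = x.shape
    nblocks, blk, _ = blkdiag1.shape
    assert n == nblocks * blk
    # First block-diagonal GEMM: split the input into blocks and apply each block matrix.
    x = x.reshape(batch, nblocks, blk)
    x = torch.einsum('bni,nji->bnj', x, blkdiag1)
    # Permutation: transpose the (nblocks, blk) grid, as in the Monarch factorization.
    x = x.transpose(1, 2).contiguous()              # (batch, blk, nblocks)
    # Second block-diagonal GEMM on the permuted layout.
    x = torch.einsum('bni,nji->bnj', x, blkdiag2)
    # Undo the permutation and flatten back to (batch, n).
    return x.transpose(1, 2).reshape(batch, n)

# Usage: n = 256 with 16 blocks of size 16 -> ~2 * n^1.5 parameters instead of n^2.
n, m = 256, 16
x = torch.randn(8, n)
b1 = torch.randn(m, m, m) / m**0.5
b2 = torch.randn(m, m, m) / m**0.5
y = monarch_matmul(x, b1, b2)   # shape (8, 256)
```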
QLoRA: Efficient Fine-tuning of Quantized LLMs
Presenter: Tim Dettmers@University of Washington. Paper.
QLoRA requires even less memory than LoRA by applying codebook quantization to the frozen base model. They propose 4-bit NormalFloat (NF4), an information-theoretically optimal data type for normally distributed weights. They show that QLoRA makes finetuning 18x cheaper, and that NF4-quantized models perform on par with FP16 models. Using QLoRA, they build Guanaco, a "ChatGPT-quality" 4-bit chatbot finetuned in 24 hours on a single GPU. QLoRA is available through the bitsandbytes library and HuggingFace Transformers.
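As a rough illustration of how this looks in practice, the sketch below loads a base model in NF4 via bitsandbytes and attaches trainable LoRA adapters with PEFT. The checkpoint name, LoRA hyperparameters, and target module names are placeholders, and exact arguments depend on your transformers/peft/bitsandbytes versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "huggyllama/llama-7b"   # placeholder; any causal LM checkpoint

# 4-bit NormalFloat (NF4) quantization of the frozen base weights,
# with double quantization and bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Trainable low-rank adapters on top of the quantized, frozen base model.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names are checkpoint-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained
```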
Scaling Data-Constrained Language Models
Presenter: Niklas Muennighoff@HuggingFace. Paper.
High-quality text data (books and papers) may be exhausted as soon as next year if LLMs keep scaling at the current rate. They share three findings for this data-constrained regime: (1) Moderately repeating training data yields better models. Although previous work (GPT-3, PaLM) advises against training for more than one epoch, they find that about 4 epochs give good empirical results, while further repetition hurts performance. (2) Mixing in 50% code data helps improve natural language understanding. (3) Quality filtering is important for repeated data.
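The sketch below is only a toy illustration of those rules of thumb (cap repetition at ~4 epochs, fill up to half the mix with code), not the paper's fitted data-constrained scaling law; the function name and parameters are hypothetical.

```python
def plan_token_budget(total_tokens, unique_nl_tokens, unique_code_tokens,
                      max_epochs=4, code_fraction=0.5):
    """Toy budget planner (hypothetical helper): repeat scarce natural-language
    data for at most ~4 epochs and devote up to half the budget to code."""
    nl_target = total_tokens * (1 - code_fraction)
    code_target = total_tokens * code_fraction
    # Repeating beyond ~4 epochs gives rapidly diminishing returns,
    # so cap training tokens at max_epochs passes over the unique data.
    nl_tokens = min(nl_target, unique_nl_tokens * max_epochs)
    code_tokens = min(code_target, unique_code_tokens * max_epochs)
    return nl_tokens, code_tokens

# Example: a 100B-token budget with only 10B unique NL tokens available.
nl, code = plan_token_budget(100e9, unique_nl_tokens=10e9, unique_code_tokens=60e9)
print(f"train on {nl/1e9:.0f}B NL tokens and {code/1e9:.0f}B code tokens")
# -> train on 40B NL tokens and 50B code tokens
```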
Let me know if you find this summary helpful!