Efficient Learning Orals @NeurIPS 2023
We will walk through 3 oral presentations on Efficient Learning at NeurIPS 2023 in 2 minutes.
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Presenter: Dan Fu@Stanford. Paper.
Attention and MLP layers scale quadratically in sequence length and model dimension, respectively. This work proposes Monarch Mixer, an alternative architecture with sub-quadratic scaling that replaces the Transformer's attention and MLP layers. M2-BERT, a BERT model with its attention and MLP layers replaced by Monarch Mixer, matches the original BERT's GLUE performance with 27% fewer parameters while supporting a longer (8K) context length. They observe similar gains in M2-ViT and M2-GPT experiments.
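To make the core idea concrete, here is a minimal PyTorch sketch of a Monarch-style multiply: two block-diagonal GEMMs separated by a permutation, which costs roughly O(n^1.5) instead of O(n^2) for an n-dimensional input. The function name, shapes, and initialization below are my own illustration of the idea, not the M2 implementation.

```python
import torch

def monarch_matmul(x, blkdiag1, blkdiag2):
    """Multiply x (batch, n) by a Monarch-style matrix given as two
    block-diagonal factors, each stored as (nblocks, block, block).
    This simple square sketch assumes n = m * m with nblocks == block == m."""
    batch, n = x.shape
    nblocks, blk, _ = blkdiag1.shape
    assert n == nblocks * blk
    # First block-diagonal GEMM: split the input into blocks and apply each block matrix.
    x = x.reshape(batch, nblocks, blk)
    x = torch.einsum('bni,nji->bnj', x, blkdiag1)
    # Permutation: transpose the (nblocks, blk) grid, as in the Monarch factorization.
    x = x.transpose(1, 2).contiguous()              # (batch, blk, nblocks)
    # Second block-diagonal GEMM on the permuted layout.
    x = torch.einsum('bni,nji->bnj', x, blkdiag2)
    # Undo the permutation and flatten back to (batch, n).
    return x.transpose(1, 2).reshape(batch, n)

# Usage: n = 256 with 16 blocks of size 16 -> ~2 * n^1.5 parameters instead of n^2.
n, m = 256, 16
x = torch.randn(8, n)
b1 = torch.randn(m, m, m) / m**0.5
b2 = torch.randn(m, m, m) / m**0.5
y = monarch_matmul(x, b1, b2)   # shape (8, 256)
```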
QLoRA: Efficient Fine-tuning of Quantized LLMs
Presenter: Tim Dettmers@University of Washington. Paper.
QLoRA requires even less memory than LoRA by applying codebook quantization to the frozen base model. They propose 4-bit NormalFloat (NF4), an information-theoretically optimal data type for normally distributed weights. They show that QLoRA makes finetuning 18x cheaper, and that NF4-quantized models perform on par with FP16 models. Using QLoRA, they build Guanaco, a "ChatGPT-quality" 4-bit chatbot finetuned in 24 hours on a single GPU. QLoRA is available through the bitsandbytes library and HuggingFace Transformers.
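As a rough illustration of how this looks in practice, the sketch below loads a base model in NF4 via bitsandbytes and attaches trainable LoRA adapters with PEFT. The checkpoint name, LoRA hyperparameters, and target module names are placeholders, and exact arguments depend on your transformers/peft/bitsandbytes versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "huggyllama/llama-7b"   # placeholder; any causal LM checkpoint

# 4-bit NormalFloat (NF4) quantization of the frozen base weights,
# with double quantization and bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Trainable low-rank adapters on top of the quantized, frozen base model.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names are checkpoint-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained
```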
Scaling Data-Constrained Language Models
Presenter: Niklas Muennighoff@HuggingFace. Paper.
High-quality text data (books and papers) may be exhausted as soon as next year if LLMs keep scaling at the current rate. They share three findings for this data-constrained regime: (1) Moderately repeating training data yields better models. Although previous work (GPT-3, PaLM) advises against training for more than one epoch, they find that about 4 epochs give good empirical results, while further repetition hurts performance. (2) Mixing in 50% code data helps improve natural language understanding. (3) Quality filtering is important for repeated data.
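The sketch below is only a toy illustration of those rules of thumb (cap repetition at ~4 epochs, fill up to half the mix with code), not the paper's fitted data-constrained scaling law; the function name and parameters are hypothetical.

```python
def plan_token_budget(total_tokens, unique_nl_tokens, unique_code_tokens,
                      max_epochs=4, code_fraction=0.5):
    """Toy budget planner (hypothetical helper): repeat scarce natural-language
    data for at most ~4 epochs and devote up to half the budget to code."""
    nl_target = total_tokens * (1 - code_fraction)
    code_target = total_tokens * code_fraction
    # Repeating beyond ~4 epochs gives rapidly diminishing returns,
    # so cap training tokens at max_epochs passes over the unique data.
    nl_tokens = min(nl_target, unique_nl_tokens * max_epochs)
    code_tokens = min(code_target, unique_code_tokens * max_epochs)
    return nl_tokens, code_tokens

# Example: a 100B-token budget with only 10B unique NL tokens available.
nl, code = plan_token_budget(100e9, unique_nl_tokens=10e9, unique_code_tokens=60e9)
print(f"train on {nl/1e9:.0f}B NL tokens and {code/1e9:.0f}B code tokens")
# -> train on 40B NL tokens and 50B code tokens
```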
Let me know if you find this summary helpful!