Papers with Practical Value for Vision-Language Research @NeurIPS 2023 Day 5.

The 9 works below (an invited talk and eight papers) offer practical solutions or guidance for vision-language research. I summarize each in a few sentences.


Invited Talk: Systems and Foundation Models (FM). General-purpose FMs solve niche problems such as data cleaning better than dedicated algorithms. Christopher Ré shares two directions for making FMs more efficient from a computer-systems perspective. (1) Speed up the attention layer by minimizing GPU I/O: FlashAttention is 6-10x faster while using 5%-10% of the memory of regular attention. (2) Replace the attention layer with signal-processing-inspired architectures that are more compute-efficient and scale linearly in sequence length, such as S4. More recent works in this line, Based and Mamba, achieve lower perplexity than Transformer-based models and perform better on Long Range Arena tasks.
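As a concrete handle on the I/O point, here is a minimal sketch (my addition, not from the talk) of how PyTorch 2.x exposes a FlashAttention-style fused kernel via `scaled_dot_product_attention`; the shapes are arbitrary, and the fast backend only kicks in on supported GPUs:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim); fp16 on a GPU lets PyTorch dispatch
# to its FlashAttention backend, which avoids writing the full
# (seq_len x seq_len) score matrix to GPU memory.
q = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```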

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources. Expensive trial and error is usually needed to determine the mixing ratio of multiple training datasets that gives the best downstream performance. The authors propose a theoretically grounded alternative: first fit a linear model that predicts validation performance from the Optimal Transport (OT) distance between a down-sampled training set and the validation set, then extrapolate this relation to a closed-form expression for the full training set. The optimal mixing ratio can then be obtained analytically from the closed-form expression. Their method can find application in continual pretraining, a critical step in training domain-expert models. Paper.
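A rough sketch of the pipeline as I understand it, using the POT toolbox rather than the authors' code; the embeddings, subset sizes, and accuracies below are made up:

```python
import numpy as np
import ot                                   # POT toolbox: pip install pot
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_val = rng.normal(size=(200, 32))          # validation-set embeddings (toy)

def ot_distance(X_a, X_b):
    M = ot.dist(X_a, X_b)                   # pairwise squared-Euclidean costs
    a = np.full(len(X_a), 1 / len(X_a))     # uniform weights over samples
    b = np.full(len(X_b), 1 / len(X_b))
    return ot.emd2(a, b, M)                 # exact OT cost

# Pretend we trained on down-sampled training sets of growing size and
# logged validation accuracy for each (hypothetical numbers):
subsets = [rng.normal(size=(n, 32)) for n in (100, 200, 400)]
dists = [[ot_distance(X, X_val)] for X in subsets]
accs = [0.61, 0.66, 0.70]

lin = LinearRegression().fit(dists, accs)   # perf ~ w * OT(train, val) + b
# Extrapolating this relation to the full (unrevealed) sources is what
# yields the closed-form expression for the optimal mixing ratio.
```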

Are Emergent Abilities of Large Language Models a Mirage? Winner of a NeurIPS 2023 Outstanding Paper Award. The authors show that many "emergent abilities" are mainly an artifact of non-linear, discontinuous evaluation metrics such as accuracy. 92% of the emergent abilities claimed on BIG-bench occur under two harsh metrics: multiple-choice accuracy and exact string match. If linear, continuous metrics such as token edit distance are used instead, performance improves smoothly with log parameter count. This is confirmed by integer-arithmetic experiments using GPT-3. They conclude that the improvements from scaling parameters are more predictable than surprising. Paper.
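The core argument is easy to reproduce with a toy calculation (my construction, following the paper's reasoning): if per-token accuracy \(p\) improves smoothly with scale, exact match on an \(L\)-token answer behaves like \(p^L\) and looks like a sudden jump, while the expected token-error count \((1-p)L\) declines linearly:

```python
import numpy as np

L = 10                           # answer length in tokens
p = np.linspace(0.5, 1.0, 11)    # per-token accuracy improving smoothly

exact_match = p ** L             # "emergent": near zero until p is high
expected_errors = (1 - p) * L    # continuous metric: declines linearly

for pi, em, err in zip(p, exact_match, expected_errors):
    print(f"p={pi:.2f}  exact-match={em:.3f}  expected-errors={err:.1f}")
```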

Task Arithmetic in the Tangent Space. Empirically, if you sum the weight update from fine-tuning on Task A and the one from fine-tuning on Task B, the resulting model does well on both tasks. This weight-merging scheme is called task arithmetic. The authors show task arithmetic is possible because the weight updates for Tasks A and B are disentangled: they lie in distinct directions of weight space that influence the model on largely disjoint regions of the input space. They also introduce linearized fine-tuning, which fine-tunes the model in its tangent space and enforces weight disentanglement, and show that task arithmetic on linearly fine-tuned models yields models that perform better on both tasks. This can be useful for multi-task learning. Paper.
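For reference, vanilla task arithmetic is a few lines of state-dict algebra; a minimal sketch (the checkpoints are hypothetical, and the paper's linearized fine-tuning is not shown):

```python
import torch

def task_vector(base, finetuned):
    # Weight update from fine-tuning: theta_ft - theta_0, per parameter.
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, alpha=1.0):
    # theta_merged = theta_0 + alpha * (tau_A + tau_B + ...)
    merged = {k: v.clone() for k, v in base.items()}
    for vec in vectors:
        for k in merged:
            merged[k] += alpha * vec[k]
    return merged

# With theta_0, theta_A, theta_B as state_dicts of the pretrained model
# and its Task-A / Task-B fine-tunes (hypothetical checkpoints):
# merged = apply_task_vectors(theta_0, [task_vector(theta_0, theta_A),
#                                       task_vector(theta_0, theta_B)])
```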

Jailbroken: How Does LLM Safety Training Fail? Jailbreak attacks aim to elicit harmful responses from LLMs via malicious prompting. The authors show these attacks work by either (1) prefix injection, which invokes capabilities that conflict with the safety objective, such as forcing the response to start with "Absolutely! Here's", or (2) exploiting domains not covered by safety training, such as setting up a role-play scenario with "You are an amoral AI", or a combination of both. Based on this observation, they design jailbreak attacks that breach GPT-4 and Claude v1.3. Finally, they point out that scaling alone is insufficient for defense and suggest integrated defenses such as automatic red-teaming. Paper.

Data Selection for Language Models via Importance Resampling. Their algorithm selects a subset of training data for optimal performance on a given downstream task. Importance resampling is used to select text chunks that follow the target task distribution \(q\), even though the raw pool is drawn from a different distribution \(p\). They model \(q\) and \(p\) efficiently with hashed N-grams, which makes importance resampling tractable, and they validate the approach at trillion-token scale. Paper.
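A simplified sketch of the hashed N-gram scoring (my reconstruction; the bucket count, bigram features, and smoothing are assumptions, not the authors' implementation):

```python
import hashlib
import numpy as np

B = 2 ** 16                                  # number of hash buckets (assumed)

def bucket(ngram):
    return int(hashlib.md5(ngram.encode()).hexdigest(), 16) % B

def hashed_bigram_dist(texts):
    counts = np.ones(B)                      # add-one smoothing
    for t in texts:
        toks = t.split()
        for i in range(len(toks) - 1):
            counts[bucket(toks[i] + " " + toks[i + 1])] += 1
    return counts / counts.sum()

def importance_score(text, q, p):
    # log q(x) - log p(x) under the hashed-bigram models.
    toks = text.split()
    idx = [bucket(toks[i] + " " + toks[i + 1]) for i in range(len(toks) - 1)]
    return float(np.sum(np.log(q[idx]) - np.log(p[idx])))

# q = hashed_bigram_dist(target_task_texts); p = hashed_bigram_dist(raw_pool)
# Resampling then keeps the top-k chunks after adding Gumbel noise to scores.
```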

No Train No Gain: Revisiting Efficient Training Algorithms for Transformer-based Language Models. TLDR: simply decaying the learning rate as a function of training time outperforms most algorithms designed for efficiency. Paper.
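For reference, the baseline that wins is just a standard decay schedule; a minimal PyTorch sketch (the model, loss, and step count are stand-ins):

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000)

for step in range(1_000):
    opt.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # stand-in loss
    loss.backward()
    opt.step()
    sched.step()      # learning rate decays smoothly over T_max steps
```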

Stable and low-precision training for large-scale vision-language models. Three remedies for spikes in the training loss: (1) if you use AdamW/Adafactor to train large vision-language models, set beta2=0.95; the default beta2=0.999 causes instability. (2) Use a smaller batch size. (3) Use a smaller learning rate. Paper.
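Point (1) is a one-line change in PyTorch; a minimal sketch (the model and learning rate here are stand-ins):

```python
import torch

model = torch.nn.Linear(1024, 1024)   # stand-in for a large VLM
opt = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),                # beta2=0.95 instead of the default 0.999
    weight_decay=0.1,
)
```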

Leveraging Early-Stage Robustness in Diffusion Models for Efficient and High-Quality Image Synthesis. This work shows that more aggressive activation quantization (4-bit) can be used in earlier diffusion timesteps, whereas 8-bit activation quantization is required in later timesteps to preserve generation quality as measured by FID. Paper.
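A toy sketch of what a timestep-dependent activation quantizer could look like (entirely my construction: the switch point and the fake-quantization routine are assumptions, not the paper's method):

```python
import torch

def fake_quantize(x, bits):
    # Symmetric uniform quantization of activations to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def act_bits(t, T, switch=0.5):
    # Assumed policy: sampling runs t = T ... 0, so early (high-noise)
    # steps get 4-bit activations and later steps get 8-bit.
    return 4 if t > switch * T else 8

x = torch.randn(1, 64, 32, 32)                  # stand-in activation map
for t in (999, 500, 100):
    x_q = fake_quantize(x, act_bits(t, T=1000))
```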


I will write dedicated blog posts for notable works.