(Vision-Language Researcher) Selected Papers @NeurIPS 2023

Here are some papers from Day 4 of NeurIPS 2023 that we, as researchers who mainly work on vision-language models, found interesting.


Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Tree of Thoughts (ToT) is a decoding scheme for auto-regressive Transformers. A thought is defined as a coherent piece of text. For example, in the Game of 24 (figuring out the arithmetic to make 24 from 4 numbers), a thought is a line of an equation. Instead of generating one token at a time, ToT generates one thought at a time and then conducts beam search over thought candidates. With ToT, large language models solve the Game of 24 70% of the time, whereas with Chain-of-Thought (CoT) the success rate is only 5%. Paper.
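To get a feel for how thought-level search differs from token-level decoding, here is a minimal Python sketch of a ToT-style beam search. The helpers propose_thoughts and score_state are hypothetical LLM-backed functions (one samples candidate next steps, the other rates a partial solution); this is an illustration of the idea, not the authors' code.

```python
# Minimal sketch of ToT-style beam search over "thoughts".
# `propose_thoughts` and `score_state` are hypothetical LLM-backed helpers:
# one samples candidate next steps, the other rates how promising a partial
# solution looks.
from typing import Callable, List

def tree_of_thoughts(
    problem: str,
    propose_thoughts: Callable[[str, List[str]], List[str]],
    score_state: Callable[[str, List[str]], float],
    depth: int = 3,        # e.g. 3 equations reduce 4 numbers to 1 in the Game of 24
    beam_width: int = 5,   # number of partial solutions kept per step
) -> List[str]:
    beam = [[]]  # each state is a list of thoughts (lines of reasoning) so far
    for _ in range(depth):
        candidates = []
        for state in beam:
            for thought in propose_thoughts(problem, state):
                candidates.append(state + [thought])
        # Keep the most promising partial solutions, as judged by the scorer.
        candidates.sort(key=lambda s: score_state(problem, s), reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # best chain of thoughts found
```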

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention. In a learned attention layer, the attention score is often dominated by the dot products of a few query-key vector pairs. This work identifies these dominant query-key pairs with locality-sensitive hashing, arranges them so that computing the attention scores reduces to evaluating a few block matrices, and uses Flash Attention to speed up the computation. They report 2x faster convergence with no loss of performance compared to Flash Attention. This summary has been reviewed by the first author ✔️. Paper.
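To illustrate the bucketing idea, here is a rough numpy sketch in which queries and keys are hashed with random hyperplanes and attention is restricted to causally visible pairs that land in the same bucket. The real method groups hashed pairs into blocks and evaluates them with a Flash Attention kernel; this dense-mask version (with my own helper names) is only for intuition.

```python
# Rough sketch of hashing-based sparse causal attention: only attend where
# the query and key fall in the same LSH bucket (and causality allows it).
# This is an illustration, not the authors' block-sparse Flash Attention kernel.
import numpy as np

def lsh_buckets(x: np.ndarray, n_hyperplanes: int = 8, seed: int = 0) -> np.ndarray:
    """Map each vector to an integer bucket via random-hyperplane sign bits."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[-1], n_hyperplanes))
    bits = (x @ planes > 0).astype(np.int64)
    return bits @ (1 << np.arange(n_hyperplanes))  # pack sign bits into a bucket id

def sparse_causal_attention(q, k, v, n_hyperplanes=8):
    """Toy dense-mask version: zero out attention outside matching buckets."""
    qb, kb = lsh_buckets(q, n_hyperplanes), lsh_buckets(k, n_hyperplanes)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = q.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    mask = causal & (qb[:, None] == kb[None, :])
    mask[:, 0] = causal[:, 0]  # keep at least one visible key per query in this toy
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```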

Self-Chained Image-Language Model for Video Localization and Question Answering. This work tackles video localization and video question answering simultaneously with a single Image-LM architecture. They retrieve key frames from the video and then generate answers conditioned on these frames via an LLM. Theirs is the "previous SoTA" in video question answering cited in the Google Gemini report. This summary has been reviewed by the first author ✔️. Paper. 5-min talk.
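Here is a minimal sketch of the "localize, then answer" flow, assuming two hypothetical helpers backed by the image-language model: frame_relevance scores a frame against the question, and answer_from_frames generates the answer from the kept frames. It is a schematic of the pipeline, not the authors' implementation.

```python
# Minimal sketch of two-stage video QA: select keyframes, then answer.
# `frame_relevance` and `answer_from_frames` are hypothetical helpers backed
# by an image-language model.
from typing import Callable, List, Sequence

def video_qa(
    frames: Sequence,                                   # decoded video frames
    question: str,
    frame_relevance: Callable[[object, str], float],    # frame x question -> score
    answer_from_frames: Callable[[List[object], str], str],
    top_k: int = 4,
) -> str:
    # Stage 1: localization -- keep the k frames most relevant to the question.
    ranked = sorted(frames, key=lambda f: frame_relevance(f, question), reverse=True)
    keyframes = ranked[:top_k]
    # Stage 2: question answering conditioned only on the selected keyframes.
    return answer_from_frames(keyframes, question)
```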

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering. Our work. We are interested in retrieving knowledge to better answer image-based questions. The previous approach, Dense Passage Retrieval (DPR), uses a single vector to represent each query and document. We use all token embeddings as the representation, outperforming DPR by a large margin. Read our three-minute technical blog, which was a big success at the poster session. Paper.
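The contrast with DPR comes down to the scoring function. Below is a small numpy sketch of single-vector scoring versus ColBERT-style late interaction over all token embeddings; the full FLMR scoring has more detail (e.g. how image tokens enter the query), so treat this as an illustration of the general idea.

```python
# Sketch: single-vector (DPR-style) scoring vs. late-interaction scoring
# over all token embeddings. Shapes are illustrative.
import numpy as np

def dpr_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """DPR-style: one vector per query/document, scored with a dot product."""
    return float(query_vec @ doc_vec)

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: for each query token embedding, take its best match
    among the document token embeddings, then sum those maxima (MaxSim)."""
    sims = query_tokens @ doc_tokens.T        # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())
```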

Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry. Their analysis of the principal components of a diffusion model's latent space verifies three intuitions: (1) At earlier timesteps, the principal components correspond to low-frequency structure, whereas at later timesteps they control high-frequency structure. (2) Similar text prompts induce similar feature spaces. (3) The generative process depends less on text conditions at later timesteps. This summary has been reviewed by the first author ✔️. Full insights are in the paper.
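As a rough illustration of what "principal components of the latent space" can mean in practice, here is a PyTorch sketch that power-iterates the Jacobian of a denoiser feature map to find a dominant latent direction at a given timestep. The function feature_map is a hypothetical stand-in for the model's internal features, and this is not necessarily the authors' exact procedure.

```python
# Sketch: find a dominant latent direction at one timestep via power iteration
# on the Jacobian of a denoiser feature map. `feature_map` (x_t -> internal
# U-Net features) is a hypothetical stand-in, not the authors' code.
import torch

def principal_direction(feature_map, x_t: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Approximate the dominant right singular vector of d feature_map / d x_t."""
    v = torch.randn_like(x_t)
    v = v / v.norm()
    for _ in range(n_iters):
        # One power-iteration step on J^T J: u = J v (forward-mode),
        # then v = J^T u (reverse-mode), followed by normalization.
        _, u = torch.autograd.functional.jvp(feature_map, x_t, v)
        _, v = torch.autograd.functional.vjp(feature_map, x_t, u)
        v = v / v.norm()
    return v
```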


We hope you also find these interesting. Subscribe for more.