Jinghong Chen


Articles

Focused pieces on ML/NLP/AIGC topics.

Deriving Speculative Sampling Intuitively

[538 words, 2-minute read] A family of lossless LLM inference acceleration techniques has been developed based on speculative sampling (review here). Proposed by Google and DeepMind, speculative sampling is the following three-step procedure: 1. Draft: a small model (draft model, \(p(\cdot|\text{context})\)) quickly generates a K-token draft. 2.

Retrieval-Augmented Image Synthesis: A Researcher's Guide

[1029 words, 4-minute read] Retrieval-Augmented Image Synthesis (RA-IS) first retrieves relevant images from a database and then generates an image grounded in the retrieved images. RA-IS has been shown to (1) enhance image quality; (2) guide image styles; and (3) help faithfully generate specific objects (e.g., The Oriental Pearl

Catch up on Speculative Decoding in 5 minutes: a survey for researchers as of December 2023

Speculative decoding speeds up LLM inference without any loss of generation quality. As of December 2023, researchers have reported ~2x speed-up from applying speculative decoding to 3B to 1T models. This survey explains the latest speculative decoding methods that enable lossless speed-up, examines reported experimental results, and suggests future research

Estimate LLM inference speed and VRAM usage quickly: with a Llama-7B case study

You can estimate Time-To-First-Token (TTFT), Time-Per-Output-Token (TPOT), and the VRAM (Video Random Access Memory) needed for Large Language Model (LLM) inference in a few lines of calculation. I will show you how with a real example using Llama-7B. LLM Inference Basics. LLM inference consists of two stages: prefill and decode.

Papers with Practical Values for Vision-Language Research @NeurIPS 2023 Day 5.

The 9 papers below offer practical solutions or guidance for vision-language research. I describe each work in 5 sentences. Invited Talk: Systems and Foundation Models (FM). General-purpose FMs solve niche problems such as data cleaning better than dedicated algorithms. Christopher Ré shares two directions to make FMs more efficient from

(Vision-Language Researcher) Selected Papers @NeurIPS 2023

Here are some papers from Day 4 of NeurIPS 2023 that we, who mainly work on vision-language models, find interesting. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Tree of Thoughts (ToT) is a decoding scheme for auto-regressive Transformers. A thought is defined as a coherent piece

Efficient Learning Orals @NeurIPS 2023

We will walk through 3 oral presentations on Efficient Learning at NeurIPS 2023 in 2 minutes. Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. Presenter: Dan Fu@Stanford. Paper. Attention and MLP layers scale quadratically in sequence length and model dimension. This work proposes Monarch Mixer, an alternative architecture with sub-quadratic