Deriving Speculative Sampling Intuitively
A family of lossless techniques for accelerating LLM inference builds on speculative sampling (review here). Proposed independently by Google and DeepMind, speculative sampling is the following three-step procedure:
1. Draft: a small model (the draft model, \(p(\cdot|\text{context})\)) quickly generates a \(K\)-token draft.
2.