[Paper Express] Data Selection for Language Models via Importance Resampling (DSIR)
README. Data Selection (DS) aims to select a given number of samples from a large, unlabeled dataset in order to train a capable model in a target domain. When training language models, practical DS methods need to select efficiently from raw text corpora containing trillions of tokens. This paper,