The Six Elements of NLP Experiments. Part 1: Datasets
Experiments are the foundation of Natural Language Processing (NLP). Just like in the experimental sciences, we test our proposed methods and verify our theories with experiments. Describing, planning, and conducting experiments are therefore core skills for every NLP researcher.
This article concerns describing NLP experiments.
There are six basic elements that compose almost all NLP experimental recipes: (1) Datasets; (2) Models; (3) Training procedure; (4) Inference procedure; (5) Evaluation method; and (6) Compute. By understanding them, you will be able to describe an experimental setup completely for replication or for detailed reporting.
We start with the first element: Datasets.
Learning Objectives
- You will learn about the three key aspects of an NLP dataset.
- You will learn a bullet-point formula for describing an NLP dataset effectively.
- You will see the formula applied to real datasets from my own experience, including complex ones such as OpenAI's TL;DR dataset and the UltraFeedback dataset.
- You will understand why a principled description improves clarity with an example from my past presentation.
The Three Key Aspects
When we talk about datasets as NLP researchers, there are but a few things that really matter.
$$\text{Dataset}=\text{Task} + \text{Organization} + \text{Annotation}$$
Let me explain.
Task. A dataset is usually curated around a specific task, that is, a mapping from inputs to desired outputs that the model should learn. Most tasks have standard names. Here are some example datasets and their tasks:
- The TL;DR dataset is a summarization dataset. The task is to summarize Reddit posts into short summaries, generally fewer than 50 words.
- The WMT21 dataset is a machine translation dataset. The task is to translate a sentence from the source language (e.g., English) into the target language (e.g., Chinese).
- The IMDB dataset is a sentiment analysis dataset. The task is to determine whether a review holds positive or negative views.
Organization. The data in a dataset is usually organized into multiple subsets or splits, each reserved for a different purpose. The most common split configuration is the three-way partition: “train”, “validation”, and “test”.
- The "train" split is for training models. Think of it as the “exercise book” where the correct answers (known as “ground-truth”, “gold reference”, or “target”) are provided.
- The "validation" split is for validating that the training procedures work as expected and for selecting the best model produced from training. Think of it as a set of "mock exams" you are allowed to test your models on.
- The "test" split is for evaluating the final performance and should be strictly held-out in developing the model. Think of it as the "final exam" that scores your model. You are of course forbidden to peek the final exam in development, otherwise you will be cheating!
For more complicated datasets such as benchmarks consisting of many tasks (e.g., multi-modal knowledge retrieval benchmark like M2KR), the dataset could have many subsets, each with different split configurations. We will use TL;DR as an example later to explain how to describe datsets with subset organization.
Don't worry, multiple subsets with individual splits configurations is as complicated as it gets for NLP dataset organization.
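If the dataset is hosted on the Hugging Face Hub (as all the examples in this article are), you can look up its subsets and splits programmatically before writing them down. Below is a minimal sketch using the `datasets` library; note that the Hub calls subsets "configs", and the dataset name here is only a placeholder.

```python
from datasets import get_dataset_config_names, get_dataset_split_names

# Any Hub dataset works here; IMDB is used purely as an example.
dataset_name = "stanfordnlp/imdb"

# On the Hugging Face Hub, subsets are called "configs".
for subset in get_dataset_config_names(dataset_name):
    splits = get_dataset_split_names(dataset_name, subset)
    print(f"subset {subset!r} has splits: {splits}")
```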
Annotation. Datasets usually provide annotations (i.e., labels or scores from human or AI annotators) for the purposes of training and evaluation. For example, the E-VQA dataset contains annotations of the ground-truth document needed to answer each question, which makes it easy to evaluate a retriever's performance.
It is important to make clear what is annotated in the dataset and by whom, especially as an increasing number of works use AI annotators.
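To find out what is annotated and how, it is usually enough to look at the column names, the feature types, and one example row. Here is a small sketch along the same lines, again using the Hugging Face `datasets` library with IMDB as a stand-in:

```python
from datasets import load_dataset

# IMDB as a stand-in; swap in the dataset you are describing.
ds = load_dataset("stanfordnlp/imdb", split="train")

print(ds.column_names)  # which fields hold inputs and which hold annotations
print(ds.features)      # field types, including the names of class labels
print(ds[0])            # one row, to see what an annotated example looks like
```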
Bullet-Point Description Formula
Below is the bullet-point formula for describing an NLP dataset.
- DatasetName is a TaskName dataset. The task is to TaskDescription.
- The dataset contains K splits: Split1-Name, …, SplitK-Name.
- The Split1-Name split (Split1-Size) contains DataType annotated by Annotator.
- <Repeat for all splits>
- Link: <Link to dataset>
When the dataset organization consists of subsets and splits, we describe the higher-level subset first and then the lower-level splits in a top-down manner.
Examples
Simple Example: IMDB
Let's now see how to apply the formula, starting with the relatively simple IMDB dataset before moving on to the more challenging TL;DR dataset.
- IMDB is a sentiment analysis dataset. The task is to determine whether a review holds positive or negative views towards the movie.
- The dataset contains three splits: "train", "test", and "unsupervised".
- The "train" (25k) and "test" (25k) splits contain reviews and their sentiments annotated by human annotators.
- The "unsupervised" split (50k) contains reviews only. There is no annotation.
- Link: https://huggingface.co/datasets/stanfordnlp/imdb
You can see that the bullet-point formula is not followed exactly. The formula is a guide: following it ensures that you will not miss anything major in your description, and you can of course use common sense to improve readability. In this case, I merged the "train" and "test" splits into one bullet point.
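If you want to double-check the sizes and fields in a description like the one above, a few lines against the Hugging Face copy linked in the last bullet are enough; this is just a sanity-check sketch, not part of the formula itself.

```python
from datasets import load_dataset

imdb = load_dataset("stanfordnlp/imdb")  # a DatasetDict keyed by split name

# Print each split's name, size, and columns to verify the bullet points above.
for split_name, split in imdb.items():
    print(split_name, len(split), split.column_names)
```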
Challenging Example: TL;DR
Now let's apply the formula to the much more challenging TL;DR dataset. Note how the description of subsets and splits proceeds in a top-down manner.
- TL;DR is a summarization dataset. The task is to generate short summaries (generally <50 words) of much longer Reddit posts (generally >250 words).
- It consists of two subsets: "comparisons" and "axis".
- The "comparisons" subset contains pairs of preferred and dispreferred summaries with preferences annotated by human.
- This subset has three splits: "train", "valid-1" and "valid-2".
- The "train" (92.9K pairs) split contains pairs of preferred and dispreferred summaries.
- The "valid1" (51.6K pairs) split contains pairs of preferred and dispreferred summaries. It also contains a confidence field indicating how confident the preference is. The split is used for validation.
- The "valid2" (34.4K pairs) split has the same data and annotation scheme as "valid1". The split is for held-out evaluation (like a "test" split).
- The "axis" subset contains summaries with human-annotated Likert scores on a scale from 1 to 10 concerning the quality of the summary on four dimensions.
- This subset has two splits: "validation" and "test".
- The "validation" split contains 8.6K annotated summaries. These summaries are taken from the "valid1" split in the "comparison" subset. The annotation includes Likert scores on four dimensions. The four dimensions are "overall", "accuracy", "coverage", and "coherence". Likert scores are given by human on a scale from 1 to 10.
- The "test" split contains 6.3K summaries taken from the "valid2" split in the "comparison subset. These summaries are annotated similarly as that in the "validation split"
- Link: https://huggingface.co/datasets/openai/summarize_from_feedback
In describing more complicated and less standard datasets like TL;DR, use common sense to decide whether and how much to expand on DataType. For example, for the "axis" subset, I supply additional information about where the summaries are taken from and how they are annotated. To do this, I simply write one short sentence for each piece of additional information.
The English language offers clauses, adjectives, parentheses, etc. for providing additional information. You are free to use whatever suits your style. For example, below is another way to describe the "validation" split in the "axis" subset:
- The "validation" split contains 8.6K summaries taken from the "valid1" split in the "comparisons" subset, each labeled with human-annotated Likert scores (i.e., scores on a scale from 1 to 10) on four dimensions ("overall", "accuracy", "coverage", and "coherence").
A principled way of getting to a correct and clear description is to always start with the "one short sentence per point" approach and then rephrase for conciseness.
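The same kind of sanity check works for a dataset with subsets like TL;DR. The sketch below reads the subset and split names from the Hugging Face Hub rather than hardcoding them, so it can also serve as a starting point when you draft a description from scratch.

```python
from datasets import (
    get_dataset_config_names,
    get_dataset_split_names,
    load_dataset,
)

name = "openai/summarize_from_feedback"

# Walk over every subset ("comparisons", "axis") and each of its splits,
# printing the size and columns needed for the bullet-point description.
for subset in get_dataset_config_names(name):
    for split in get_dataset_split_names(name, subset):
        ds = load_dataset(name, subset, split=split)
        print(subset, split, len(ds), ds.column_names)
```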
Comparison with "Free-Form" Description
Let me show you, with an example from my own experience, how the principled approach above can produce a clearer description than describing a dataset "freely".
In my second year of PhD (2023), I looked at preference optimization, a topic that gained a lot of traction at the time with a focus on learning from pair-wise comparison data (i.e., pairs of "preferred" and "dispreferred" responses).
To give you an idea of the complexity of these datasets' development history: the UltraFeedback (UF) dataset was a common source of comparison data, and people have developed many "variants" of UF. For example, the HuggingFace team came up with the UltraFeedback Binarized (UFB) variant by trimming UF into pairs of strongly-preferred and strongly-dispreferred responses. The Snorkel team, on the other hand, published the "Snorkel-Mistral-PairRM-DPO-Dataset", which is named nothing like a variant of UltraFeedback but was in fact synthesized by taking the prompts in UF, generating responses with the Mistral model, and labeling preferences with the PairRM reward model.
Having researched all this, I described the UF dataset to my supervisor as follows:
The UltraFeedback (UF) Dataset
- Overview UF consists of 64k prompts. Each prompt has four model completions sampled from different LLMs and is rated by GPT-4 on four dimensions: instruction-following, truthfulness, honesty, and helpfulness. UF was originally designed to train reward or critic models.
- Data Curation
- 64k prompts are collected from diverse sources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN; see here for dataset statistics).
- For each prompt, multiple LLMs are used to generate 4 different responses, resulting in 256k samples. See here for the model list.
- GPT-4 is asked to annotate the collected samples on 4 aspects, namely instruction-following, truthfulness, honesty, and helpfulness.
- Usage in the Literature UF is a popular data source of comparison pairs in the LLM alignment literature.
- Zephyr-7B is DPO-trained on UltraFeedback-Binarized (UFB), a binarized version of UF which takes the response with the highest overall score as the chosen response and randomly selects one of the remaining responses as the rejected response.
- Notus, another fine-tuned version of Zephyr-7B-SFT, uses a different strategy to binarize UF to collect preference pairs. Instead of using the overall score, they binarize based on the mean score of the 4 aspects in UF. The resulting dataset is here.
- Implication for our work We can employ other binarization strategies to test out our DPO variants.
- Known issues There are a few hundred completions with incorrect labels, and several prompts were sourced from the TruthfulQA benchmark, which can lead to contamination with public leaderboards. Both issues have been resolved in HF’s UFB dataset here.
You can see why the presentation did not go well: I tried to achieve too many things at once. For example, the description covers not one but three datasets: UF, UFB, and Notus, all under the subheading of UF. Although the latter two are derived from UF, mentioning them here is neither helpful for understanding the UF dataset nor sufficient for understanding UFB and Notus themselves.
Another problem is that subjective speculation ("Implication for our work") is mixed with objective description. This is a common mistake of entry-level researchers, and it can cause confusion. In this case, the speculation is vague: what exactly are "other binarization strategies", and how exactly can we use them to test our proposed methods? These are burning questions raised by the speculation but left unanswered. Not to mention that they are far off the topic of describing the UltraFeedback dataset and so should not be put under this subheading in the first place.
Let's compare the free-form description above with the principled description below:
The UltraFeedback (UF) Dataset
- UltraFeedback is a preference dataset. The task is to generate responses to general prompts and instructions that are aligned with human preferences.
- The dataset contains just one split: "train".
- The "train" split has 64k rows. Each row contains an instruction, a list of completions generated from 4 models. The names of the models are given in the "model" field. Each completion is annotated by GPT-4 on four aspects: instruction-following, truthfulness, honesty and helpfulness.
- Link: https://huggingface.co/datasets/openbmb/UltraFeedback
Applying the formula yields a focused and clear description of UltraFeedback. Compared to the description in my presentation, this version communicates the key aspects of UF more effectively and sets the groundwork for further discussion without introducing any ambiguity.
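As with the earlier examples, the bullet points can be checked against the dataset itself. Below is a minimal sketch; the field names ("completions", and the per-completion "model" field) are taken from the description above and may need adjusting if the schema on the Hub differs.

```python
from datasets import load_dataset

uf = load_dataset("openbmb/UltraFeedback", split="train")

print(len(uf))          # should match the 64k rows in the description
print(uf.column_names)  # top-level fields, e.g. the instruction and completions

# Peek at one row to see how the per-completion model names and GPT-4
# annotations are stored (field name "completions" assumed from the description).
row = uf[0]
completions = row.get("completions", [])
if completions:
    print(list(completions[0].keys()))
```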
Of course, information like "known issues" can be very important, and it was good to spot it when reviewing the dataset. You should definitely keep this information in your own notes. But think twice before including it in your presentation: a piece of text should make only one focused point. Following this principle, it may be better to raise the "known issues" when discussing how to use UF for training models.
Conclusion
In this article, I explained the three key aspects of an NLP dataset and a bullet-point formula for writing effective descriptions. I demonstrated how to apply the formula with a simple example (IMDB) and a challenging example (TL;DR). Finally, using a real example from one of my past presentations, I showed how this principled, formula-based approach to describing datasets can improve clarity compared to a free-form description. By now, you have all the knowledge and tools to describe an NLP dataset without ambiguity.
This article is part of the Entering NLP Research series, which I write to help aspiring students get started in NLP research. Consider subscribing to this website if you find it helpful.