Test Set
- A test set is a held-out portion of data used for the final evaluation of a model’s performance.
- It must remain separate from training and validation data so the model has not seen it during model development.
- The test set should be representative (often stratified) and sized appropriately to estimate generalization and detect overfitting.
Definition
A test set is a subset of a dataset used to evaluate the performance of a machine learning model on unseen data, giving an estimate of its accuracy and generalization ability. It is kept separate from the training and validation sets, so the model has not seen any of the test data during model development.
Explanation
During model development, data is commonly split into training, validation, and test sets. The model is trained on the training set, and hyperparameters are tuned using the validation set. After these steps, the test set provides an unbiased assessment of final model performance, because the model has not been fitted or tuned on that data. Using a separate test set helps ensure the model generalizes to new data and exposes overfitting that may have occurred during training.
The test set should reflect the real-world distribution the model will encounter. Stratifying the test set (keeping class proportions similar to the overall dataset) helps maintain representativeness. The appropriate size of a test set depends on dataset size and model complexity; larger datasets typically allow for larger test sets, but very large test sets may yield diminishing returns.
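A minimal sketch of such a three-way split, assuming scikit-learn is available; the helper name `three_way_split` and the default 70/10/20 proportions are illustrative choices, not a standard API:

```python
from sklearn.model_selection import train_test_split

def three_way_split(X, y, test_size=0.2, val_size=0.1, seed=42):
    # First carve off the test set; stratify=y keeps class proportions
    # in the test set close to those of the full dataset.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Then split the remainder into training and validation sets; val_size
    # is rescaled because X_rest is already smaller than the full dataset.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test
```

Stratifying both splits keeps class proportions consistent across all three subsets.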
Examples
Example 1
Imagine you are building a machine learning model to classify whether an email is spam or not spam. You have a dataset of 10,000 emails, which you split into a training set (7,000 emails), a validation set (1,000 emails), and a test set (2,000 emails). You train your model on the training set and use the validation set to fine-tune the hyperparameters. Once you are satisfied with the performance of your model on the validation set, you use the test set to evaluate the model’s overall performance. This allows you to determine how well the model will perform on unseen data, and gives you a good idea of its generalization ability.
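A hedged sketch of this workflow, assuming scikit-learn and NumPy; the synthetic `emails`/`labels` arrays and the logistic-regression model are stand-ins for real data and whatever classifier you would actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: 10,000 "emails" as random numeric feature vectors with
# binary spam/not-spam labels (a real pipeline would use text features).
rng = np.random.default_rng(0)
emails = rng.normal(size=(10_000, 50))
labels = rng.integers(0, 2, size=10_000)

# Hold out 2,000 emails (20%) as the test set, stratified on the label.
X_rest, X_test, y_rest, y_test = train_test_split(
    emails, labels, test_size=0.2, stratify=labels, random_state=0
)
# Take 1,000 of the remaining 8,000 (12.5%) as the validation set,
# leaving 7,000 emails for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
# The test set is evaluated exactly once, after all tuning is finished.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```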
Example 2
Another example might be building a machine learning model to predict the price of a house based on features such as location, size, and number of bedrooms. You have a dataset of 100,000 houses, which you split into a training set (70,000 houses), a validation set (10,000 houses), and a test set (20,000 houses). As before, you train on the training set, tune hyperparameters on the validation set, and then use the test set once to estimate how well the model will perform on unseen data.
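An analogous sketch for the regression case, again with synthetic stand-in data; stratification is dropped because the target is continuous, and ridge regression is just an illustrative model:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in data: 100,000 "houses" with a few numeric features and a
# continuous price target (no class labels, so no stratification).
rng = np.random.default_rng(0)
houses = rng.normal(size=(100_000, 8))
prices = houses @ rng.normal(size=8) + rng.normal(scale=0.5, size=100_000)

X_rest, X_test, y_rest, y_test = train_test_split(
    houses, prices, test_size=0.2, random_state=0     # 20,000-house test set
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=0   # 10,000-house validation set
)

model = Ridge().fit(X_train, y_train)
print("validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```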
Notes or pitfalls
- A separate test set helps prevent and detect overfitting: if a model performs well on training/validation but poorly on the test set, it has likely overfitted the training data (see the sketch after this list).
- Ensure the test set is representative of real-world data; stratifying by class proportions (e.g., spam vs. not spam) is commonly recommended.
- Test set size should balance the need for reliable evaluation with the need for sufficient training data; larger datasets permit larger test sets, but overly large test sets may offer limited additional insight.
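One rough way to check for that train/test gap, sketched as a helper that works with any fitted scikit-learn estimator; the 0.05 threshold is purely illustrative:

```python
def overfit_gap(model, X_train, y_train, X_test, y_test, threshold=0.05):
    # Compare the model's score on data it was fitted to against its
    # score on held-out data; a large gap suggests memorization.
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"train: {train_score:.3f}  test: {test_score:.3f}")
    return (train_score - test_score) > threshold
```

A large gap does not prove overfitting on its own, but it is a useful first flag before digging into learning curves or regularization.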
Related terms
- Training set
- Validation set
- Overfitting
- Generalization
- Stratification