Cross Validation

  • Evaluates a model’s ability to generalize by repeatedly training and testing on different data subsets.
  • Reduces the risk of overfitting by measuring performance on unseen data.
  • Enables comparison of models, algorithms, or hyperparameters on the same dataset.

Cross validation is a method used in machine learning to evaluate the performance of a model on unseen data. It works by dividing the available data into multiple subsets, training the model on all but one of them, and testing it on the held-out subset. This process is repeated until each subset has served as the testing set at least once, and the final performance estimate is the average of the results across all iterations.

Because each iteration tests the model on a different unseen portion of the data, averaging the per-iteration results gives a more reliable estimate of expected performance on new data than a single train/test split. It also helps detect models that perform well on training data but poorly on unseen data (overfitting).
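As a minimal sketch, this whole cycle can be run with scikit-learn's cross_val_score; the iris dataset, logistic regression model, and five folds here are illustrative assumptions, not requirements of the method.

```python
# A minimal cross validation sketch using scikit-learn; the dataset,
# model choice, and cv=5 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Run 5 training/testing cycles; each fold is held out exactly once.
scores = cross_val_score(model, X, y, cv=5)

# Averaging the per-fold scores gives the overall performance estimate.
print(f"Per-fold accuracy: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean also shows how much the score varies from fold to fold.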

In k-fold cross validation, the data is divided into k equal-sized subsets, or folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold serving as the testing set exactly once, and the final performance is the average of the results across the k runs.
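The same procedure can be written out explicitly as a loop over folds; a sketch assuming scikit-learn's KFold splitter, with k=5, a decision tree, and synthetic data as illustrative choices:

```python
# A hand-rolled k-fold loop; k=5, the model, and the synthetic data
# are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, test on the one held-out fold.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Final performance is the average over the k iterations.
print(f"Mean accuracy over {kf.get_n_splits()} folds: {np.mean(scores):.3f}")
```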

Stratified k-fold cross validation is similar to k-fold cross validation, but it ensures that the class proportions of the full dataset are preserved in each fold. This is particularly useful for imbalanced datasets, where one class is significantly more represented than the others.
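A short sketch of the stratification effect, assuming scikit-learn's StratifiedKFold and a synthetic dataset with a roughly 9:1 class imbalance (both illustrative assumptions): each test fold keeps approximately the same positive-class fraction as the full dataset.

```python
# Stratified k-fold on an imbalanced dataset; the 9:1 class ratio and
# n_splits=5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold preserves the original class proportions.
    fold_ratio = y[test_idx].mean()
    print(f"Fold {i}: positive-class fraction in test set = {fold_ratio:.2f}")
```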

Common use cases

  • Evaluating model performance on unseen data to estimate generalization.
  • Avoiding overfitting by testing across multiple held-out subsets.
  • Comparing different models, algorithms, or hyperparameters on the same dataset to choose the best option.

Overfitting is a common problem in machine learning where a model performs well on training data but poorly on new data; cross validation helps detect and mitigate this by evaluating performance on unseen subsets, as the sketch after the list below shows.

Related terms

  • k-fold cross validation
  • Stratified k-fold cross validation
  • Overfitting
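As a sketch of the overfitting check mentioned above, the following compares an unconstrained decision tree's training accuracy to its cross-validated accuracy; the model and synthetic data are illustrative assumptions. A large gap between the two scores is the classic symptom of overfitting.

```python
# Using cross validation to flag overfitting; the model and data
# are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_informative=5, random_state=0)
model = DecisionTreeClassifier(random_state=0)

model.fit(X, y)
train_acc = model.score(X, y)                        # score on data it has seen
cv_acc = cross_val_score(model, X, y, cv=5).mean()   # score on held-out folds

# A large gap suggests the model memorizes rather than generalizes.
print(f"Training accuracy: {train_acc:.3f}")
print(f"Cross-validated accuracy: {cv_acc:.3f}")
```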