Bagging
- Train many models on different bootstrap samples (sampling with replacement) of the original dataset.
- Combine model outputs (e.g., averaging or majority voting) to produce a final prediction.
- Reduces variance and sensitivity to noise, which mitigates overfitting and improves accuracy and robustness.
Definition
Bagging, also known as bootstrap aggregation, is a machine learning technique that trains multiple models on different subsets of the same dataset and then combines their predictions to improve overall predictive performance.
Explanation
Bagging creates multiple subsets of the original dataset using bootstrapping (sampling with replacement). Each subset is used to train an independent model (for example, a decision tree). Because the bootstrap samples differ, each model learns slightly different patterns. To predict on new data, the input is passed to every trained model and the individual predictions are combined into a single final prediction: continuous predictions are typically averaged, while categorical outcomes are combined with a voting scheme. Bagging is particularly useful for complex, noisy, or high-variance datasets, because combining models trained on somewhat different data reduces variance and helps prevent overfitting.
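The procedure fits in a few lines of code. The following is a minimal from-scratch sketch for regression, assuming scikit-learn and NumPy are available; the dataset and the choice of 25 trees are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic training data: 100 samples, 3 features.
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Train 25 trees, each on its own bootstrap sample.
n_models = 25
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # draw indices with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    models.append(tree)

# Final prediction for new inputs: average the trees' outputs.
X_new = rng.normal(size=(5, 3))
y_pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(y_pred)
```

For a classification task, the averaging step would be replaced by a majority vote over the models' predicted classes.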
Examples
House price prediction example
Suppose we want to predict the price of a house using features such as square footage, number of bedrooms, and location. We have a dataset of 100 houses. Using bagging:
- Create multiple bootstrap samples of the 100-house dataset (sampling with replacement; some data points may appear multiple times in a sample, others may not appear).
- Train a model (for example, a decision tree) on each bootstrap sample.
- For a new house, obtain a predicted price from each model and combine the predictions to produce a final estimate; since price is continuous, the standard combination is averaging (a voting scheme would apply to a categorical target instead). See the sketch after this list.
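In practice, libraries package this workflow. Below is a minimal sketch of the house-price example using scikit-learn's BaggingRegressor; the feature values and the price formula are synthetic stand-ins for a real dataset.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# 100 synthetic houses: [square footage, bedrooms, location score].
# All values and the price formula below are made up for illustration.
X = np.column_stack([
    rng.uniform(500, 4000, size=100),  # square footage
    rng.integers(1, 6, size=100),      # number of bedrooms
    rng.uniform(0, 10, size=100),      # location desirability score
])
y = (100 * X[:, 0] + 20_000 * X[:, 1] + 15_000 * X[:, 2]
     + rng.normal(scale=10_000, size=100))  # noisy synthetic prices

# 50 decision trees, each fit on its own bootstrap sample.
# Note: the parameter is `estimator` in scikit-learn >= 1.2
# (`base_estimator` in older versions).
bagger = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    bootstrap=True,  # sample with replacement
    random_state=0,
)
bagger.fit(X, y)

# Predict the price of a new 2000 sq ft, 3-bedroom house.
new_house = np.array([[2000.0, 3.0, 7.5]])
print(bagger.predict(new_house))
```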
Use cases
- Datasets that are complex or contain a high degree of variance or noise.
- Situations where the goal is to reduce overfitting and improve the robustness and accuracy of predictions.
Notes or pitfalls
- Because bootstrap samples are drawn with replacement, a data point may appear multiple times within a single sample while other points are left out of that sample entirely.
- Common methods to combine model outputs are averaging (for regression) and majority voting (for classification).
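The first note is easy to verify empirically. The snippet below draws one bootstrap sample of size 100 and counts how many distinct points it contains; on average, about 63.2% of points appear at least once (1 - 1/e), so roughly a third are left out of any given sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

idx = rng.integers(0, n, size=n)  # one bootstrap sample of indices
included = np.unique(idx)         # distinct points that made it in
print(f"distinct points included: {len(included)} of {n}")
print(f"points left out entirely: {n - len(included)}")
# Expected: roughly 63 included, 37 left out.
```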
Related terms
- Bootstrap aggregation (same as bagging)
- Bootstrapping (sampling with replacement)
- Decision tree
- Averaging (prediction aggregation)
- Voting scheme (prediction aggregation)