High Dimensional Data
- Applies to datasets with many features, ranging from a table with 100 columns to genomic data with millions of variables per individual.
- High dimensionality makes analysis and interpretation more difficult and can invalidate assumptions used by traditional statistical methods.
- Common ways to handle it are dimensionality reduction (e.g., PCA, SVD) and machine learning methods that cope with many features, often through regularization (e.g., random forests, gradient boosting, deep learning networks).
Definition
High dimensional data refers to data that has a large number of features or variables.
Explanation
High dimensional datasets contain many features per observation, in contrast to low dimensional data, which has only a few. As the feature count grows, traditional statistical methods often become poorly suited or break down entirely because they rely on assumptions that no longer hold in high dimensional settings. To address these challenges, practitioners commonly reduce the number of features while retaining information (dimensionality reduction) or use algorithms designed to handle many features and incorporate regularization to reduce overfitting and improve interpretability.
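As a minimal sketch of dimensionality reduction, the example below uses scikit-learn's PCA to project a synthetic 100-feature dataset down to 10 components; the dataset shape, random seed, and component count are illustrative assumptions, not from the original text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high dimensional data: 500 observations, 100 features (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Project onto the 10 directions of greatest variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (500, 100)
print(X_reduced.shape)  # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

Under the hood, scikit-learn computes PCA via singular value decomposition (SVD), which is why the two techniques are often mentioned together.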
Examples
Dataset with 100 features
A dataset with 100 columns or features would be considered high dimensional.
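As a quick sketch of what such a table looks like in code (the synthetic values and column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# 1,000 observations, each described by 100 features.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 100)),
                  columns=[f"feature_{i}" for i in range(100)])
print(df.shape)  # (1000, 100)
```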
Genetic testing / genomes
A dataset containing the results of genetic testing for a group of individuals is high dimensional: each person’s genome contains millions of variables.
Consumer survey
A consumer survey dataset can be high dimensional when each respondent provides answers to dozens or even hundreds of questions.
Notes or pitfalls
- Traditional statistical methods may be invalid or break down in high dimensional settings because their underlying assumptions no longer hold.
- High dimensional data increases the risk of overfitting; models with regularization or dimensionality reduction can help mitigate this, as in the sketch below.
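As one hedged illustration of that second point, an L1-regularized (lasso) regression drives most coefficients to exactly zero when only a few features carry signal; the synthetic data, penalty strength, and seeds below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data: 200 observations, 100 features, but only 5 informative ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
coef = np.zeros(100)
coef[:5] = 3.0  # only the first 5 features carry signal
y = X @ coef + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The L1 penalty zeroes out most coefficients, curbing overfitting
# and making the fitted model easier to interpret.
model = Lasso(alpha=0.1).fit(X_train, y_train)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print("test R^2:", model.score(X_test, y_test))
```

The sparse coefficient vector makes it clear which of the 100 features the model actually uses.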
Related terms
- Low dimensional data
- Dimensionality reduction
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- Regularization
- Random forests
- Gradient boosting
- Deep learning networks