High Dimensional Data
- Applies to datasets with many features, ranging from a table with 100 columns to genomic data with millions of variables per individual.
- High dimensionality makes analysis and interpretation more difficult and can invalidate assumptions used by traditional statistical methods.
- Common ways to handle it are dimensionality reduction (e.g., PCA, SVD) and machine learning methods that cope with many features, often through regularization (e.g., random forests, gradient boosting, deep learning networks).
Definition
High dimensional data refers to data that has a large number of features or variables.
Explanation
High dimensional datasets contain many features per observation, in contrast to low dimensional data, which has only a few. As the feature count grows, traditional statistical methods often become poorly suited or break down entirely because they rely on assumptions that no longer hold in high dimensional settings. To address these challenges, practitioners commonly reduce the number of features while retaining information (dimensionality reduction) or use algorithms designed to handle many features and incorporate regularization to reduce overfitting and improve interpretability.
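As a minimal sketch of dimensionality reduction, the example below uses scikit-learn's PCA to project a synthetic 100-feature dataset down to 10 components; the dataset shape, random seed, and component count are illustrative assumptions, not from the original text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high dimensional data: 500 observations, 100 features (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Project onto the 10 directions of greatest variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (500, 100)
print(X_reduced.shape)  # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

Under the hood, scikit-learn computes PCA via singular value decomposition (SVD), which is why the two techniques are often mentioned together.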
Examples
Dataset with 100 features
A dataset with 100 columns or features would be considered high dimensional.
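As a quick sketch of what such a table looks like in code (the synthetic values and column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# 1,000 observations, each described by 100 features.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 100)),
                  columns=[f"feature_{i}" for i in range(100)])
print(df.shape)  # (1000, 100)
```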
Genetic testing / genomes
A dataset containing the results of genetic testing for a group of individuals is high dimensional: each person’s genome contains millions of variables.
Consumer survey
A consumer survey dataset can be high dimensional when each respondent provides answers to dozens or even hundreds of questions.
Notes or pitfalls
- Traditional statistical methods may be invalid or break down in high dimensional settings because their underlying assumptions no longer hold.
- High dimensional data increases the risk of overfitting; models with regularization or dimensionality reduction can help mitigate this, as in the sketch below.
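As one hedged illustration of that second point, an L1-regularized (lasso) regression drives most coefficients to exactly zero when only a few features carry signal; the synthetic data, penalty strength, and seeds below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data: 200 observations, 100 features, but only 5 informative ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
coef = np.zeros(100)
coef[:5] = 3.0  # only the first 5 features carry signal
y = X @ coef + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The L1 penalty zeroes out most coefficients, curbing overfitting
# and making the fitted model easier to interpret.
model = Lasso(alpha=0.1).fit(X_train, y_train)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print("test R^2:", model.score(X_test, y_test))
```

The sparse coefficient vector makes it clear which of the 100 features the model actually uses.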
Related terms
- Low dimensional data
- Dimensionality reduction
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- Regularization
- Random forests
- Gradient boosting
- Deep learning networks