Feature Engineering

TL;DR

Use domain knowledge to create or transform inputs (features) from raw data to help models learn better.
Can involve creating new features (e.g., ratios) or removing irrelevant ones via feature selection.
A critical step in the data science process that can significantly affect model accuracy and complexity.

Definition

Feature engineering is the process of using domain knowledge to extract features from raw data that can be used to improve the performance of machine learning algorithms.

Explanation

Feature engineering is a step in the data science workflow where practitioners apply their understanding of the problem domain to produce inputs (features) that are more informative for a model. This includes creating new features from existing data and selecting the most relevant features from a larger set. Effective feature engineering can reduce model complexity and substantially influence a model’s accuracy and effectiveness.

Examples

Creating new features from existing data

In predicting housing prices, a data scientist might start with square footage and number of bedrooms as inputs. By applying domain knowledge, they can create a new feature representing the ratio of square footage to the number of bedrooms. That ratio conveys relative size information and may improve model performance.

Feature selection to identify relevant features

Datasets often contain many features that are not useful for a specific problem. Feature selection techniques—such as mutual information or chi-squared tests—can identify the most important features and exclude irrelevant ones, helping to reduce model complexity and improve performance.

Feature selection
Mutual information
Chi-squared tests
Domain knowledge
Machine learning
Data science process