Feature Engineering

  • Use domain knowledge to create or transform inputs (features) from raw data to help models learn better.
  • Can involve creating new features (e.g., ratios) or removing irrelevant ones via feature selection.
  • A critical step in the data science process that can significantly affect model accuracy and complexity.

Feature engineering is the process of using domain knowledge to extract features from raw data that can be used to improve the performance of machine learning algorithms.

Feature engineering is a step in the data science workflow where practitioners apply their understanding of the problem domain to produce inputs (features) that are more informative for a model. This includes creating new features from existing data and selecting the most relevant features from a larger set. Effective feature engineering can reduce model complexity and substantially influence a model’s accuracy and effectiveness.

For example, when predicting housing prices, a data scientist might start with square footage and number of bedrooms as inputs. By applying domain knowledge, they can create a new feature representing the ratio of square footage to the number of bedrooms. That ratio describes how much space the home offers per bedroom, which neither raw input expresses on its own, and it may improve model performance.
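The ratio feature described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the field names `square_feet`, `bedrooms`, and `sqft_per_bedroom`, the helper functions, and the sample homes are all hypothetical. Returning `None` for a home with no bedrooms is one possible choice for the undefined case.

```python
from typing import Optional


def sqft_per_bedroom(square_feet: float, bedrooms: int) -> Optional[float]:
    """Return square footage per bedroom, or None when the ratio is undefined."""
    if bedrooms <= 0:
        # Assumption: a home with no bedrooms (e.g. a studio) gets a missing value
        # instead of raising ZeroDivisionError.
        return None
    return square_feet / bedrooms


def add_ratio_feature(rows):
    """Return copies of the input rows with the engineered ratio feature added."""
    return [
        {**row, "sqft_per_bedroom": sqft_per_bedroom(row["square_feet"], row["bedrooms"])}
        for row in rows
    ]


# Hypothetical sample data, invented for illustration only.
homes = [
    {"square_feet": 1500.0, "bedrooms": 3},
    {"square_feet": 900.0, "bedrooms": 2},
    {"square_feet": 450.0, "bedrooms": 0},
]

engineered = add_ratio_feature(homes)
for row in engineered:
    print(row)
```

How the missing value is handled afterwards (imputed, flagged with an extra indicator feature, or dropped) is a modelling decision that depends on the dataset.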

Feature selection to identify relevant features

Datasets often contain many features that are not useful for a specific problem. Feature selection techniques, such as mutual information or chi-squared tests, score each feature by how strongly it is associated with the target. Features with low scores can then be excluded, which helps reduce model complexity and can improve performance.
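As a rough sketch of how such a score works, the snippet below computes mutual information between discrete features and a discrete target using only the standard library, then ranks the features. The feature names (`has_garage`, `even_listing_id`), the target (`price_above_median`), and the data are toy values invented for illustration. In practice a library routine would usually be used, and continuous features need a different estimator.

```python
from collections import Counter
from math import log2


def mutual_information(feature, target):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))  # counts of (x, y) pairs
    count_x = Counter(feature)             # marginal counts of x
    count_y = Counter(target)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: 1 means the home sold above the median price.
price_above_median = [1, 1, 0, 0, 1, 0, 1, 0]

features = {
    # Constructed to match the target exactly, so it is maximally informative.
    "has_garage": [1, 1, 0, 0, 1, 0, 1, 0],
    # Constructed to be statistically independent of the target.
    "even_listing_id": [0, 1, 0, 1, 1, 0, 0, 1],
}

scores = {name: mutual_information(values, price_above_median)
          for name, values in features.items()}

# Rank features from most to least informative.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A chi-squared test would follow the same pattern: compute a per-feature association statistic against the target, then keep the highest-scoring features.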

  • Feature selection
  • Mutual information
  • Chi-squared tests
  • Domain knowledge
  • Machine learning
  • Data science process