Evaluation Metrics
- Quantitative measures for assessing model or algorithm performance on a dataset.
- Common examples include accuracy and F1 score.
- F1 score accounts for false positives and false negatives and can be preferable when classes are imbalanced.
Definition
Evaluation metrics are quantitative measures used to assess the performance of a model or algorithm on a given dataset. They provide a way to gauge a model’s effectiveness and can help determine which model is best suited for a given problem.
Explanation
- Accuracy is the ratio of correct predictions to the total number of predictions a model makes. It is simple and intuitive, and is often used as a measure of a model’s performance.
- Accuracy can be misleading when class distributions are imbalanced: a model that predicts only the majority class can achieve high accuracy despite making no useful predictions for the minority class.
- F1 score is the harmonic mean of precision and recall and accounts for both false positives and false negatives.
- Precision is the ratio of true positive predictions to all positive predictions.
- Recall is the ratio of true positive predictions to all actual positive examples.
- The F1 score ranges from 0 to 1, with a higher value indicating better performance (see the sketch after this list).
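A minimal sketch of these four quantities, assuming scikit-learn is installed (the label vectors below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4 / 6

print("accuracy: ", accuracy_score(y_true, y_pred))  # 0.7
print("precision:", precision)                       # 0.8
print("recall:   ", recall)                          # ~0.667
# F1 is the harmonic mean of precision and recall
print("f1:       ", f1_score(y_true, y_pred))        # ~0.727
print("by hand:  ", 2 * precision * recall / (precision + recall))
```

The last two lines print the same value, confirming the harmonic-mean relationship between precision, recall, and F1.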
Examples
Accuracy example
If a model predicts the correct label for 90 out of 100 examples, its accuracy would be 90%.
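The same arithmetic in plain Python (numbers taken from the example above):

```python
correct, total = 90, 100
accuracy = correct / total
print(f"accuracy = {accuracy:.0%}")  # accuracy = 90%
```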
Class imbalance example
If a dataset has 95% of examples belonging to one class and only 5% belonging to the other class, a model that always predicts the majority class would have 95% accuracy, even though it makes no useful predictions for the minority class.
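A short sketch of this failure mode, assuming scikit-learn is available (the 95/5 split mirrors the example above, and the always-majority baseline is hypothetical):

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5   # 95 majority-class examples, 5 minority-class examples
y_pred = [0] * 100            # a baseline that always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.95
print("recall:  ", recall_score(y_true, y_pred))               # 0.0, no minority example found
print("f1:      ", f1_score(y_true, y_pred, zero_division=0))  # 0.0
```

Accuracy looks excellent here, while recall and F1 expose that the baseline never identifies a minority-class example.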
Precision and recall example
- If a model makes 100 positive predictions, but only 80 of them are correct, its precision would be 80%.
- If there are 100 positive examples in the dataset and the model correctly predicts only 80 of them, its recall would be 80% (both cases are worked through below).
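Worked through in plain Python, treating the two bullets as describing the same model (TP, FP, and FN are the standard confusion-matrix counts):

```python
true_positives = 80
false_positives = 100 - true_positives  # 20 of the 100 positive predictions are wrong
false_negatives = 100 - true_positives  # 20 of the 100 actual positives are missed

precision = true_positives / (true_positives + false_positives)  # 0.8
recall = true_positives / (true_positives + false_negatives)     # 0.8
f1 = 2 * precision * recall / (precision + recall)               # 0.8

print(precision, recall, f1)
```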
Use cases
- Comparing and selecting models by assessing their effectiveness on a dataset.
Notes or pitfalls
- Accuracy does not account for class imbalance and can give misleadingly high values when one class dominates.
Related terms
- Accuracy
- F1 score
- Precision
- Recall
- False positives
- False negatives