Jaccard Coefficient

TL;DR

Quantifies similarity between two sets by comparing shared elements to total unique elements.
Computed as the size of the intersection divided by the size of the union.
Produces a single score between 0 (no shared elements) and 1 (identical sets); for example, {1, 2, 3} and {2, 3, 4} score 0.5.

Definition

The Jaccard coefficient is a measure of similarity between two sets. It is calculated by dividing the size of the intersection of the two sets (the number of elements common to both) by the size of their union (the total number of unique elements across both sets).

Formally:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
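
Because |A ∪ B| = |A| + |B| - |A ∩ B| (inclusion-exclusion), the same value can also be written without forming the union explicitly. This is offered as an equivalent restatement of the formula above, not a different measure:

J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

For instance, with A = {1, 2, 3} and B = {2, 3, 4}, this gives 2 / (3 + 3 - 2) = 0.5.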

Explanation

To compute the Jaccard coefficient for two sets A and B:

Determine A ∩ B, the elements common to both sets, and count them.
Determine A ∪ B, the unique elements across both sets, and count them.
Divide the count of the intersection by the count of the union to obtain the coefficient.

The result is a single value between 0 and 1: 0 when the sets share no elements, and 1 when they are identical. If both sets are empty, the ratio is 0/0 and the coefficient is undefined.
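
The steps above can be sketched as a small Python function. This is a minimal illustration, not a reference implementation: the name jaccard is hypothetical, and returning 1.0 when both sets are empty is an assumed convention, since the formula itself does not define that case.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)

    # Step 1: elements common to both sets.
    intersection = set_a & set_b
    # Step 2: unique elements across both sets.
    union = set_a | set_b

    if not union:
        # Both sets are empty, so the ratio would be 0/0.
        # Returning 1.0 here is an assumption, not part of the definition.
        return 1.0

    # Step 3: divide the intersection count by the union count.
    return len(intersection) / len(union)
```

Converting the inputs with set() means duplicates in a list or tuple are ignored, which matches the "unique elements" wording of the definition.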

Examples

Example: Set A and Set B

Set A contains the elements 1, 2, and 3.
Set B contains the elements 2, 3, and 4.

Steps:

Intersection: the common elements are 2 and 3 (|A ∩ B| = 2).
Union: the unique elements are 1, 2, 3, and 4 (|A ∪ B| = 4).
Jaccard coefficient: 2/4, or 0.5.
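
The same worked example can be checked with Python's built-in set type. The variable names below are chosen for illustration only:

```python
A = {1, 2, 3}
B = {2, 3, 4}

intersection = A & B  # elements in both sets: {2, 3}
union = A | B         # unique elements across both sets: {1, 2, 3, 4}

coefficient = len(intersection) / len(union)
print(coefficient)  # 0.5
```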

Use cases

Natural language processing: compare two sentences by dividing the number of unique words they share by the number of unique words across both sentences.
Data mining: compare two datasets by dividing the number of unique data points they share by the number of unique data points across both datasets.
Machine learning: use the Jaccard coefficient as a similarity measure wherever an algorithm or analysis needs to compare sets.
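
As a rough sketch of the natural language processing use case, the function below compares two sentences by their unique words. The function name sentence_jaccard and the tokenization (lowercasing and splitting on whitespace, with no punctuation handling) are assumptions made for illustration; real text usually needs more careful preprocessing.

```python
def sentence_jaccard(sentence_1, sentence_2):
    """Jaccard coefficient over the unique words of two sentences."""
    # Assumed tokenization: lowercase, then split on whitespace.
    words_1 = set(sentence_1.lower().split())
    words_2 = set(sentence_2.lower().split())

    union = words_1 | words_2
    if not union:
        # Both sentences are empty; treating them as identical is an assumption.
        return 1.0

    return len(words_1 & words_2) / len(union)


score = sentence_jaccard("the cat sat on the mat", "the dog sat on the log")
print(score)  # 3 shared words out of 7 unique words, i.e. 3/7
```

Here the shared words are "the", "sat", and "on", and the union holds seven unique words, so the score is 3/7 (about 0.43).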