
MapReduce

  • A programming model that processes very large datasets by splitting work across many machines.
  • Work is divided into a Map phase (produce intermediate key-value pairs) and a Reduce phase (aggregate those pairs).
  • Enables parallel, distributed computation for faster processing of large-scale data.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It is a framework for writing applications that process vast amounts of data efficiently by splitting the input into smaller chunks and processing them in parallel across many machines.

The MapReduce model has two main phases:

  • Map phase: Input data is divided into smaller chunks and each chunk is processed independently. The mapping step transforms input records into a set of intermediate key-value pairs.
  • Reduce phase: The intermediate key-value pairs are processed and reduced to a smaller set of key-value pairs, aggregating or summarizing the intermediate results.

By emitting intermediate key-value pairs during the Map phase and aggregating them during the Reduce phase, MapReduce enables parallel processing and distributed computation across a cluster. It is commonly applied to large-scale data analysis tasks.
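
As a minimal single-machine sketch of the model (not tied to Hadoop or any other framework), the classic word-count job below shows the moving parts: a map function emits (word, 1) pairs, a grouping step (the shuffle, which a real cluster performs across machines) collects values by key, and a reduce function sums the counts for each word.

    from collections import defaultdict

    # Map phase: turn each input record (a line of text)
    # into intermediate (key, value) pairs -- here, (word, 1).
    def map_phase(line):
        for word in line.split():
            yield (word.lower(), 1)

    # Shuffle: group all intermediate values by key.
    # In a real cluster this step moves data between machines.
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce phase: aggregate the values for each key -- here, sum the counts.
    def reduce_phase(key, values):
        return (key, sum(values))

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    results = [reduce_phase(key, values) for key, values in grouped.items()]
    print(results)
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('lazy', 1), ('dog', 1)]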

Suppose we have a large dataset of user records containing information such as user ID, name, email, and location, and we want to count how many users live in each location. In the Map phase, each record is processed independently and emitted as an intermediate key-value pair where the key is the user's location and the value is a count of 1. This produces one key-value pair per user, keyed by location.

In the Reduce phase, the intermediate pairs are grouped by key, so each reducer receives a location together with all of its counts and sums them. The result is a smaller set of key-value pairs where each key is a unique location and each value is the number of users in that location.
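
A minimal sketch of this job in plain Python, assuming hypothetical field names such as "user_id" and "location" (any record format would do):

    from collections import defaultdict

    # Hypothetical user records; the field names are assumptions for illustration.
    users = [
        {"user_id": 1, "name": "Ana",   "email": "ana@example.com",   "location": "Lisbon"},
        {"user_id": 2, "name": "Ben",   "email": "ben@example.com",   "location": "Berlin"},
        {"user_id": 3, "name": "Carla", "email": "carla@example.com", "location": "Lisbon"},
    ]

    # Map phase: emit (location, 1) for each user record.
    def map_user(record):
        yield (record["location"], 1)

    # Shuffle: group the intermediate counts by location key.
    groups = defaultdict(list)
    for record in users:
        for location, count in map_user(record):
            groups[location].append(count)

    # Reduce phase: sum the counts for each location.
    users_per_location = {loc: sum(counts) for loc, counts in groups.items()}
    print(users_per_location)  # {'Lisbon': 2, 'Berlin': 1}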

Given a large dataset of tweets containing the tweet text, timestamp, and user, suppose we want to count tweets by sentiment. In the Map phase, each tweet is processed independently: its text is scored as positive, negative, or neutral, and an intermediate key-value pair is emitted where the key is the sentiment label and the value is a count of 1.

In the Reduce phase, the intermediate pairs are grouped by sentiment label, and each reducer sums the counts for its label. The result is a set of key-value pairs where each key is a unique sentiment label and each value is the number of tweets with that sentiment.
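
The same pattern as a sketch, with a toy keyword-based classifier standing in for a real sentiment model (the classifier and field names are assumptions for illustration):

    from collections import defaultdict

    # Hypothetical stand-in for a real sentiment model.
    def classify_sentiment(text):
        positive = {"love", "great", "happy"}
        negative = {"hate", "bad", "sad"}
        words = set(text.lower().split())
        if words & positive:
            return "positive"
        if words & negative:
            return "negative"
        return "neutral"

    tweets = [
        {"text": "I love this!", "timestamp": "2024-01-01T10:00", "user": "a"},
        {"text": "bad day",      "timestamp": "2024-01-01T11:00", "user": "b"},
        {"text": "just coffee",  "timestamp": "2024-01-01T12:00", "user": "c"},
    ]

    # Map phase: emit (sentiment_label, 1) for each tweet.
    def map_tweet(tweet):
        yield (classify_sentiment(tweet["text"]), 1)

    # Shuffle and Reduce collapsed into one pass: group by sentiment
    # label and sum the counts per key.
    counts = defaultdict(int)
    for tweet in tweets:
        for label, count in map_tweet(tweet):
            counts[label] += count
    print(dict(counts))  # {'positive': 1, 'negative': 1, 'neutral': 1}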

MapReduce is commonly applied in domains such as:

  • Genomics
  • Machine learning
  • Web indexing