Abstract

This is a PDF of a Notion document for Homework 2 of the Cloud_Computing-Spring-2025 course at Shiraz University. The full Notion document is available at the link below:

https://www.notion.so/Cloud-Computing-HW2-1c8fff288ff18035bc00c236285a6e3f?pvs=4

Part 1

Definition and explanation of core concepts

Here are explanations of the core concepts in MapReduce:

Map

The Map phase takes an input dataset and applies a function to each individual element, transforming it into intermediate key-value pairs. The input data for the map function can be files (e.g., text, binary, etc.). These files are split into chunks (blocks), and each chunk is processed by a separate map task, with the tasks running in parallel. The output of the Map function is a set of key-value pairs, typically unordered; each key can appear multiple times with different associated values.
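As a rough illustration (not Hadoop's actual API), a word-count style map function could look like the Python sketch below; the function and variable names are made up for the example.

# Minimal sketch of a Map function (word count), assuming each input
# record is one line of text. Names are illustrative, not Hadoop APIs.
def map_words(line):
    """Emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

# Example: two input splits, processed independently (in parallel in practice).
split_a = ["the cat sat", "the dog ran"]
split_b = ["the cat ran"]
intermediate = [pair for line in split_a + split_b for pair in map_words(line)]
# intermediate is an unordered list of pairs, e.g. ("the", 1), ("cat", 1), ...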

Shuffle

The Shuffle phase sorts and groups the key-value pairs output by the Map phase. It redistributes the data by key so that all values associated with the same key are grouped together. This phase involves significant data transfer between machines in the cluster (from map tasks to reduce tasks) and is often the most resource-intensive part of a MapReduce job. Shuffling is commonly implemented by hashing each key and sending the pair to the reduce node selected by that hash, so that every occurrence of a key lands on the same reducer.
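A minimal sketch of this hash-based partitioning, assuming the common hash(key) mod R scheme with R reducers (again, illustrative names rather than Hadoop's internals):

# Sketch of shuffle-style partitioning and grouping by key.
from collections import defaultdict

def partition(intermediate_pairs, num_reducers):
    """Route each (key, value) pair to a reducer chosen by a hash of the key,
    then group values per key inside each partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate_pairs:
        r = hash(key) % num_reducers       # choose the target reduce task
        partitions[r][key].append(value)   # group values under the key
    return [dict(p) for p in partitions]

# Example: pairs with the same key always end up on the same reducer.
pairs = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1)]
print(partition(pairs, 2))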

Reduce

The Reduce phase takes the grouped key-value pairs from the shuffle phase and applies a function to aggregate them. Each reducer processes all the values for a particular key.
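Continuing the word-count sketch, a reduce function might simply sum the grouped values for each key; the grouped input below is assumed to come from a shuffle step like the one sketched above.

# Minimal sketch of a Reduce function (word count): each call receives one
# key together with all of its grouped values from the shuffle.
def reduce_counts(word, counts):
    """Aggregate all values for a single key into one output pair."""
    return (word, sum(counts))

# Example, using one grouped partition as produced by the shuffle sketch:
grouped = {"the": [1, 1, 1], "cat": [1, 1], "dog": [1]}
results = [reduce_counts(word, counts) for word, counts in grouped.items()]
print(results)   # [("the", 3), ("cat", 2), ("dog", 1)]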

Hadoop: Core Modules

HDFS

HDFS is the storage layer of Hadoop, designed to store vast amounts of data across a distributed environment. It splits large files into fixed-size blocks and stores these blocks across multiple machines in a cluster.
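To make the fixed-size blocking concrete, the short sketch below works out how a file would be divided, assuming the commonly used 128 MB default block size; this is plain arithmetic, not the HDFS API.

# Illustrative sketch of HDFS-style fixed-size blocking.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a typical HDFS default

def split_into_blocks(file_size_bytes):
    """Return the number of blocks and the size of the last (partial) block."""
    num_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last_block = file_size_bytes - (num_blocks - 1) * BLOCK_SIZE
    return num_blocks, last_block

# A 300 MB file becomes 3 blocks (128 MB + 128 MB + 44 MB), each stored
# (and typically replicated) on different machines in the cluster.
print(split_into_blocks(300 * 1024 * 1024))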

Core Concepts:

YARN