Cloud Computing, Hadoop, MapReduce

This module introduces the Utility Cloud. It also describes MapReduce, a relatively new technique for scaling the processing of massive data sets in the cloud.

• Describe the principles of the utility cloud.
• Use the MapReduce approach to write pseudocode that manipulates large data on the cloud.
• Explain the elements of the MapReduce architecture.

Apache Hadoop

1. Discuss where the MapReduce paradigm fits into the Utility Cloud.
2. Discuss the characteristics of distributed algorithms that are best supported by the MapReduce paradigm versus those that are not. Provide some examples.

1. Suppose that our Hadoop system contains a large terabyte-sized file with π written out to 10¹² places. HDFS divides the file into many shards. Write MapReduce pseudocode to count how many times each three-digit combination of digits occurs in the decimal portion of π. For example, given π written as:


The result file should consist of <key, value> pairs that look like the following:

<141, 2>
<415, 2>
<159, 1>
<592, 2>
<926, 1>

Write the Map and Reduce pseudocode. Do not concern yourselves with three-digit combinations that span shards, where, for example, one digit is at the end of one shard and two digits are at the beginning of the next shard.

a) Map algorithm
b) Reduce algorithm

MapReduce, Hadoop & HDFS

This module provides an in-depth description of the Hadoop Distributed File System (HDFS) and discusses some design patterns for advanced MapReduce techniques.

• Describe the underpinnings of HDFS.
• Use some basic HDFS command line calls.
• Write HDFS code with data compression and decompression.
• Write MapReduce code using advanced design patterns.

1. Discuss modifications to the HDFS architecture that would enable better performance for operations that require random access to the data. Are these modifications worth it?

• Map algorithm
Map(key: null, value: source) {
    // source is the digit string of one shard
    for each (three-digit target in source)
        write(target, 1)
}

• Reduce algorithm
Reduce(key: target, values: list of numbers) {
    sum = 0
    for each (number in values)
        sum += number
    write(target, sum)
}
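
As a rough illustration, the Map and Reduce steps above can be simulated on a single machine in Python. This is a minimal sketch under stated assumptions, not Hadoop code: the function names map_shard, reduce_counts, and run_job are hypothetical, the shuffle phase is approximated with an in-memory dictionary, and the sample shards are made-up digit strings rather than actual digits of π. As in the assignment, combinations that span shard boundaries are ignored.

```python
from collections import defaultdict


def map_shard(shard):
    """Map step: emit (target, 1) for every three-digit window in one shard."""
    for i in range(len(shard) - 2):
        yield shard[i:i + 3], 1


def reduce_counts(target, values):
    """Reduce step: sum all counts emitted for one three-digit target."""
    return target, sum(values)


def run_job(shards):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for shard in shards:
        for target, count in map_shard(shard):
            grouped[target].append(count)
    return dict(reduce_counts(t, v) for t, v in grouped.items())


# Hypothetical sample shards (illustrative digits only, not taken from π).
SAMPLE_SHARDS = ["141415", "415"]

counts = run_job(SAMPLE_SHARDS)
for target in sorted(counts):
    print(f"<{target}, {counts[target]}>")
```

In a real Hadoop job the framework performs the grouping between the two steps, so only the bodies of map_shard and reduce_counts correspond to code a developer would write.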

