Hadoop Assignment Help: CA618 Map Reduce



This assignment assesses your understanding of the Map Reduce framework and your ability to write a distributed program with it.


This assignment consists of two parts: a theoretical part (assessed by a quiz) and a practical part.

Part 1: Quiz

A closed-book Mylo quiz will be held during lecture 8. It carries a weight of 3%. The multiple-choice questions will be drawn from the slides for lectures 6 and 7.

Part 2: Practical Part

Here, you need to implement a Map Reduce program for Hadoop that analyses the given weather data. This part of the assignment consists of two sub-tasks: a basic level and an advanced level.


The input data is a set of .csv files, one per year. Each file contains rows describing the weather conditions at different weather stations on different days of that year. The data comes from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/. There are at least two measurements each day, one for the maximum temperature (TMAX) and one for the minimum temperature (TMIN), and sometimes one for the precipitation (PRCP). Each row contains the following relevant information:

  1. the weather station ID
  2. the date in the format yyyymmdd
  3. the type of measurement (for this homework we care about the maximum temperature, TMAX, and the minimum temperature, TMIN)
  4. the temperature in tenths of a degree Celsius (e.g. -90 = -9.0 deg. C., -184 = -18.4 deg. C.)
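
To make the row layout concrete, here is a minimal Python sketch of how one row could be split into the four fields listed above. It assumes the four fields are the first four comma-separated columns, in the order given, and that any further columns can be ignored. The function name parse_row and the sample row are made up for illustration and are not taken from the data set.

```python
import csv
from io import StringIO

def parse_row(line):
    """Split one CSV row into (station_id, year, element, degrees_c)."""
    fields = next(csv.reader(StringIO(line)))
    # Assumption: the four relevant fields are the first four columns.
    station_id, date, element, raw_value = fields[:4]
    year = int(date[:4])                 # date is yyyymmdd
    degrees_c = int(raw_value) / 10.0    # value is stored in tenths of a degree
    return station_id, year, element, degrees_c

# Illustrative row only, not real data.
print(parse_row("XX000000001,17890115,TMAX,-90,,,E,"))
```

The division by 10 is only needed if you want degrees Celsius; the raw integer can be kept as-is when the output is expected in the same unit as the input.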

Outline of Tasks

Basic level: Finding Average

In the first task, your goal is to write a Map Reduce program that finds the average maximum temperature at each station in each year. The input to your program is the set of csv files for different years provided to you. The output should have rows with three fields: StationID, Year, AverageTemp. For example, a sample output file will look like:

ITE00100554    1789    -63
ITE00100554    1789    -90
GM000010962    1789    4
EZE00100082    1789    -103
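
One possible shape for this job, sketched in plain Python rather than Hadoop code so that it runs without a cluster: the mapper keeps only TMAX rows and emits the value under a (station, year) key, and the reducer averages the values for each key. The function names, the run_job stand-in for Hadoop's shuffle phase, and the input rows are all assumptions made for illustration. Values are left in tenths of a degree.

```python
from collections import defaultdict

def map_tmax(line):
    """Mapper: emit ((station, year), value) for TMAX rows only."""
    fields = line.strip().split(",")
    if len(fields) >= 4 and fields[2] == "TMAX":
        station, date, _, value = fields[:4]
        yield (station, date[:4]), int(value)

def reduce_average(key, values):
    """Reducer: average all values that share one (station, year) key."""
    values = list(values)
    return key, sum(values) / len(values)

def run_job(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_tmax(line):
            groups[key].append(value)
    return [reduce_average(key, groups[key]) for key in sorted(groups)]

# Made-up rows for illustration only.
rows = [
    "AAA00000001,17890101,TMAX,-60,,,E,",
    "AAA00000001,17890101,TMIN,-120,,,E,",
    "AAA00000001,17890102,TMAX,-40,,,E,",
    "BBB00000002,17890101,TMAX,10,,,E,",
]
for (station, year), average in run_job(rows):
    print(f"{station}\t{year}\t{average}")
```

Using the pair (station, year) as the key is what makes the framework deliver every reading for one station and year to the same reduce call.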

Advanced level: Finding Similarity Between Different Stations

The goal of this task is to implement a MapReduce program that can find similarity between different weather stations. Similarity between two stations is calculated based on the following:

You can use the output of the previous task as the input to this task. The output of this task should be in the following format:

weatherStationID1    weatherStationID2    SimilarityScore
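
The pairing step can be sketched independently of how the score itself is defined. The Python sketch below groups the rows from the first task by station and emits each unordered pair of stations once, with the scoring function passed in as a parameter. The function placeholder_similarity is NOT the measure required by this assignment; it is an arbitrary stand-in so that the example runs. All names and input rows are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def load_profiles(lines):
    """Group Task 1 rows (station, year, average) into one dict per station."""
    profiles = defaultdict(dict)
    for line in lines:
        station, year, average = line.split()
        profiles[station][year] = float(average)
    return profiles

def score_pairs(profiles, similarity):
    """Emit (station1, station2, score) once per unordered pair of stations."""
    for a, b in combinations(sorted(profiles), 2):
        yield a, b, similarity(profiles[a], profiles[b])

def placeholder_similarity(p, q):
    """PLACEHOLDER ONLY: negative mean absolute difference over shared years."""
    shared = sorted(set(p) & set(q))
    if not shared:
        return None  # no common year, so no score
    return -sum(abs(p[y] - q[y]) for y in shared) / len(shared)

# Made-up Task 1 output for illustration only.
task1_rows = ["AAA 1789 -50", "BBB 1789 10", "CCC 1789 -40"]
pairs = list(score_pairs(load_profiles(task1_rows), placeholder_similarity))
for a, b, score in pairs:
    print(f"{a}\t{b}\t{score}")
```

Replace placeholder_similarity with the measure the assignment specifies; the grouping and pairing logic does not depend on it.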


a) Source code for both tasks
b) A report explaining your map/reduce programs. If you applied any optimisation to improve performance, such as using a combiner to reduce the number of keys, describe it in this explanation. If you took inspiration from existing MapReduce programs to complete these tasks, please reference them.
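
A note on combiners for the averaging task, with a small Python sketch: an average of partial averages is generally not the overall average, so a combiner for this job would typically pass along (sum, count) pairs and leave the division to the reducer. In Hadoop the combiner's output type must match the mapper's output type, which is why the sketch has the mapper emit (value, 1). The function names and the numbers are made up for illustration.

```python
def merge(partials):
    """Combiner step: add up (sum, count) pairs for one key."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count

def finish(partial):
    """Final reducer step: turn a merged (sum, count) pair into an average."""
    total, count = partial
    return total / count

# Readings for one (station, year) key, split across two hypothetical map tasks.
split_a = [-60, -40, -20]
split_b = [10]

# Each mapper emits (value, 1); each combiner merges its own split.
combined_a = merge([(v, 1) for v in split_a])
combined_b = merge([(v, 1) for v in split_b])

# The reducer merges the combiner outputs and divides once at the end.
correct = finish(merge([combined_a, combined_b]))

# Averaging the per-split averages weights the splits equally, which is wrong.
naive = (sum(split_a) / len(split_a) + sum(split_b) / len(split_b)) / 2

print(correct, naive)
```

Because merge takes and returns the same kind of pair, it can run zero, one, or several times before the reducer without changing the result, which is the property a combiner needs.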