- Gain in-depth experience working with big data tools (MapReduce, Hive, and Spark).
- Solve challenging big data processing tasks by finding highly efficient solutions.
- Experience processing three different types of real data
a. Standard multi-attribute data (Bank data)
b. Time series data (Twitter feed data)
c. Bag of words data.
- Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for MapReduce, Hive and Spark (for Spark in particular, look under RDD; there are a lot of really useful API calls).
- [MapReduce] https://hadoop.apache.org/docs/stable/api/
- [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
- If you are not sure what a Spark API call does, write a small example and run it in the Spark shell.
a) In general, writing more efficient code (fewer reads/writes from/to HDFS and fewer data shuffles) will be rewarded with more marks.
b) All MapReduce code you submit must be able to be compiled using the command
javac -classpath `hadoop classpath` <code_files>
on the Cloudera VM you received from us without requiring the installation of additional components.
c) All MapReduce code you submit should be runnable using
hadoop jar <jar_uri> <hdfs_input_file> <hdfs_output_directory>
For task 2C you need to allow the user to specify two additional parameters: the x and y months, respectively.
d) Using multiple MapReduce phases may be appropriate for some of the subtasks. However, if you use multiple phases to solve a task, maintain a meaningful and logically consistent naming scheme for your files (e.g. Phase1.java, Phase2.java, …).
e) For Hive and Spark code submissions, ensure that all commands needed to accomplish the sub-task (i.e. ‘create table’ (Hive), loading data, AND the queries!) are in the same file.
f) Scalability of the code is very important, especially in terms of the memory requirements of the mappers and reducers. For example, a mapper that outputs the same key (say, abc) for every input sends all the data to a single reducer, no matter how many reducers you set. This effectively means you end up writing a sequential program, which is completely unacceptable and will result in zero marks for that subtask.
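To see why a constant key defeats parallelism, here is a small Python sketch of the hash-based routing of map output to reducers (the key names and reducer count are made up for illustration; Python's `hash()` stands in for Java's `hashCode()` in Hadoop's default HashPartitioner):

```python
def partition(key, num_reducers):
    # Route a key to a reducer the way a hash partitioner does:
    # same key -> same reducer, every time.
    return hash(key) % num_reducers

num_reducers = 4

# A mapper that always emits the constant key "abc": every record
# lands on the same reducer, so only one reducer does any work.
constant = {partition("abc", num_reducers) for _ in range(1000)}

# A mapper that emits varied keys: records are spread across
# (up to) all 4 reducers.
varied = {partition(f"word{i}", num_reducers) for i in range(1000)}

print(len(constant))  # number of distinct reducers used: 1
print(len(varied))    # number of distinct reducers used: up to 4
```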
g) This entire assignment can be done using the Cloudera virtual machines supplied in the labs and the supplied data sets without running out of memory. Note that task 3 is especially hard to do without running out of memory, but it is possible, since we have done it. So it is time to show your skills!
h) Using combiners or local aggregation (inside the mapper) for MapReduce tasks where appropriate will be rewarded with marks. We will be looking at the total amount of data shuffled and awarding higher marks to lower amount of data shuffled.
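The effect of in-mapper local aggregation on shuffle size can be sketched in plain Python (word counting is used as a stand-in example; the input list is hypothetical):

```python
from collections import Counter

# Hypothetical word list standing in for one mapper's input split.
words = ["apache", "spark", "apache", "hive", "apache", "spark"]

# Naive mapper: emits one (word, 1) pair per input word,
# so every pair is shuffled across the network.
naive_output = [(w, 1) for w in words]

# In-mapper combining: aggregate locally in a dict first, then emit
# one (word, count) pair per distinct word. The final counts are
# identical, but far less data is shuffled.
combined_output = list(Counter(words).items())

print(len(naive_output))     # 6 pairs shuffled
print(len(combined_output))  # 3 pairs shuffled
```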
i) Wherever appropriate, use the fact that the data is sorted according to the intermediate key to reduce the work of the mapper and/or reducer.
j) I am not too fussed about the layout of the output; as long as it looks similar to the example outputs for each task, that is good enough. The idea is not to spend too much time massaging the output into the right format, but instead to spend the time solving problems.
k) For Hive queries, we prefer answers that use fewer tables.
Do the entire assignment using the Cloudera VM. Do not use AWS.
- Look at the data files before you begin each task. Try to understand what you are dealing with! You may find the shell commands “cat” and “head” helpful.
- For each subtask we give very small example input and the corresponding output in the assignment specifications below. You should create input files that contain the same data as the example input and then see if your solution generates the same output.
- In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.
We will be doing some analytics on real data from a Portuguese banking institution. The data is related to their marketing campaign.
The data set used for this task can be found inside the bank directory of the assignment_datafiles.zip on LMS.
The data has the following attributes:

| Attribute number | Attribute name | Description |
|---|---|---|
| 2 | job | type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”) |
| 3 | marital | marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed) |
| 4 | education | (categorical: “unknown”, “secondary”, “primary”, “tertiary”) |
| 5 | default | has credit in default? (binary: “yes”, “no”) |
| 6 | balance | average yearly balance, in euros (numeric) |
| 7 | housing | has housing loan? (binary: “yes”, “no”) |
| 8 | loan | has personal loan? (binary: “yes”, “no”) |
| 9 | contact | contact communication type (categorical: “unknown”, “telephone”, “cellular”) |
| 10 | day | last contact day of the month (numeric) |
| 11 | month | last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) |
| 12 | duration | last contact duration, in seconds (numeric) |
| 13 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 14 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) |
| 15 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 16 | poutcome | outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”) |
| 17 | Term deposit | has the client subscribed a term deposit? (binary: “yes”, “no”) |
Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):
Using the entire bank data set downloaded from LMS please perform the following tasks. Please note we specify whether you should use [MapReduce] or [Hive] for each subtask at the beginning of each subtask.
a) [MapReduce] Report the number of clients of each job category. For the above small example data set you would report the following (output order is not important for this question):
management 2
technician 3
blue-collar 1
services 1
entrepreneur 1
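The map/shuffle/reduce logic for this count can be prototyped in plain Python before writing the Java job (the input list mirrors the job column of the small example; job is attribute 2):

```python
from collections import defaultdict

# Job values standing in for the job column of the small example data.
rows = ["management", "technician", "technician", "blue-collar",
        "services", "management", "technician", "entrepreneur"]

# Map: emit (job, 1). Shuffle: group values by key.
groups = defaultdict(list)
for job in rows:
    groups[job].append(1)

# Reduce: sum the ones for each job category.
counts = {job: sum(ones) for job, ones in groups.items()}
print(counts)
```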
b) [Hive] Report the rounded average yearly balance for all people in each education category. For the small example data set you would report the following (output order is not important for this question):
tertiary 1031
secondary 287
primary 10
unknown 1506
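In Hive this is essentially a GROUP BY with ROUND(AVG(balance)); the underlying aggregation can be prototyped in plain Python using the (education, balance) pairs from the small example:

```python
from collections import defaultdict

# (education, balance) pairs taken from the small example data.
data = [("tertiary", 2143), ("tertiary", 929), ("tertiary", 22),
        ("secondary", 829), ("secondary", 29), ("secondary", 2),
        ("primary", 10), ("unknown", 1506)]

# Accumulate a running [sum, count] per education category.
sums = defaultdict(lambda: [0, 0])
for edu, bal in data:
    sums[edu][0] += bal
    sums[edu][1] += 1

# Rounded average per category, mirroring ROUND(AVG(balance)).
avgs = {edu: round(total / n) for edu, (total, n) in sums.items()}
print(avgs)
```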
c) [Hive] For each marital status report the percentage of people who have a personal loan. Hint: you may need to use multiple queries or subqueries. For the small example data set you would report the following (output order is not important for this question):
Married 50%
Divorced 67%
Single 0%
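The percentage computation can be prototyped in Python using (marital, loan) pairs consistent with the small example:

```python
from collections import defaultdict

# (marital, loan) pairs consistent with the small example data.
data = [("married", "yes"), ("married", "yes"), ("married", "no"),
        ("married", "no"), ("divorced", "yes"), ("divorced", "yes"),
        ("divorced", "no"), ("single", "no")]

# Count people with a loan and total people per marital status.
tally = defaultdict(lambda: [0, 0])  # marital -> [with_loan, total]
for marital, loan in data:
    tally[marital][0] += (loan == "yes")
    tally[marital][1] += 1

# Rounded percentage per marital status.
pct = {m: round(100 * yes / total) for m, (yes, total) in tally.items()}
print(pct)
```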
d) [MapReduce] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity
Report the number of people in each of the above categories. For the small example data set you would report the following (output order is not important in this question):
Low 4
Medium 2
High 2
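The binning logic can be prototyped in Python with the balances from the small example:

```python
from collections import Counter

# Balance values taken from the small example data.
balances = [2143, 929, 22, 829, 29, 2, 10, 1506]

def category(balance):
    # Map a balance to its bin: Low (-inf, 500], Medium [501, 1500],
    # High [1501, +inf).
    if balance <= 500:
        return "Low"
    elif balance <= 1500:
        return "Medium"
    return "High"

counts = Counter(category(b) for b in balances)
print(dict(counts))
```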
e) [MapReduce] For each education category report a list of people in descending order of balance. For each person report the following attribute values: education category, balance, job, marital, loan. Note this subtask can be done using a single MapReduce task or multiple MapReduce tasks. For the small example data set you would report the following (output order for education does not matter, but order does matter for the attribute balance):
primary, 10, technician, married, no
secondary, 829, services, divorced, yes
secondary, 29, technician, divorced, yes
secondary, 2, entrepreneur, single, no
tertiary, 2143, management, married, yes
tertiary, 929, technician, married, yes
tertiary, 22, management, divorced, no
unknown, 1506, blue-collar, married, no
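The grouping plus descending sort can be prototyped in Python; in MapReduce this is typically done as a secondary sort on a composite (education, balance) key with a grouping comparator, as hinted in the comments:

```python
from itertools import groupby

# (education, balance, job, marital, loan) records from the small
# example data, in arbitrary input order.
records = [
    ("tertiary", 2143, "management", "married", "yes"),
    ("tertiary", 929, "technician", "married", "yes"),
    ("primary", 10, "technician", "married", "no"),
    ("unknown", 1506, "blue-collar", "married", "no"),
    ("secondary", 829, "services", "divorced", "yes"),
    ("secondary", 29, "technician", "divorced", "yes"),
    ("tertiary", 22, "management", "divorced", "no"),
    ("secondary", 2, "entrepreneur", "single", "no"),
]

# Sort on the composite key (education asc, balance desc) -- the
# analogue of a MapReduce secondary sort -- then group by education,
# as a grouping comparator would.
records.sort(key=lambda r: (r[0], -r[1]))
for edu, group in groupby(records, key=lambda r: r[0]):
    for r in group:
        print(", ".join(map(str, r)))
```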