Hadoop Assignment Help: CSE3BDC Big Data Tools Task 1

Introduction

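This is a big data assignment that requires processing large data sets with MapReduce, Hive and Spark.
Most of the workload lies in setting up the environment, and most of the time goes into importing the data. Debugging the code is another time sink.
Task 1 involves MapReduce and Hive programming.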

Objectives

  1. Gain in-depth experience playing around with big data tools (MapReduce, Hive and Spark).
  2. Solve challenging big data processing tasks by finding highly efficient solutions.
  3. Experience processing three different types of real data:
    a. Standard multi-attribute data (Bank data)
    b. Time series data (Twitter feed data)
    c. Bag of words data.
  4. Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for MapReduce, Hive and Spark (for Spark, look especially under RDD, which has many useful API calls).

Expected quality of solutions

a) In general, writing more efficient code (fewer reads from and writes to HDFS, and fewer data shuffles) will be rewarded with more marks.
b) All MapReduce code you submit must compile using the command

javac -classpath `hadoop classpath` <code_files>

on the Cloudera VM you received from us without requiring the installation of additional components.
c) All MapReduce code you submit should be runnable using

hadoop jar <jar_uri> <hdfs_input_file> <hdfs_output_directory>

For task 2C you need to allow the user to specify two additional parameters, the x and y months respectively.
d) Using multiple MapReduce phases may be appropriate for some of the subtasks. However, if you use multiple phases to solve a task, maintain a meaningful and logically consistent naming scheme for your files (e.g. Phase1.java, Phase2.java, …).
e) For Hive and Spark code submissions, ensure that all commands needed to accomplish the subtask (i.e. ‘create table’ (Hive), loading data AND queries!) are in the same file.
f) Scalability of the code is very important, especially in terms of the memory requirements of the mappers and reducers. For example, a mapper that outputs the same key for every input (say, it takes any string as input and always outputs the key abc) sends all the data to a single reducer, no matter how many reducers you set. This effectively means you end up writing a sequential program. This is completely unacceptable and will result in zero marks for that subtask.
g) This entire assignment can be done using the Cloudera virtual machines supplied in the labs and the supplied data sets without running out of memory. Note that task 3 is especially hard to do without running out of memory, but it is possible: we have done it. So it is time to show your skills!
h) Using combiners or local aggregation (inside the mapper) for MapReduce tasks where appropriate will be rewarded with marks. We will be looking at the total amount of data shuffled and awarding higher marks for smaller amounts of shuffled data.
i) Wherever appropriate, use the fact that the data is sorted by intermediate key to reduce the work of the mapper and/or reducer.
j) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is to spend your time solving problems, not massaging the output into the right format.
k) For Hive queries, we prefer answers that use fewer tables.
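
Item f) can be made concrete with a small simulation. The sketch below is plain Python, not Hadoop code. It assigns each key to a reducer with the hash-modulo rule used by Hadoop's default HashPartitioner, and it uses Java's String.hashCode formula as a stand-in hash (Hadoop's Text keys use their own byte-based hash, so real partition numbers will differ). The function names and the choice of four reducers are illustrative assumptions.

```python
from collections import Counter


def java_string_hash(s: str) -> int:
    """Java's String.hashCode() formula, as a signed 32-bit integer (stand-in hash)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def partition(key: str, num_reducers: int) -> int:
    """Hash-modulo rule: (hash & Integer.MAX_VALUE) % numReduceTasks."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def reducer_load(keys, num_reducers):
    """Count how many records each reducer index would receive."""
    return Counter(partition(k, num_reducers) for k in keys)


# Job values from the small example data set (lower case).
jobs = ["management", "technician", "entrepreneur", "blue-collar",
        "services", "technician", "management", "technician"]

# A mapper that always emits the key "abc": one reducer gets everything.
constant_load = reducer_load(["abc"] * len(jobs), 4)

# A mapper that emits the job as key: records can spread over several reducers.
job_load = reducer_load(jobs, 4)

print("constant key:", dict(constant_load))
print("job as key:  ", dict(job_load))
```

With a constant key the load dictionary has a single entry regardless of the reducer count, which is the sequential behaviour item f) warns about.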

Do the entire assignment using the Cloudera VM. Do not use AWS.

Tips:

  1. Look at the data files before you begin each task. Try to understand what you are dealing with! You may find the shell commands “cat” and “head” helpful.
  2. For each subtask we give very small example input and the corresponding output in the assignment specifications below. You should create input files that contain the same data as the example input and then see if your solution generates the same output.
  3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.

Task 1: Analysing Bank Data

We will be doing some analytics on real data from a Portuguese banking institution. The data is related to their marketing campaign.
The data set used for this task can be found inside the bank directory of the assignment_datafiles.zip on LMS.
The data has the following attributes:

No.  Name          Description
1    age           numeric
2    job           type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
3    marital       marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
4    education     (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
5    default       has credit in default? (binary: “yes”, “no”)
6    balance       average yearly balance, in euros (numeric)
7    housing       has housing loan? (binary: “yes”, “no”)
8    loan          has personal loan? (binary: “yes”, “no”)
9    contact       contact communication type (categorical: “unknown”, “telephone”, “cellular”)
10   day           last contact day of the month (numeric)
11   month         last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12   duration      last contact duration, in seconds (numeric)
13   campaign      number of contacts performed during this campaign and for this client (numeric, includes last contact)
14   pdays         number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15   previous      number of contacts performed before this campaign and for this client (numeric)
16   poutcome      outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
17   Term deposit  has the client subscribed to a term deposit? (binary: “yes”, “no”)

Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):

job           marital   education  balance  loan
management    Married   tertiary   2143     Yes
technician    Divorced  secondary  29       Yes
entrepreneur  Single    secondary  2        No
blue-collar   Married   unknown    1506     No
services      Divorced  secondary  829      Yes
technician    Married   tertiary   929      Yes
management    Divorced  tertiary   22       No
technician    Married   primary    10       No

Using the entire bank data set downloaded from LMS, please perform the following tasks. The tool to use, [MapReduce] or [Hive], is specified at the beginning of each subtask.
a) [MapReduce] Report the number of clients in each job category. For the above small example data set you would report the following (output order is not important for this question):

management 2
technician 3
blue-collar 1
services 1
entrepreneur 1
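
For subtask a), the local aggregation rewarded under item h) of the quality criteria can be sketched as a plain-Python simulation. This is not Hadoop code, and the function names and the split of the example rows into two input splits are assumptions made for illustration. It compares the number of key/value pairs that would be shuffled with and without aggregating inside the mapper.

```python
from collections import Counter


def map_plain(records):
    """Emit one (job, 1) pair per input record."""
    return [(job, 1) for job in records]


def map_with_local_aggregation(records):
    """Aggregate inside the mapper and emit one (job, count) pair per distinct job."""
    return sorted(Counter(records).items())


def reduce_counts(pairs):
    """Sum the partial counts for each job."""
    totals = Counter()
    for job, n in pairs:
        totals[job] += n
    return dict(totals)


# The eight example rows (job column, lower case), split across two mappers.
split_1 = ["management", "technician", "entrepreneur", "blue-collar"]
split_2 = ["services", "technician", "management", "technician"]

plain = map_plain(split_1) + map_plain(split_2)
combined = map_with_local_aggregation(split_1) + map_with_local_aggregation(split_2)

print("pairs shuffled without aggregation:", len(plain))
print("pairs shuffled with aggregation:   ", len(combined))
print(reduce_counts(combined))
```

Both variants reduce to the same counts. The saving is small on eight rows, but on the full data set each mapper would emit at most one pair per job category instead of one pair per client.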

b) [Hive] Report the rounded average yearly balance for all people in each education category. For the small example data set you would report the following (output order is not important for this question):

tertiary 1031
secondary 287
primary 10
unknown 1506
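
In the spirit of tip 2, a reference computation can help when checking Hive output against the small example. The sketch below is plain Python, not the Hive query itself. It assumes the value being averaged is the balance attribute and that ties round half up; the function name is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (education, balance) pairs from the small example data set.
ROWS = [
    ("tertiary", 2143), ("secondary", 29), ("secondary", 2), ("unknown", 1506),
    ("secondary", 829), ("tertiary", 929), ("tertiary", 22), ("primary", 10),
]


def rounded_average_by_category(rows):
    """Return {category: rounded mean balance}, keeping only a sum and a count per key."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for category, balance in rows:
        sums[category] += balance
        counts[category] += 1
    result = {}
    for category in sums:
        mean = Decimal(sums[category]) / Decimal(counts[category])
        # Assumption: ties round half up (none occur in the example data).
        result[category] = int(mean.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return result


averages = rounded_average_by_category(ROWS)
for category, value in averages.items():
    print(category, value)
```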

c) [Hive] For each marital status, report the percentage of people who have a personal loan. Hint: you may need to use multiple queries or subqueries. For the small example data set you would report the following (output order is not important for this question):

Married 50%
Divorced 67%
Single 0%
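
The expected percentages can be cross-checked with a short plain-Python computation. This is a sketch for verifying results on the small example, not the Hive query. It counts the total and the loan holders per marital status in a single pass; the function name and the case-insensitive comparison on the loan value are assumptions.

```python
from collections import defaultdict

# (marital, loan) pairs from the small example data set.
ROWS = [
    ("Married", "Yes"), ("Divorced", "Yes"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "Yes"), ("Divorced", "No"), ("Married", "No"),
]


def loan_percentage_by_marital(rows):
    """Return {marital status: 'NN%'} using one total and one conditional count per key."""
    totals = defaultdict(int)
    with_loan = defaultdict(int)
    for marital, loan in rows:
        totals[marital] += 1
        if loan.lower() == "yes":
            with_loan[marital] += 1
    return {m: f"{round(100 * with_loan[m] / totals[m])}%" for m in totals}


percentages = loan_percentage_by_marital(ROWS)
for marital, pct in percentages.items():
    print(marital, pct)
```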

d) [MapReduce] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity
Report the number of people in each of the above categories. For the small example data set you would report the following (output order is not important in this question):

Low 4
Medium 2
High 2
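
The category boundaries can be expressed as a small function that a mapper would apply to each balance before emitting its key. The sketch below is plain Python with a hypothetical function name, and it assumes balances are integers, as in the example data.

```python
from collections import Counter


def balance_category(balance: int) -> str:
    """Map a balance to Low (<= 500), Medium (501 to 1500) or High (>= 1501)."""
    if balance <= 500:
        return "Low"
    if balance <= 1500:
        return "Medium"
    return "High"


# Balance column of the small example data set.
balances = [2143, 29, 2, 1506, 829, 929, 22, 10]

category_counts = Counter(balance_category(b) for b in balances)
for name in ("Low", "Medium", "High"):
    print(name, category_counts[name])
```

Because only three distinct keys exist, at most three reducers receive data, so aggregating counts inside the mapper matters more here than adding reducers.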

e) [MapReduce] For each education category report a list of people in descending order of balance. For each person report the following attribute values: education category, balance, job, marital, loan. Note: this subtask can be done using one or more MapReduce jobs. For the small example data set you would report the following (output order for education does not matter but order does matter for the attribute balance):

primary, 10, technician, married, no

secondary, 829, services, divorced, yes
secondary, 29, technician, divorced, yes
secondary, 2, entrepreneur, single, no

tertiary, 2143, management, married, yes
tertiary, 929, technician, married, yes
tertiary, 22, management, divorced, no

unknown, 1506, blue-collar, married, no
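
One way to obtain this ordering without buffering a whole education category in reducer memory is to let the framework's sort do the work, by sorting on a composite key of education and balance (often called a secondary sort). The sketch below only simulates that ordering in plain Python. The Client type and function name are hypothetical, and in Hadoop the same idea would need a composite key with a matching partitioner and grouping comparator, which are not shown.

```python
from typing import NamedTuple


class Client(NamedTuple):
    job: str
    marital: str
    education: str
    balance: int
    loan: str


# The small example data set (values in lower case).
CLIENTS = [
    Client("management", "married", "tertiary", 2143, "yes"),
    Client("technician", "divorced", "secondary", 29, "yes"),
    Client("entrepreneur", "single", "secondary", 2, "no"),
    Client("blue-collar", "married", "unknown", 1506, "no"),
    Client("services", "divorced", "secondary", 829, "yes"),
    Client("technician", "married", "tertiary", 929, "yes"),
    Client("management", "divorced", "tertiary", 22, "no"),
    Client("technician", "married", "primary", 10, "no"),
]


def sorted_report(clients):
    """Order by (education ascending, balance descending) and format one line per client."""
    ordered = sorted(clients, key=lambda c: (c.education, -c.balance))
    return [f"{c.education}, {c.balance}, {c.job}, {c.marital}, {c.loan}" for c in ordered]


report = sorted_report(CLIENTS)
for line in report:
    print(line)
```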