Hadoop assignment: CSE3BDC Big Data Tools Task 1

Introduction

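A big data assignment that requires processing large data sets with MapReduce, Hive and Spark.
Most of the workload lies in setting up the environment, and most of the time goes into importing the data; debugging the code also takes a fair amount of time.
Task 1 covers programming with MapReduce and Hive.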

Objectives

Gain in-depth experience playing around with big data tools (MapReduce, Hive and Spark).
Solve challenging big data processing tasks by finding highly efficient solutions.
Experience processing three different types of real data:
a. Standard multi-attribute data (Bank data)
b. Time series data (Twitter feed data)
c. Bag of words data.
Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for MapReduce, Hive and Spark (for Spark, look especially under RDD; there are a lot of really useful API calls).
- [MapReduce] https://hadoop.apache.org/docs/stable/api/
- [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
- If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.

Expected quality of solutions

a) In general, writing more efficient code (fewer reads from and writes to HDFS, and fewer data shuffles) will be rewarded with more marks.
b) All MapReduce code you submit must be able to be compiled using the command

javac -classpath `hadoop classpath` <code_files>

on the Cloudera VM you received from us without requiring the installation of additional components.
c) All MapReduce code you submit should be runnable using

hadoop jar <jar_uri> <hdfs_input_file> <hdfs_output_directory>

For task 2C you need to allow the user to specify two additional parameters, the x and y months respectively.
d) Using multiple MapReduce phases may be appropriate for some of the subtasks. However, if you use multiple phases to solve a task, maintain a meaningful and logically consistent naming scheme for your files (e.g. Phase1.java, Phase2.java, …).
e) For Hive and Spark code submissions, ensure that all commands needed to accomplish the subtask (i.e. ‘create table’ (Hive), loading data AND queries!) are in the same file.
f) Scalability of the code is very important, especially in terms of the memory requirements of the mappers and reducers. For example, a mapper that outputs the same key for every input (say it takes any string as input and always outputs the key abc) sends all the data to a single reducer, no matter how many reducers you set. This effectively means you end up writing a sequential program. This is completely unacceptable and will result in zero marks for that subtask.
g) This entire assignment can be done using the Cloudera virtual machines supplied in the labs and the supplied data sets without running out of memory. Note that task 3 is especially hard to do without running out of memory, but it is possible since we have done it. So it is time to show your skills!
h) Using combiners or local aggregation (inside the mapper) for MapReduce tasks where appropriate will be rewarded with marks. We will be looking at the total amount of data shuffled and awarding higher marks for lower amounts of data shuffled.
i) Wherever appropriate, use the fact that the data is sorted according to the intermediate key to reduce the work of the mapper and/or reducer.
j) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is not to spend too much time massaging the output into the right format but instead to spend the time solving problems.
k) For Hive queries, we prefer answers that use fewer tables.
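
Points f) and h) can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class ShuffleSketch and its method names are hypothetical, and String.hashCode() stands in for the hash of a Hadoop key type. partitionFor mirrors the formula used by Hadoop's default hash partitioner (non-negative key hash modulo the number of reducers), which is why a constant key always lands on the same reducer. localAggregate shows the idea behind in-mapper aggregation, where a mapper emits one partial count per distinct key instead of one record per input line.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle behaviour of a MapReduce job.
final class ShuffleSketch {

    // Non-negative hash of the key, modulo the number of reducers.
    // String.hashCode() is an assumption standing in for the Hadoop key's hash.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Number of records each reducer would receive for the given map output keys.
    static int[] recordsPerReducer(List<String> keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : keys) {
            load[partitionFor(key, numReducers)]++;
        }
        return load;
    }

    // In-mapper local aggregation: one partial count per distinct key,
    // so the mapper shuffles partial.size() records instead of keys.size().
    static Map<String, Integer> localAggregate(List<String> keys) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String key : keys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }
}
```

For eight records that all carry the key abc and four reducers, recordsPerReducer reports all eight records on one reducer and none on the other three. The memory used by localAggregate grows with the number of distinct keys a mapper sees, so this style of aggregation suits low-cardinality keys best.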

Do the entire assignment using the Cloudera VM. Do not use AWS.

Tips:

Look at the data files before you begin each task. Try to understand what you are dealing with! You may find the shell commands “cat” and “head” helpful.
For each subtask we give very small example input and the corresponding output in the assignment specifications below. You should create input files that contain the same data as the example input and then see if your solution generates the same output.
In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.

Task 1: Analysing Bank Data

We will be doing some analytics on real data from a Portuguese banking institution. The data is related to their marketing campaign.
The data set used for this task can be found inside the bank directory of the assignment_datafiles.zip on LMS.
The data has the following attributes:

Attribute number	Attribute name	Description
1	age	numeric
2	job	type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
3	marital	marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
4	education	(categorical: “unknown”, “secondary”, “primary”, “tertiary”)
5	default	has credit in default? (binary: “yes”, “no”)
6	balance	average yearly balance, in euros (numeric)
7	housing	has housing loan? (binary: “yes”, “no”)
8	loan	has personal loan? (binary: “yes”, “no”)
9	contact	contact communication type (categorical: “unknown”, “telephone”, “cellular”)
10	day	last contact day of the month (numeric)
11	month	last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12	duration	last contact duration, in seconds (numeric)
13	campaign	number of contacts performed during this campaign and for this client (numeric, includes last contact)
14	pdays	number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15	previous	number of contacts performed before this campaign and for this client (numeric)
16	poutcome	outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)
17	Term deposit	has the client subscribed a term deposit? (binary: “yes”, “no”)

Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):

job	marital	education	balance	loan
management	Married	tertiary	2143	Yes
technician	Divorced	secondary	29	Yes
entrepreneur	Single	secondary	2	No
blue-collar	Married	unknown	1506	No
services	Divorced	secondary	829	Yes
technician	Married	tertiary	929	Yes
Management	Divorced	tertiary	22	No
technician	Married	primary	10	No

Using the entire bank data set downloaded from LMS, please perform the following tasks. Note that each subtask states at its beginning whether you should use [MapReduce] or [Hive].
a) [MapReduce] Report the number of clients in each job category. For the above small example data set you would report the following (output order is not important for this question):

management 2
technician 3
blue-collar 1
services 1
entrepreneur 1
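
The counting logic behind subtask a) can be sketched in plain Java before it is ported to Hadoop mapper and reducer classes. This is only an illustration under stated assumptions: JobCountSketch and its method names are hypothetical, the delimiter and the column index are passed in because the file layout should be checked against the real data first, and keys are trimmed and lower-cased because capitalisation varies between rows of the small example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java stand-in for the map and reduce steps of a job count.
final class JobCountSketch {

    // "Map" step: extract the job field of one record and normalise it.
    // The delimiter is interpreted as a regular expression by String.split.
    static String mapToJob(String line, String delimiter, int jobIndex) {
        String[] fields = line.split(delimiter);
        return fields[jobIndex].trim().toLowerCase(Locale.ROOT);
    }

    // "Reduce" step: add up one occurrence per record for each job key.
    static Map<String, Integer> countJobs(List<String> lines, String delimiter, int jobIndex) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapToJob(line, delimiter, jobIndex), 1, Integer::sum);
        }
        return counts;
    }
}
```

For the five-column example, job is column index 0; in the full 17-attribute layout job is attribute 2, which would be index 1 with zero-based indexing.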

b) [Hive] Report the rounded average yearly balance for all people in each education category. For the small example data set you would report the following (output order is not important for this question):

tertiary 1031
secondary 287
primary 10
unknown 1506
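
The example figures can be checked by hand. Assuming the quantity being averaged is the balance column of the small example and that rounding is to the nearest whole number, the arithmetic works out as follows:

```latex
\begin{align*}
\text{tertiary}  &: \frac{2143 + 929 + 22}{3} = \frac{3094}{3} \approx 1031.33 \rightarrow 1031 \\
\text{secondary} &: \frac{29 + 2 + 829}{3} = \frac{860}{3} \approx 286.67 \rightarrow 287 \\
\text{primary}   &: \frac{10}{1} = 10 \\
\text{unknown}   &: \frac{1506}{1} = 1506
\end{align*}
```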

c) [Hive] For each marital status, report the percentage of people who have a personal loan. Hint: you may need to use multiple queries or subqueries. For the small example data set you would report the following (output order is not important for this question):

Married 50%
Divorced 67%
Single 0%
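
As a hand check of the example, each percentage is the number of people with a personal loan divided by the number of people with that marital status. The 67% figure suggests rounding to the nearest whole percent, which is an inference from the example rather than a stated rule:

```latex
\begin{align*}
\text{Married}  &: \frac{2}{4} = 50\% \\
\text{Divorced} &: \frac{2}{3} \approx 66.7\% \rightarrow 67\% \\
\text{Single}   &: \frac{0}{1} = 0\%
\end{align*}
```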

d) [MapReduce] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity
Report the number of people in each of the above categories. For the small example data set you would report the following (output order is not important in this question):

Low 4
Medium 2
High 2
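
The banding rule can be sketched in plain Java as below. BalanceBandSketch and its method names are hypothetical, and the sketch assumes balances are whole numbers, since the stated boundaries (500/501 and 1500/1501) leave no room for fractional values. In a MapReduce job, bandOf would be the mapper's key function; because there are only three possible keys, this subtask is a natural fit for a combiner or in-mapper aggregation as described in point h).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plain-Java stand-in for the balance banding of subtask d).
final class BalanceBandSketch {

    // Maps a balance to its band label using the boundaries from the task.
    static String bandOf(long balance) {
        if (balance <= 500) {
            return "Low";
        }
        if (balance <= 1500) {
            return "Medium";
        }
        return "High";
    }

    // Counts how many balances fall into each band; empty bands report 0.
    static Map<String, Integer> countBands(long[] balances) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Low", 0);
        counts.put("Medium", 0);
        counts.put("High", 0);
        for (long balance : balances) {
            counts.merge(bandOf(balance), 1, Integer::sum);
        }
        return counts;
    }
}
```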

e) [MapReduce] For each education category, report a list of people in descending order of balance. For each person report the following attribute values: education category, balance, job, marital, loan. Note that this subtask can be done using a single MapReduce task or multiple MapReduce tasks. For the small example data set you would report the following (output order for education does not matter, but order does matter for the attribute balance):

primary, 10, technician, married, no

secondary, 829, services, divorced, yes
secondary, 29, technician, divorced, yes
secondary, 2, entrepreneur, single, no

tertiary, 2143, management, married, yes
tertiary, 929, technician, married, yes
tertiary, 22, management, divorced, no

unknown, 1506, blue-collar, married, no
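
One possible way to approach subtask e) is to let the sort order of a composite key (education, balance) do the work, in the spirit of point i). The plain-Java sketch below only demonstrates that ordering: BalanceOrderSketch, Client and SHUFFLE_ORDER are hypothetical names, and the in-memory sort is a stand-in for the sort that the framework performs on intermediate keys during the shuffle. It is not a scalable solution by itself, and other designs may be equally valid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical plain-Java illustration of a composite-key sort order.
final class BalanceOrderSketch {

    // One record restricted to the attributes the subtask asks for.
    static final class Client {
        final String education;
        final long balance;
        final String job;
        final String marital;
        final String loan;

        Client(String education, long balance, String job, String marital, String loan) {
            this.education = education;
            this.balance = balance;
            this.job = job;
            this.marital = marital;
            this.loan = loan;
        }

        String toLine() {
            return education + ", " + balance + ", " + job + ", " + marital + ", " + loan;
        }
    }

    // Composite-key order: education ascending, then balance descending.
    static final Comparator<Client> SHUFFLE_ORDER =
            Comparator.comparing((Client c) -> c.education)
                    .thenComparing(Comparator.comparingLong((Client c) -> c.balance).reversed());

    // Sorts a copy of the input and formats one output line per client.
    static List<String> report(List<Client> clients) {
        List<Client> sorted = new ArrayList<>(clients);
        sorted.sort(SHUFFLE_ORDER);
        List<String> lines = new ArrayList<>();
        for (Client c : sorted) {
            lines.add(c.toLine());
        }
        return lines;
    }
}
```

In an actual job the same comparison would typically live in the key type, with records partitioned and grouped by education only, so that each reducer receives the records of one education category already ordered by balance and does not need to buffer them in memory.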