MongoDB Assignment Help: COMP5338 Advanced Data Models

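Help with a MongoDB assignment. The assignment leans more toward data analysis than toward MongoDB usage. The data set is large, and the task is simply to implement the required queries.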

Introduction

In this assignment, you will show that you can work with different NoSQL systems for data persistence and that you understand the strengths and weaknesses of each system with respect to certain workload features. You are asked to work with a music data set and a list of target queries. You will design a data schema for MongoDB, Neo4j and HBase respectively, based on the features of the data set and the given query workload. You will show that your design can support the target queries by loading the data into each system following your schema and running the queries against the data.

Data set

The data that you will use is the Last.fm (http://www.last.fm) data set released in HetRec 2011 (http://ir.ii.uam.es/hetrec2011/). You can view the details of the data set and download it from http://grouplens.org/datasets/hetrec-2011/.
The data set contains information about “social networking, tagging, and music artist listening information from a set of 2k users from Last.fm online music system”. The data is organized as relational tables and is stored in several text files, each corresponding to a table. All files are tab-separated. Unique IDs are assigned to artists, users and tags for easy referencing across files.
Basic information about artists is stored in the file artists.dat. Each line contains four columns: id, the name of the artist, a url pointing to the description of the artist, and a pictureURL pointing to a picture of the artist. The tag data is stored in another file, tags.dat, with only two columns: the tagID and the actual tagValue. The data set does not contain any personal information about users, hence there is no separate file for users. All other files in the data set contain information about relationships.
The file user_artists.dat stores the listening count per user per artist. The file user_friends.dat stores the friend relations between users. The tag assignments of artists by users are stored in two files. The only difference between the files is the timestamp format. The file user_taggedartists.dat stores the day, month and year in separate columns, while user_taggedartists-timestamps.dat stores the Unix timestamp in a single column. You only need to use one of these two files for tag information in this assignment.
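As a minimal sketch of reading one of these tab-separated files, the snippet below parses an in-memory sample with Python's csv module. It assumes each file starts with a header row naming the columns; the sample rows and the helper name read_tsv are made up for illustration and are not real Last.fm records.

```python
import csv
import io

# Hand-made sample in the tab-separated layout described above.
# Assumption: the first line is a header row; the rows are illustrative only.
SAMPLE_ARTISTS = (
    "id\tname\turl\tpictureURL\n"
    "1\tArtist A\thttp://example.org/a\thttp://example.org/a.jpg\n"
    "2\tArtist B\thttp://example.org/b\thttp://example.org/b.jpg\n"
)


def read_tsv(handle):
    """Return one dict per data line, keyed by the header row."""
    # QUOTE_NONE keeps quote characters inside artist names as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


artists = read_tsv(io.StringIO(SAMPLE_ARTISTS))
# Index by numeric id so rows from the relationship files can be joined.
artists_by_id = {int(row["id"]): row for row in artists}
print(artists_by_id[2]["name"])
```

For a real file, the same function could be given an open file handle in place of the io.StringIO object.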

Target Queries

Simple queries

  • Given a user id, find all artists that the user’s friends listen to.
  • Given an artist name, find the 10 most recent tags that have been assigned to it.
  • Given an artist name, find the top 10 users based on their respective listening counts for this artist. Display both the user id and the listening count.
  • Given a user id, find the 10 most recent artists the user has assigned a tag to.
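To make the third simple query concrete, here is one possible MongoDB aggregation pipeline, written as a plain Python list so it can be inspected without a running server. It is a sketch under an assumed schema: a hypothetical artists collection whose documents embed a listeners array of {userID, weight} entries. Neither the collection name nor the field names come from the assignment.

```python
def top_listeners_pipeline(artist_name, limit=10):
    """Build an aggregation pipeline for the top listeners of one artist.

    Assumed (hypothetical) document shape in an `artists` collection:
    {"name": ..., "listeners": [{"userID": ..., "weight": ...}, ...]}
    """
    return [
        # Select the artist by name; an index on `name` would serve this stage.
        {"$match": {"name": artist_name}},
        # Emit one document per embedded listener entry.
        {"$unwind": "$listeners"},
        # Highest listening count first.
        {"$sort": {"listeners.weight": -1}},
        {"$limit": limit},
        # Keep only the two values the query asks to display.
        {"$project": {
            "_id": 0,
            "userID": "$listeners.userID",
            "listeningCount": "$listeners.weight",
        }},
    ]


for stage in top_listeners_pipeline("Artist A"):
    print(stage)
```

With a driver such as PyMongo, the returned list could be passed to the collection's aggregate method. A different schema, for example one document per user, would need a different pipeline.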

Complex queries

  • Find the top 5 artists ranked by the number of users listening to them.
  • Given an artist name, find the top 20 tags assigned to it. The tags are ranked by the number of times each has been assigned to this artist.
  • Given a user id, find the top 5 artists listened to by the user’s friends but not by the user. Artists are ranked by the sum of the friends’ listening counts for the artist.
  • Given an artist name, find the top 5 similar artists. Here the similarity between a pair of artists is defined as the number of unique users that have listened to both. The higher the number, the more similar the two artists are.
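The similarity definition in the last query can be checked against a small in-memory reference before it is expressed in a database query language. The sketch below is plain Python over made-up (user, artist) pairs; the function name similar_artists and the sample data are illustrative assumptions, not part of the assignment.

```python
from collections import Counter, defaultdict

# Made-up (userID, artist name) listening pairs for illustration.
LISTENS = [
    (1, "A"), (1, "B"), (1, "C"),
    (2, "A"), (2, "B"),
    (3, "A"),
    (4, "B"), (4, "C"),
]


def similar_artists(listens, target, k=5):
    """Rank other artists by the number of unique users shared with `target`."""
    listeners = defaultdict(set)
    for user, artist in listens:
        listeners[artist].add(user)  # sets keep each user once per artist

    target_users = listeners.get(target, set())
    shared = Counter()
    for artist, users in listeners.items():
        if artist == target:
            continue
        overlap = len(users & target_users)
        if overlap:
            shared[artist] = overlap
    return shared.most_common(k)


print(similar_artists(LISTENS, "A"))  # [('B', 2), ('C', 1)]
```

A reference like this is only practical on a subset of the data, but it gives expected answers to compare with the MongoDB and Neo4j results.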

Tasks

Your tasks include the following.

Schema Design for each system

You should provide three schema designs, one per system. For MongoDB and Neo4j, your schema should support both the simple queries and the complex queries. For HBase, your schema only needs to support the simple queries. For each schema, make sure you utilize features provided by the storage system such as indexing, aggregation, ordering and filtering. Please note that your schema may deviate a lot from the relational structure of the original data set. You can discard IDs if you find they are not useful. You can duplicate data if you find that it helps with certain queries. You will not get any marks if you present a schema that is an exact copy of the relational structure in the original data set.
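As an illustration of deviating from the relational layout, the sketch below folds listening rows and friend rows into one document per user and duplicates the artist name inside each entry. The document shape, the field names and the function name build_user_documents are assumptions chosen for the example; this is one possible design, not the expected answer.

```python
def build_user_documents(artist_names, user_artists, user_friends):
    """Fold relational rows into one denormalized document per user."""
    docs = {}

    def doc_for(user_id):
        # Create the user's document the first time the id is seen.
        return docs.setdefault(
            user_id, {"_id": user_id, "friends": [], "artists": []}
        )

    for user_id, artist_id, weight in user_artists:
        doc_for(user_id)["artists"].append({
            "artistID": artist_id,
            # Duplicated from the artist table so reads need no join.
            "name": artist_names[artist_id],
            "weight": weight,
        })
    for user_id, friend_id in user_friends:
        doc_for(user_id)["friends"].append(friend_id)
    return list(docs.values())


# Made-up sample rows for illustration.
artist_names = {10: "Artist A", 11: "Artist B"}
user_artists = [(1, 10, 120), (1, 11, 30), (2, 10, 45)]
user_friends = [(1, 2), (2, 1)]

for doc in build_user_documents(artist_names, user_artists, user_friends):
    print(doc)
```

The duplicated name field is the kind of redundancy the report would need to highlight and justify, since it trades extra storage and update cost for simpler reads.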

Query Design and Implementation

For MongoDB, load the complete data into MongoDB and set up proper indexes that will be used by the target queries. Design and implement all target queries. You may implement a query as a shell command, as a combination of JavaScript and shell commands, or as a Python/Java program. For each query (or sub-query), report execution statistics such as which index is used and how many documents are examined to answer the query.
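The statistics requested for MongoDB can be read from the document returned by explain, for example db.collection.find(...).explain("executionStats") in the shell. The sketch below pulls out the index names and examined counts from such a document. The sample document is hand-written for illustration and is not real server output; the exact shape of explain output varies by server version and differs for aggregation pipelines, so the helper name summarize_explain and its assumptions may need adjusting.

```python
def summarize_explain(explain):
    """Extract index names and examined counts from an explain document."""

    def index_names(node):
        # Walk the whole plan tree; index scans appear as IXSCAN stages.
        if isinstance(node, dict):
            if node.get("stage") == "IXSCAN" and "indexName" in node:
                yield node["indexName"]
            for value in node.values():
                yield from index_names(value)
        elif isinstance(node, list):
            for item in node:
                yield from index_names(item)

    plan = explain.get("queryPlanner", {}).get("winningPlan", {})
    stats = explain.get("executionStats", {})
    return {
        "indexes": sorted(set(index_names(plan))),
        "docsExamined": stats.get("totalDocsExamined"),
        "keysExamined": stats.get("totalKeysExamined"),
        "returned": stats.get("nReturned"),
    }


# Hand-written sample in the general shape of explain output (not real output).
SAMPLE_EXPLAIN = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "name_1"},
        }
    },
    "executionStats": {
        "nReturned": 1,
        "totalKeysExamined": 1,
        "totalDocsExamined": 1,
    },
}

print(summarize_explain(SAMPLE_EXPLAIN))
```

An empty indexes list together with a large docsExamined value would suggest that the query fell back to a collection scan.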
For Neo4j, load the complete data into Neo4j and set up proper indexes that will be used by the target queries. Design and implement all target queries. You may implement a query as a Cypher command or as a Python/Java program. For each query, report execution statistics such as which index is used, how many records are examined, and whether or not a full scan is involved.
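For Neo4j, one way to obtain such statistics is to prefix a Cypher query with PROFILE, which runs the query and reports the operators used along with row and database-hit counts. The sketch below keeps the first simple query as a parameterized Cypher string in Python. The labels User and Artist, the relationship types FRIEND and LISTENS, and the property names are assumptions about a possible graph model, not something the assignment prescribes.

```python
# Cypher for "artists that a user's friends listen to", under an assumed model:
# (:User {userID})-[:FRIEND]-(:User) and (:User)-[:LISTENS]->(:Artist {name}).
FRIENDS_ARTISTS = """
MATCH (u:User {userID: $userID})-[:FRIEND]-(:User)-[:LISTENS]->(a:Artist)
RETURN DISTINCT a.name AS artist
"""


def with_profile(query):
    """Prefix a Cypher query with PROFILE to request execution statistics."""
    return "PROFILE " + query.strip()


print(with_profile(FRIENDS_ARTISTS))
```

The printed text could be pasted into the Neo4j browser or sent through a driver with a userID parameter. Whether the plan starts from an index seek or a label scan depends on the indexes created for the chosen model.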
For HBase, load a small subset of the data that is sufficient to demonstrate the simple queries. Design and implement only the simple queries. You may implement a query as a shell command, as a combination of Ruby script and shell commands, or as a Python/Java program. You can use filters as well. For each query, describe the number of rows, or the subset of columns, that are examined. In particular, highlight whether a full table scan is involved in answering the query.
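Because HBase keeps rows sorted by row key in byte order, a "most recent 10" query can avoid a full table scan if the key places a reversed timestamp after the lookup id. The sketch below shows one such composite key; the field order, the fixed widths and the "|" separator are design assumptions for illustration, and tag_row_key is a hypothetical helper.

```python
MAX_LONG = 2**63 - 1  # largest signed 64-bit value, 19 decimal digits


def tag_row_key(artist_id, timestamp, user_id, tag_id):
    """Build a row key that sorts an artist's tag assignments newest first."""
    # Subtracting from a constant makes later timestamps sort earlier.
    reversed_ts = MAX_LONG - timestamp
    # Zero-padded fixed widths make byte order match numeric order.
    key = f"{artist_id:08d}|{reversed_ts:019d}|{user_id:08d}|{tag_id:08d}"
    return key.encode("ascii")


# Three assignments to the same artist, inserted out of time order.
keys = [tag_row_key(7, ts, 1, 1) for ts in (1000, 3000, 2000)]
for key in sorted(keys):  # sorted() mimics HBase's row ordering
    print(key.decode("ascii"))
```

With keys like these, a scan restricted to the artist's key prefix and limited to 10 rows would return the newest assignments first. The fourth simple query would need a second table, or a second key layout, that leads with the user id instead.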

Deliverable and Submission Guideline

This is a group project. Each group can have up to 3 students and needs to produce the following.

A Written Report

The report should contain five sections. The first section is a brief introduction to the project. Sections two to four should each cover one system. Section five should provide a comparison and summary.
There are no marks for section one. It is included to make the report complete, so please keep it very short.
Sections two to four should each contain the following two subsections:

Schema Design

In this section, describe the schema with respect to the particular system. Your description should include information at the “table” and “column” level as well as possible primary keys/row keys and secondary indexes. You should show sample data based on the schema. For instance, you may show sample documents of each MongoDB collection, a sample property graph involving all node types and relationship types for Neo4j, and a sample row in each HBase table. If certain data are duplicated in different collections/tables, highlight the duplication and briefly justify your decision.

Query Design and Execution

In this section, describe the implementation of each target query. You may include the entire shell query command or show pseudocode. You should also run sample queries and describe the execution statistics for each sample query as described in the Tasks section.
In section five, compare the three systems with respect to ease of use, query performance and schema differences. You can also describe problems encountered in schema design or query design.
Submit a hard copy of your report together with a signed group assignment sheet in the Week 10 lab.

System Demo

Each group will demo their implementation in the Week 10 lab. The required data needs to be loaded into the respective systems for the demo. You can run the demo on your own machine, on a lab machine or on a cloud server. The tutor will ask you to run a few randomly selected queries to test the implementation.