The goal of the assignment work in this course is to gain practical experience that helps you to:
- explain data structures and algorithms used to efficiently store and retrieve information in database systems;
- investigate alternative approaches for design of database systems (both relational and non-relational); and
- design and implement (using Java) file structures and indexing schemes.
The aim of the first assignment is to start using the AWS linux instance assigned to you and the open data provided from a public source to complete the following tasks:
- store and retrieve data in an Apache Derby relational database that you create,
- store and retrieve data in a MongoDB database that you create,
- store and retrieve data in a heap file that you implement using Java.
In the second assignment, you will extend your solution developed in this first assignment and conduct further timing experiments on your AWS linux instance with indexes using each of these approaches.
Please read ALL the following requirements carefully before you start.
The data that you are going to use in this assignment is open data from the City of Melbourne about On-street Car Parking Sensor Data-2017 Dataset. As well as the actual data there is a help file available that describes the data set.
This section contains information about the general requirements that your assignment must meet and how to get help.
- Your database and Java programs must be set up and run on the AWS linux machine assigned to you for this course.
- Your database must be set up on your AWS linux instance (as set up following the instructions in the initial practical classes in the laboratories).
- You must implement your program in Java. Your program must be well written, using good coding style and including appropriate use of comments (that clearly identify the changes you are making to the code). Your markers will look at your source code. Coding style will form part of the assessment of this assignment.
- If your marker cannot compile your programs, you risk yielding zero marks for the coding component of your assignment.
- Your Java program may be developed on any machine, but must compile and run your AWS linux instance.
- You must use git as you develop your code (wherever you do the development). As you work on the assignment you should commit your changes to git regularly (for example, hourly or each time you rebuild) as the log may be used as evidence of your progress.
- Paths must not be hard-coded.
- Diagnostic messages must be output to stderr.
- Parts of this assignment will ask you to analyse your results, and to write your conclusions in a report. The report MUST be a PDF file. Submissions that do not meet this requirement will NOT be marked.
- Your report must be well-written. Poorly written or hard to read reports will receive substantially lower marks. Your report should be appropriate to submit in a professional environment (such as including in a portfolio of your work for a prospective employer).
- All sections of this assignment are expected to show that you have thought about the problem. The most basic structuring of data and analysis will get the most basic mark.
- Canvas for COSC2406/COSC2407 Database Systems contains a discussion board for this assignment allowing a forum for students to ask questions (see below) and contribute to discussion about aspects of the assignment. If there are announcements about the assignment (including if there are any revisions to the assignment specification) these will also be made via announcements on Canvas. You are expected to check these on a daily basis.
- If you have any questions about the assignment (for example to clarify requirements):
- (a) Please first check this assignment specification, as well the announcements and the discussion board on canvas to see if it has already been answered.
- (b) If it has NOT already been answered and does NOT include your own code (including database queries), please post your question on the discussion board.
- (c) Otherwise, if your question involves your own code (or is about your personal situation) then discuss it in your practical class with the lab instructor or contact the lecturer (or your tutor) via email.
In this section, you will be asked to carry out several tasks using your AWS linux instance to create and query a database (using both MongoDB and Derby), with the data in the provided file and to analyse your results.
Create a file called report.pdf (various software including word processors can export as PDF). Use this file to report on the following tasks. Each task should be reported under a separate heading with the task name and description, for example for the first task use the heading: Task 1: Derby.
You are required to load the data into Derby. In your report:
- explain how have you chosen to structure the Derby relational database and give reasons.
- provide details of the time to load the data into Derby. You need to analyse the data and consider appropriate ways to structure the data and then using any scripting, programming or other tools to format the data accordingly.
- Postgraduate students only: What alternative way or ways could you have organised the data when storing in Derby, and what advantages or disadvantages would these alternative designs have?
You are required to load the data into MongoDB. In your report:
- explain how have you chosen to structure the data inserted in MongoDB
- provide details of the time taken to load the data (The mongoimport is one utility will provide such information). Please note that a naive import into a flat structure in Mongodb will not accrue you a great mark. You need to analyse the data and consider appropriate ways to structure the data and then using any scripting, programming or other tools to format the data accordingly.
- Postgraduate students only: What alternative way or ways could you have organised the data when storing in MongoDB, and what advantages or disadvantages would these alternative designs have?
Set up a git repository for your code, and complete the following programming tasks using Java on the AWS linux instance assigned to you.
The source records are variable-length. Your heap file may hold fixed-length records (you will need to choose appropriate maximum lengths for each field). However, you may choose to implement variable lengths for some fields, especially if you run out of disc space or secondary memory!
All attributes with Int type must be stored in 4 bytes of binary, e.g. if the value of I D is equal to 70, it must be stored as 70 (in decimal) or 46 (in hexadecimal; in Java: 0x46). It must not be stored as the string “70”, occupying two bytes. Your heap file is therefore a binary file.
For simplicity, the heap file does not need a header (containing things like the number of records in the file or a free space list), though you might need to keep a count of records in each page. The file should be packed, i.e. there is no gap between records, but there will need to be gaps at the end of each page.
The executable name of your program to build a heap file must be dbload and should be executed using the command:
java dbload -p pagesize datafile
The output file will be heap.pagesize where your converted binary data is written as a heap.
Your program should write out one “page” of the file at a time. For example, with a pagesize of 4096, you would write out a page of 4096 bytes possibly containing multiple records of data to disk at a time. You are not required to implement spanning of records across multiple pages.
Your dbload program must also output the following to stdout, the number of records loaded, number of pages used and the number of milliseconds to create the heap file.
Write a program to perform text query search operations on the field “DA NAME” heap file (without an index) produced by your dbload program in Section 5. Note that “DA NAME” is a new field you created by considering “DeviceId” and “ArrivalTime” as strings and connecting them together.
The executable name of your program to build a heap file must be dbquery and should be executed using the command:
java dbquery text pagesize
Your program should read in the file, one “page” at a time. For example, if the pagesize parameter is 4096, your program should read in the records in the first page in heap.4096 from disk. These can then be scanned, in-memory, for a match (the string in text parameter is contained in the field “DA NAME”). If a match is found, print the matching record to stdout, there may be multiple answers. Then read in the next page of records from the file. The process should continue until there are no more records in the file to process.
In addition, the program must always output the total time taken to do all the search operations in milliseconds to stdout.
Before you submit anything, read through the assignment specifications again carefully, especially Section 4. Check that you have followed ALL instructions. Also check that you have attempted all parts of all tasks in Section 5.