CS186 Join Algorithms and Query Optimization


Background

This project builds on top of the code and functionality that you should be familiar with from Project 1. In this project, you will be implementing several of the join algorithms covered in class as well as a cost-based query optimizer. We have released the staff solutions for Project 1 to a private repository (the link can be found on Piazza), and you are free to integrate those into your codebase. If you feel confident about your own solutions to Project 1, you may stick with those as well. Regardless, a correct, working implementation of Project 1 is necessary in order to complete this project.

This project uses the same function-oriented query interface we implemented in Project 1, but with a few differences. The original implementation was inefficient because at every stage of the query processing pipeline we fully materialized every record in memory. This meant that the query operators executed in sequential batches, and that queries would fail if the data did not fit in memory. To address this, we have implemented an Iterator interface directly in each query operator, and you'll be adding some more iterators for the join operators in this project!

Additionally, the original query plan would always execute a naive plan without any optimizations. We've included a separate execution method that leverages a query optimizer, which you will be filling in, to generate better plans according to the System R cost-based approach we've covered in class.

Getting Started

As usual, you can get the assignment from the GitHub course repository (assuming you've set things up as we asked you to in Homework 0) by running git pull course master. This will get you all of the starter code we've provided, this project spec, and all of the tests that we've written for you.

We've also generated all of the API documentation for the code as webpages, viewable here.

Setup

All of the Java, Maven, and IntelliJ setup from Project 1 is still applicable for this project. Assuming that you were able to complete Project 1, you should not need to do any additional setup for this project. Refer to the Project 1 spec for setup details.

Starter Code

Package Overview

There's a bunch of code that we've provided for you, both code carried over from Project 1 and some new additions for this project. We strongly suggest that you spend some time looking through it before starting to write your own code. The Project 1 spec contains descriptions for each package of the codebase, but the relevant parts for this project are all in the query package. Briefly, it consists of a QueryPlan class that holds the query optimizer, and several classes that all inherit from QueryOperator, with each class representing a different operator in the plan tree.

All of the starter code descriptions from Project 1 are still valid. We have modified some classes and added additional functionality, but for the most part the code is the same except where we have specifically noted differences here in this document.

Query Generation

The QueryPlan interface allows you to generate SQL-like queries without having to parse actual SQL queries. We use the same interface from Project 1 to create queries, but whereas before you would call the naive QueryPlan#execute method to execute your query, there now exists a QueryPlan#executeOptimal method, which will generate an optimal plan and then execute your query.

If you would like to run queries on the database, you can create a new QueryPlan by calling Transaction#query and passing the name of the base table for the query. You can then call the QueryPlan#where, QueryPlan#join, etc. methods to generate as simple or as complex a query as you would like. Finally, call QueryPlan#executeOptimal to run the query optimizer, execute the query, and get back an Iterator<Record> over the results. You can also use the Transaction#queryAs methods to alias tables.

As a quick example, a simple query might look something like this:
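
The snippet below is only a sketch: the table and column names are made up, and the exact where/join signatures shown are assumptions, so check OptimalQueryPlanTest for the real usage.

    // Hypothetical example: the table/column names and the predicate constant are made up.
    // See OptimalQueryPlanTest for the exact method signatures used in the codebase.
    QueryPlan query = transaction.query("Students");                    // base table
    query.join("Enrollments", "Students.sid", "Enrollments.sid");       // explicit join condition
    query.where("Students.gpa", PredicateOperator.GREATER_THAN, 3.0f);  // selection predicate
    Iterator<Record> records = query.executeOptimal();                  // optimize, then run
    while (records.hasNext()) {
        System.out.println(records.next());
    }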

You can also examine the query plan tree generated by the query optimizer by printing the final operator of the query plan:
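
For example (continuing the sketch above; the accessor name here is an assumption, so mirror whatever the OptimalQueryPlan tests use):

    // Hypothetical accessor: prints the operator tree rooted at the final operator.
    QueryOperator finalOperator = query.getFinalOperator();
    System.out.println(finalOperator);
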
You can find more examples in the OptimalQueryPlanTest and OptimalQueryPlanJoinsTest files.

Your Assignment

Alright, now we can write some code! NOTE: Throughout this project, you’re more than welcome to add any and all helper methods you’d like to write. However, it is very important that you do not change any of the interfaces that we’ve given you.

It’s also a good idea to always check the course repo for updates on the project.

Part 1: Join Algorithms

The first part of the project involves completing the implementations of PNLJOperator, BNLJOperator, and GraceHashOperator, which correspond to page nested loop join, block nested loop join, and grace hash join, respectively. Simple nested loop join has already been implemented for you in SNLJOperator. It is recommended that you look through the code for SNLJOperator, and use this as an example of how to implement a join algorithm before starting the others.

As a brief overview of how the join algorithms in this project work, each join operator first reads in the records from the left input operator and the right input operator and stores those records in a left and right temporary table. The JoinOperator abstract class, which all the join algorithms extend, provides several methods for accessing record or page iterators from the left and right temporary tables along with other useful information. The SNLJIterator inside SNLJOperator reads records from the left and right operators in a nested loop fashion and joins the appropriate records.

Page nested loop join

For the first part of this project you’ll need to implement the PNLJIterator inside PNLJOperator. Similar to the SNLJIterator, this iterator should join records from the left and right source operators using a nested loop method. As discussed in class, the difference in this join algorithm compared to simple nested loop join is that you iterate over the right table once for every page in the left table instead of every record in the left table. Don’t forget that the first page of a page iterator may be a header page!
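
To make the record ordering concrete, here is a self-contained sketch of the page nested loop ordering over in-memory lists that stand in for pages. It only illustrates the loop structure; the iterator you actually write must stream pages through the JoinOperator helpers rather than materializing them.

    import java.util.List;

    // Self-contained sketch of the PNLJ loop structure over in-memory data.
    // Each inner List<int[]> stands in for one page of records, joined on column 0.
    class PNLJSketch {
        static void pageNestedLoopJoin(List<List<int[]>> leftPages, List<List<int[]>> rightPages) {
            for (List<int[]> leftPage : leftPages) {             // for each page of the left table
                for (List<int[]> rightPage : rightPages) {       // scan the right table once per left page
                    for (int[] leftRecord : leftPage) {
                        for (int[] rightRecord : rightPage) {
                            if (leftRecord[0] == rightRecord[0]) {
                                System.out.println("match: " + leftRecord[0]);
                            }
                        }
                    }
                }
            }
        }
    }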

We strongly recommend you write helper methods that retrieve the next left record and the next right record when you need to get them. The class has a lot of member variables already given, which serve as a guide for what you might need. You may also add any other member variables as you see fit. The order in which records are yielded is part of the tests, so make sure you are following the algorithms from lecture correctly.

It would also be helpful to familiarize yourself with the implementation of SNLJIterator and TableIterator before starting to write any code for this section.

Block nested loop join

Once you have implemented page nested loop join, you should be able to extend this logic to implement block nested loop join. You'll need to implement the BNLJIterator inside BNLJOperator. The algorithm should iterate over the right table once for every block of B-2 pages in the left table instead of every page in the left table. B represents the total number of memory pages available to the query, and is accessible from the numBuffers field.
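
The only change from the page-based sketch above is the granularity of the outer loop. Again, this is just an in-memory illustration of the loop structure, not the real iterator code, and it assumes numBuffers is at least 3.

    import java.util.List;

    // Self-contained sketch of BNLJ over in-memory data: the outer unit is a block of
    // (B - 2) left pages, so the right table is scanned once per block, not once per page.
    // The other two buffers hold the current right page and the output.
    class BNLJSketch {
        static void blockNestedLoopJoin(List<List<int[]>> leftPages,
                                        List<List<int[]>> rightPages,
                                        int numBuffers) {
            int blockSize = numBuffers - 2;
            for (int start = 0; start < leftPages.size(); start += blockSize) {
                List<List<int[]>> block =
                        leftPages.subList(start, Math.min(start + blockSize, leftPages.size()));
                for (List<int[]> rightPage : rightPages) {       // one right-table scan per block
                    for (List<int[]> leftPage : block) {
                        for (int[] leftRecord : leftPage) {
                            for (int[] rightRecord : rightPage) {
                                if (leftRecord[0] == rightRecord[0]) {   // join on column 0
                                    System.out.println("match: " + leftRecord[0]);
                                }
                            }
                        }
                    }
                }
            }
        }
    }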

Grace hash join

For the last chunk of this section, you'll need to implement grace hash join in the GraceHashIterator inside GraceHashOperator. For grace hash join, you can make the simplifying assumption that we don't need to do recursive partitioning, i.e., you will only partition the inputs once, not multiple times. In the first phase of the algorithm, each record is hashed into its corresponding partition by being added to a temporary table that represents that partition. For an example of this, you can look inside the GroupByOperator class. In the second phase, you can build an in-memory hash table using Java's built-in HashMap class.
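
Here is a self-contained sketch of the two phases over plain integer keys; in the real operator the partitions are temporary tables and the records come from the left and right source iterators.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Self-contained sketch of grace hash join with a single partitioning pass.
    class GraceHashSketch {
        static void graceHashJoin(List<Integer> leftKeys, List<Integer> rightKeys, int numPartitions) {
            // Phase 1: hash each record into its partition (a temporary table in the real operator).
            List<List<Integer>> leftParts = newPartitions(numPartitions);
            List<List<Integer>> rightParts = newPartitions(numPartitions);
            for (int key : leftKeys) leftParts.get(Math.floorMod(key, numPartitions)).add(key);
            for (int key : rightKeys) rightParts.get(Math.floorMod(key, numPartitions)).add(key);

            // Phase 2: per partition, build an in-memory HashMap on the left side and probe with the right.
            for (int i = 0; i < numPartitions; i++) {
                Map<Integer, List<Integer>> table = new HashMap<>();
                for (int key : leftParts.get(i)) {
                    table.computeIfAbsent(key, k -> new ArrayList<>()).add(key);
                }
                for (int key : rightParts.get(i)) {
                    for (int match : table.getOrDefault(key, Collections.emptyList())) {
                        System.out.println("match: " + match);
                    }
                }
            }
        }

        private static List<List<Integer>> newPartitions(int n) {
            List<List<Integer>> parts = new ArrayList<>();
            for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
            return parts;
        }
    }

Math.floorMod is used here only to keep the partition index non-negative for arbitrary hash values; how you partition in the real operator is up to you, as long as matching keys land in matching partitions.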

Testing

Once you've implemented all of these methods, you should be passing all of the tests in JoinOperatorTest. We strongly recommend you start writing tests once you've wrapped your head around the code, to try to catch some of the edge cases that you might have missed. It's generally a good idea to write your own tests as you go along, because the given tests don't cover all edge cases.

Part 2: Query Optimization

The second part of the project is focused on the query optimizer. We have provided some structure to help you get started. The QueryPlan#executeOptimal method runs the query optimizer and has been implemented already. This method calls other methods which you’ll need to implement in order to make sure the optimizer works correctly.

As an overview of what this method does, it first searches for the lowest-cost way to access each single table in the query. Then, using the dynamic programming search algorithm covered in lecture, it tries to compute a join between a set of tables and a new table whenever there exists a join condition between them. The lowest-cost join of all the tables is found, and then group-by and select operators are applied on top if those are specified in the query. The method returns an iterator over the final operator created. For an example of naive query plan generation, look at the code inside QueryPlan#execute(). Note that the query optimizer code will look quite different from the naive code, but the naive code still serves as a good example of how to compose a query plan tree.

Cost estimation

The first part of building the query optimizer is ensuring that each query operator has the appropriate IO cost estimates. In order to estimate IO costs for each query operator, you will need the table statistics for any input operators. This information is accessible from the QueryOperator#getStats method. The TableStats object returned represents estimated statistics of the operator’s output, including information such as number of tuples and number of pages in the output among others. These statistics are generated whenever a QueryOperator is constructed.

All of the logic for estimating statistics has been fully implemented except for the calculation of the reduction factor of a WHERE predicate. You must implement the IntHistogram#computeReductionFactor method which will return a reduction factor based on the type of predicate given. The reduction factor calculations should be the same as those that were taught in class.
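
For reference only, the classic selectivity estimates have roughly the shape sketched below; the formulas taught in class are what you should actually implement, so treat these simplified expressions as assumptions rather than the required answer.

    // Rough sketch of textbook reduction factors; the formulas from lecture take precedence.
    class ReductionFactorSketch {
        // column = value: roughly 1 / (number of distinct values)
        static float equalsFactor(int numDistinctValues) {
            return 1.0f / numDistinctValues;
        }

        // column > value: roughly the fraction of the column's value range above `value`
        static float greaterThanFactor(int min, int max, int value) {
            if (value <= min) return 1.0f;
            if (value >= max) return 0.0f;
            return (max - value) / (float) (max - min);
        }

        // column != value: complement of the equality estimate
        static float notEqualsFactor(int numDistinctValues) {
            return 1.0f - equalsFactor(numDistinctValues);
        }
    }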

Each type of QueryOperator has a different estimateIOCost method which handles IO cost estimation. You will be implementing this method in a few of the operators. This method should estimate the IO cost of executing a query plan rooted at that query operator. It is executed whenever a QueryOperator is constructed, and afterwards the cost of an operator can be accessed from the QueryOperator#getIOCost method.

Several operators already have their estimateIOCost methods implemented. In this project, you are only responsible for implementing this method in IndexScanOperator, SNLJOperator, PNLJOperator, BNLJOperator, and GraceHashOperator. For the index scan cost, assume an unclustered index is used. The methods in Transaction and TableStats#getReductionFactor will be useful for implementing the index scan cost. For the grace hash join cost, assume there is only one pass of partitioning.
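
As a sanity check while filling in estimateIOCost, the sketch below writes out the standard lecture cost formulas for the four join operators, in terms of page counts, record counts, and B = numBuffers. Double-check them against your own notes before relying on them.

    // Standard lecture cost formulas (sanity-check against your notes).
    // leftPages/rightPages are page counts, leftRecords is a record count, numBuffers is B.
    class JoinCostSketch {
        static int snlj(int leftPages, int leftRecords, int rightPages) {
            return leftPages + leftRecords * rightPages;             // [R] + |R| * [S]
        }

        static int pnlj(int leftPages, int rightPages) {
            return leftPages + leftPages * rightPages;               // [R] + [R] * [S]
        }

        static int bnlj(int leftPages, int rightPages, int numBuffers) {
            int blocks = (int) Math.ceil(leftPages / (double) (numBuffers - 2));
            return leftPages + blocks * rightPages;                  // [R] + ceil([R] / (B - 2)) * [S]
        }

        static int graceHash(int leftPages, int rightPages) {
            return 3 * (leftPages + rightPages);                     // one partitioning pass: 3 * ([R] + [S])
        }
    }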

After implementing all the methods up to this point, you should now have everything you need to start building the search algorithm.

Single table access selection (Pass 1)

The first part of the search algorithm involves finding the lowest cost plans for accessing each single table in the query. You will be implementing this functionality in QueryPlan#minCostSingleAccess. This method takes in a single table name and should first calculate the estimated IO cost of performing a sequential scan. Then if there are any eligible indices that can be used to scan the table, it should calculate the estimated IO cost of performing such an index scan. The QueryPlan#getEligibleIndexColumns method can be used to determine whether there are any existing indices that can be used for this query. Out of all of these operators, keep track of the lowest cost operator.

Then, as part of a heuristic-based optimization we covered in class, you should push down any selections that correspond to the table. You should implement the push-down select functionality in QueryPlan#pushDownWheres, which will be called by the QueryPlan#minCostSingleAccess method.

The end result of this method should be a query operator that starts with either a SequentialScanOperator or an IndexScanOperator, followed by zero or more WhereOperators.
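
Structurally, Pass 1 boils down to picking the cheapest access path and then stacking the pushed-down selections on top of it. The sketch below shows that shape over placeholder integer costs; the real method works with QueryOperator instances and the helper methods named above.

    import java.util.List;

    // Shape of Pass 1 over placeholder costs. In the real method the candidates are
    // QueryOperators, the costs come from getIOCost(), and the chosen operator is then
    // wrapped by pushDownWheres in zero or more WhereOperators.
    class SingleAccessSketch {
        static int minCostSingleAccess(int seqScanCost, List<Integer> indexScanCosts) {
            int best = seqScanCost;                     // a sequential scan is always available
            for (int candidate : indexScanCosts) {      // one candidate per eligible index column
                best = Math.min(best, candidate);
            }
            return best;
        }
    }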

After implementing all the methods up to this point, you should be passing all of the OptimalQueryPlan tests that do not involve any joins.

Join selection (Pass i)

The next part of the search algorithm involves finding the lowest-cost join between each set of tables from the previous pass and a separate single table. You will be implementing this functionality in QueryPlan#minCostJoins. Remember to only consider left-deep plans and to avoid creating any Cartesian products. Use the list of explicit join conditions added through the QueryPlan#join method to identify potential joins. Once you've identified a potential join between a left set of tables and a right table, you should consider each type of join implementation in QueryPlan#minCostJoinType, which will be called by the QueryPlan#minCostJoins method.

The end result of this method should be a mapping from each set of tables to the join query operator with the lowest estimated cost for that set.
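
To make the bookkeeping concrete, here is a self-contained sketch of one DP pass over sets of table names and placeholder costs; in the real method each set maps to an actual join QueryOperator, and the candidate costs come from the cheapest operator returned by QueryPlan#minCostJoinType.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Self-contained sketch of one left-deep DP pass: extend each existing table set
    // with one new table, but only when an explicit join condition connects them.
    class JoinPassSketch {
        // joinConditions holds pairs of table names that have an explicit join predicate.
        static Map<Set<String>, Integer> minCostJoins(Map<Set<String>, Integer> prevPass,
                                                      Set<String> allTables,
                                                      Set<Set<String>> joinConditions) {
            Map<Set<String>, Integer> result = new HashMap<>();
            for (Map.Entry<Set<String>, Integer> entry : prevPass.entrySet()) {
                for (String right : allTables) {
                    if (entry.getKey().contains(right)) continue;            // already joined in
                    if (!hasJoinCondition(entry.getKey(), right, joinConditions)) continue; // no Cartesian products
                    int candidateCost = cheapestJoinCost(entry.getValue());  // minCostJoinType's best operator in the real code
                    Set<String> newSet = new HashSet<>(entry.getKey());
                    newSet.add(right);
                    result.merge(newSet, candidateCost, Math::min);          // keep only the cheapest plan per set
                }
            }
            return result;
        }

        static boolean hasJoinCondition(Set<String> leftTables, String right, Set<Set<String>> conditions) {
            for (String left : leftTables) {
                Set<String> pair = new HashSet<>();
                pair.add(left);
                pair.add(right);
                if (conditions.contains(pair)) return true;
            }
            return false;
        }

        static int cheapestJoinCost(int leftPlanCost) {
            // Placeholder: in the real code this is the IO cost of the cheapest join
            // operator that joins the left plan with the right table.
            return leftPlanCost + 1;
        }
    }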

Testing

If you’ve completed all the sections up to this point, you should now be passing all of the tests that we’ve given you. Again, we strongly encourage you to write tests as you go to try to catch any relevant bugs in your implementation.

Part 3: Testing

We can't emphasize enough how important it is to test your code! Like we said earlier, writing valid tests that test actual code (i.e., don't write assertEquals(true, true);, or we'll be mad at you) is worth 10% of your project grade. CS 186 is a design course, and validating the reasonability and the functionality of the code you've designed and implemented is very important. We suggest you try to find the trickiest edge cases you can and expose them with your tests.

Testing that your code does exactly what you expect in the simplest case is good for sanity’s sake, but it’s often not going to be where the bugs are.

To encourage you to try to find interesting edge cases, we're going to give you some extra credit. If you find edge cases that most other students didn't find, you can get up to 5 points of extra credit. Keep in mind that if some other people find the same edge cases as you, that doesn't mean you won't get extra credit, so don't treat this as a competition!

Part 4: Feedback

We’ve been working really hard to give you guys the best experience possible with these new projects. We’d love to improve on them and make sure we’re giving reasonable assignments that are helping you learn. In that vein, please fill out this Google Form to help us understand how we’re doing!