SQL代写:CMPT454 Query Optimization

回答SQL中关于Query Optimization的练习题。

Query Optimization

Question 1

Answer the questions that follow about the SQL query shown below.

1
2
3
4
5
6
7
SELECT p.lastname, p.income, p.birthdate, op.sin, op.opdate
FROM Patient p, Operation op, Doctor doc
WHERE (p.city = 'Vancouver' OR p.lastname = 'Smith')
AND (p.city = 'Vancouver' OR p.firstname = 'bob')
AND p.msp = op.msp AND op.sin = doc.sin
AND (op.description = 'appendectomy' OR op.description = 'amputation')
AND doc.doctorname <> 'Smith' AND gender = 'F')

Sketch a relational algebra expression tree that represents an equivalent logical plan to your answer to the SQL query shown below. Your query should be derived by using the following heuristics:

  • Replace Cartesian products with joins
  • Push selections as far down the tree as possible
  • Push projections as far down the tree as possible, except after selections, add attributes to the projections where necessary to achieve this
  • Group join operators

Question 2

You are to estimate the cost of the relational algebra query shown below.

cname,cemail,vname,vcity,ceo,pname,ptype,description,sdate (
vname,vcity,ceo (vcountry='Canada' employees > 500 (Vendor)) ⨝
pid,pname,description,ptype (Product) ⨝
cid,cname,cemail (ccity = 'Vancouver' ccity = 'Edmonton' (Customer)) ⨝ Sales)

Indexes

  • Secondary B+ tree on {vcountry, vcity}, of height 3 in Vendor
  • Primary B+ tree on {ccity, cstreet, cnumber}, of height 4 in Customer
  • Linear hash index on cid in Sales, no bucket consists of more than one block
  • Primary B+ tree index of on pid in Sales
  • For B+ tree indexes assume that the root node of the index is retained in main memory
  • Records of all tables, including intermediate relations. should not span more than one block

Relation Statistics

Customer Product Vendor Sales
T(R) 200,000 100,000 50,000 10,000,000
B(R) 25,000 10,000 5,000 500,000
V(R, cid) 200,000 200,000
V(C, ccity) 100
V(R, pid) 100,000 100,000
V(R, vname) 50,000 50,000
V(V, vcountry) 5
V(V, employees) 5,000

Attribute, Block and Main Memory Sizes

  • cid, pid, employees - 4 bytes
  • ptype, sdate - 8 bytes
  • cname, ccity, pname, vname, vcity, ceo, vcountry - 16 bytes
  • cemail - 32 bytes
  • description - 64 bytes
  • Block (and main memory frame) - 4,096 bytes
  • There are 600 main memory frames available for the query

  • a) Estimate the cost for performing the selection and projection on Vendor, do not include the cost for writing out the result (if any): vname,vcity,ceo ( vcountry=’Canada’ employees ] 500 (Vendor)). The minimum and maximum values for the employees attribute are 0 and 5,000.

  • b) State the size of the Vendor selection and projection result in records and bytes.
  • c) Estimate the cost for performing the selection and projection on Customer, do not include the cost for writing out the result (if any): cid,cname,cemail ( ccity = ‘Vancouver’ ccity = ‘Edmonton’ (Customer)). Histograms kept by the DB indicate that 25% of customers live in Vancouver and 15% live in Edmonton.
  • d) State the size of the Customer selection and projection result in records and bytes.
  • e) State the size of the Product projection result in records and bytes.
  • f) Determine the most efficient join order using the dynamic programming approach described in class. The metric you use to determine join order should be the least number of records. You should not consider the actual cost of performing the join in your calculation, or the cost of producing input for the join. You should only consider left-deep join trees.
    You can do this informally, and do not have to submit the tables shown in the class example (and Exercise 9). Don’t forget that the size in records of some of the tables is affected by previous operations.
    Your answer should consist of the following:
    • The order in which relations are to be joined. e.g. (((A B) C) D)
    • The total cost estimate (the sum of the intermediate relation sizes) for your chosen join order, broken down by join. e.g. Total cost: AB cost + ABC cost = total cost
    • The size in records of each of the three joined relations. e.g. AB, ABC and ABCD size
    • The size in blocks of each of the three joined relations. e.g. AB, ABC and ABCD size
  • g) For the first of the three joins specified in part (f) you are to calculate the cost in reads and writes of performing the join.
    Note that
  • Your cost estimate should only include the cost of performing the join and should not include the cost of previous operations or the cost of writing out the result.
    • Your cost estimate should not be the same as the result of part (f) - the value computed in (f) is used only to determine the join order.
    • You should only consider three join algorithms: block nested loop join, hash join and index nested loop join and must state the cost for each of these algorithms. If an index nested loop join is not possible (because there is no relevant index) state this.
  • h) For the second of the three joins specified in part (f) you are to calculate the cost in reads and writes of performing the join. Read the description in part (g) for more information.
  • i) For the third of the three joins specified in part (f) you are to calculate the cost in reads and writes of performing the join. Read the description in part (g) for more information.
  • j) Estimate the total cost of performing the query. This is mostly adding up the costs from previous parts of the question, but you should note whether or not operations can be pipelined. If not, you should include the cost of materializing the results of intermediate operations.

Assessment

The assignment is out of 46. Marks are assigned as follows:

  • Question 1 - 10
  • Question 2 - 36