SQL Assignment Help: CS420 Query Optimization

Use MySQL's EXPLAIN command to complete a query optimization analysis.

Introduction

In this assignment, you will carry out a number of exercises on the optimization of relational queries, using the MySQL query optimizer and the plan visualization command EXPLAIN. You need to read some of the MySQL documentation in Section 2 to be able to complete this assignment. Specifically, you need to be familiar with the EXPLAIN, SHOW PROFILE, and ANALYZE statements and with the INFORMATION_SCHEMA tables of MySQL (specific links are provided in the subsections).

This is a small hands-on project and should be done INDIVIDUALLY. Please read the entire assignment carefully before you begin.

Useful Documentation

Setup

You need to create a new database and create four tables in this database. We provide the definitions of the tables and the data. You can download the zip file with the table schemas and the data from Piazza.

Relation Schema

We will use four tables in this experiment: part, supplier, partsupp, and lineitem.

  1. part ( p_partkey integer, p_name varchar(55), p_mfgr character(25), p_brand character(10), p_type varchar(25), p_size integer, p_container character(10), p_retailprice numeric(20,2), p_comment varchar(23), primary key (p_partkey));
  2. supplier ( s_suppkey integer, s_name char(25), s_address varchar(40), s_nationkey integer, s_phone character(15), s_acctbal numeric(20,2), s_comment varchar(101), primary key (s_suppkey));
  3. partsupp (ps_partkey integer, ps_suppkey integer, ps_availqty integer, ps_supplycost numeric(20,2), ps_comment varchar(199), primary key(ps_partkey, ps_suppkey));
  4. lineitem ( l_orderkey integer, l_partkey integer, l_suppkey integer, l_linenumber integer, l_quantity numeric(20,2), l_extendedprice numeric(20,2), l_discount numeric(3,2), l_tax numeric(3,2), l_returnflag character(1), l_linestatus character(1), l_shipdate date, l_commitdate date, l_receiptdate date, l_shipinstruct character(25), l_shipmode character(10), l_comment varchar(44), primary key (l_orderkey, l_linenumber));

Steps

  1. Create a database (e.g. tpch) and create the tables. You can use the create table statements in the schema.sql file.
  2. Exit mysql and log in again using the following command:
    > mysql --local-infile -uroot -p
    Set the local_infile variable so that you can run the “load data” command later:
    > SET GLOBAL local_infile = true;
    Check that the variable “local_infile” is properly set:
    > SHOW VARIABLES LIKE 'local_infile';
    You should see that the value of “local_infile” is “ON”.
  3. Load the data using the “load data” command. For example, to load the part table, run the following (replace “path_in_your_laptop” with the actual path where you stored the data files):
    > use tpch; (your database name)
    > load data local infile 'path_in_your_laptop/part.tbl' into table part fields terminated by '|';
  4. See http://dev.mysql.com/doc/refman/8.0/en/load-data.html for more details.
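
The setup steps above can be sketched as a single session. This is only an illustration, not the contents of schema.sql: the part definition is copied from the Relation Schema section, “tpch” is the example database name, and “path_in_your_laptop” remains a placeholder.

```sql
-- Step 1: create the database and a table (definition copied from the
-- Relation Schema section; the provided schema.sql may differ).
CREATE DATABASE IF NOT EXISTS tpch;
USE tpch;

CREATE TABLE part (
  p_partkey     INTEGER,
  p_name        VARCHAR(55),
  p_mfgr        CHARACTER(25),
  p_brand       CHARACTER(10),
  p_type        VARCHAR(25),
  p_size        INTEGER,
  p_container   CHARACTER(10),
  p_retailprice NUMERIC(20,2),
  p_comment     VARCHAR(23),
  PRIMARY KEY (p_partkey)
);

-- Step 2: allow LOAD DATA LOCAL on the server side and verify the setting.
SET GLOBAL local_infile = true;
SHOW VARIABLES LIKE 'local_infile';

-- Step 3: load the pipe-delimited data file (placeholder path).
LOAD DATA LOCAL INFILE 'path_in_your_laptop/part.tbl'
  INTO TABLE part
  FIELDS TERMINATED BY '|';

-- Sanity check: compare this count with the number of lines in part.tbl.
SELECT COUNT(*) FROM part;
```

The remaining three tables are created and loaded the same way, each from its own data file.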

Exercises

In general, use EXPLAIN FORMAT=JSON to get the evaluation plan, because it gives much more information about the plan than the default output. For query execution times, either run the query in the terminal or use the profiling information.

To use profiling, you need to turn it on first using: SET profiling = 1;
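
As a rough sketch of this workflow, the statements below first ask for the estimated plan and then measure an actual run with profiling. The COUNT query is only a stand-in, and the Query_ID value 1 is an assumption; use the ID that SHOW PROFILES reports in your session.

```sql
-- Estimated plan and costs; the query itself is not executed.
EXPLAIN FORMAT=JSON
SELECT COUNT(*) FROM part;

-- Measured execution time via profiling (a per-session setting).
SET profiling = 1;
SELECT COUNT(*) FROM part;

-- List recent statements with their Query_ID and Duration.
SHOW PROFILES;

-- Per-stage breakdown for one statement; 1 is an assumed Query_ID.
SHOW PROFILE FOR QUERY 1;
```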

Statistics of the Tables

We will first examine the statistics for table lineitem. Answer the following questions.

  1. How many records are there actually in “lineitem”? What is the estimated value by the InnoDB optimizer? How do you find these values (command or SQL)?
  2. Is the value used by the query optimizer accurate? If not, why?
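
One possible way to look at both numbers is sketched below. It assumes the database is named “tpch” as in the setup example; other commands can expose the same statistics.

```sql
-- Exact row count: requires reading the table (or one of its indexes).
SELECT COUNT(*) FROM lineitem;

-- Row count stored in the table statistics.
SELECT TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'tpch'       -- assumed database name
  AND TABLE_NAME = 'lineitem';

-- Recompute the statistics, then re-run the previous query to compare.
ANALYZE TABLE lineitem;
```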

Index on Perfect Match Query

We will check how an index affects query optimization and performance.
Examine the following query:
SELECT * FROM lineitem WHERE L_TAX = 0.07;

  1. What is the estimated total cost of executing the best plan? What does the cost of a plan mean in MySQL?
  2. What is the estimated result cardinality for this plan? How does the query optimizer obtain this value? Is it a reasonable one?
  3. Which access method (access type) does the optimizer choose?
    Create a B-Tree index “ltax_idx” on the attribute “L_TAX”.
  4. Which access method does the optimizer consider to be the best now? Is the estimated result cardinality better now? Why?
  5. Compare the two plans (with and without the index). Explain briefly why the access method in question 4 is cheaper than the previous one without the index.
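
The before-and-after comparison in this exercise might be run as follows. This is a sketch: the IGNORE INDEX variant is an optional extra that shows the plan without the index again, without having to drop it. In the JSON output, the fields query_cost, access_type, and rows_examined_per_scan are the ones most relevant to the questions above.

```sql
-- Plan before the index exists.
EXPLAIN FORMAT=JSON
SELECT * FROM lineitem WHERE l_tax = 0.07;

-- InnoDB secondary indexes are B-trees by default.
CREATE INDEX ltax_idx ON lineitem (l_tax);

-- Plan after the index exists.
EXPLAIN FORMAT=JSON
SELECT * FROM lineitem WHERE l_tax = 0.07;

-- Plan with the index hidden from the optimizer, for comparison.
EXPLAIN FORMAT=JSON
SELECT * FROM lineitem IGNORE INDEX (ltax_idx) WHERE l_tax = 0.07;
```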

Index on Range Select

Consider the following query:
SELECT * FROM lineitem WHERE L_QUANTITY < 45;

  1. How many tuples does the query optimizer think will be returned? What is the estimated total cost?
  2. What is the access method used?
    Consider now the following query:
    SELECT * FROM lineitem WHERE L_QUANTITY < 3;
    Now create a B-Tree index “l_qty_idx” on the attribute “L_QUANTITY”. Run the EXPLAIN command for this query.
  3. What is the estimated total cost now? Does the estimated total cost make sense? Why? In what order would the tuples be returned by this plan?
  4. Explain why one of the access methods is more expensive than the other. [From now on, you may use the “Explain current statement” functionality in MySQL Workbench to check the visualized query plan]
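
A sketch of the statements for this exercise follows. The last statement, which uses FORCE INDEX, is not required by the assignment; it is included as one way to see the cost of an index-based plan for the less selective predicate, whichever plan the optimizer prefers on its own.

```sql
-- Plans before the index exists.
EXPLAIN FORMAT=JSON SELECT * FROM lineitem WHERE l_quantity < 45;
EXPLAIN FORMAT=JSON SELECT * FROM lineitem WHERE l_quantity < 3;

-- B-tree index on the range attribute.
CREATE INDEX l_qty_idx ON lineitem (l_quantity);

-- Plans after the index exists.
EXPLAIN FORMAT=JSON SELECT * FROM lineitem WHERE l_quantity < 3;
EXPLAIN FORMAT=JSON SELECT * FROM lineitem WHERE l_quantity < 45;

-- Plan when the optimizer is pushed toward the index.
EXPLAIN FORMAT=JSON
SELECT * FROM lineitem FORCE INDEX (l_qty_idx) WHERE l_quantity < 45;
```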

Join Algorithm

Consider the following query:

SELECT DISTINCT (s_name)
FROM supplier, partsupp
WHERE s_suppkey = ps_suppkey AND ps_availqty < 40;

  1. Write down the best plan estimated by the optimizer (in plan tree form). What is the estimated total cost?
  2. What is the join algorithm used in the plan? Explain how the system reads the two relations (what access method is used).
  3. According to the optimizer, how many tuples will be retrieved from partsupp per scan? How many from supplier per scan? Do you consider these numbers a good estimation? Why or why not?
  4. Can you add an index to improve the performance of the query? Which index will you create, and on which attribute? What is the new plan that is executed, and what is its cost?
  5. After creating the index, check the estimate of the number of tuples retrieved from partsupp. Is it a good estimate? If yes, why?
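
For the plan-tree question, one option is EXPLAIN FORMAT=TREE, which prints the plan as an indented tree of operators. It assumes MySQL 8.0.16 or later; on older servers, FORMAT=JSON or the Workbench visualization serves the same purpose. The final COUNT query is an added check, not part of the assignment, that gives the true number of qualifying partsupp rows to compare against the optimizer's estimate.

```sql
-- Plan as an operator tree (assumes MySQL 8.0.16+).
EXPLAIN FORMAT=TREE
SELECT DISTINCT (s_name)
FROM supplier, partsupp
WHERE s_suppkey = ps_suppkey AND ps_availqty < 40;

-- Same plan with cost and per-table row estimates.
EXPLAIN FORMAT=JSON
SELECT DISTINCT (s_name)
FROM supplier, partsupp
WHERE s_suppkey = ps_suppkey AND ps_availqty < 40;

-- Actual number of partsupp rows that satisfy the filter.
SELECT COUNT(*) FROM partsupp WHERE ps_availqty < 40;
```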

Three-Way Join

Consider the following query:

SELECT p_name, s_name
FROM part, supplier, partsupp
WHERE s_suppkey = ps_suppkey AND p_partkey = ps_partkey;

  1. Write down the best plan chosen by the optimizer. List the joins and access methods it uses, and the order in which the relations are joined. What is the estimated cost of this plan? What is the actual query execution time of this query?
  2. Can you add an index to improve the query performance? You are free to add any index that you like on any table. What is the new plan now? What is the new estimated cost? Is the new plan really better in practice than the previous one (does the query run faster now)?
  3. Modify the query by adding the condition AND ps_availqty < 10 to the WHERE clause. What are the differences between the current query plan and the one in 4.5.1 (question 1 of this section)? Why is the current query plan more efficient than the one in 4.5.1?
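
To compare estimated and actual behavior for the two versions of the query, one possibility is EXPLAIN ANALYZE. It executes the statement and reports measured time and row counts per plan step next to the estimates. This sketch assumes MySQL 8.0.18 or later; with older versions, the profiling approach described under Exercises gives the execution times instead.

```sql
-- Original three-way join: the query is executed and timed per plan step.
EXPLAIN ANALYZE
SELECT p_name, s_name
FROM part, supplier, partsupp
WHERE s_suppkey = ps_suppkey AND p_partkey = ps_partkey;

-- Modified query with the extra filter on partsupp.
EXPLAIN ANALYZE
SELECT p_name, s_name
FROM part, supplier, partsupp
WHERE s_suppkey = ps_suppkey AND p_partkey = ps_partkey
  AND ps_availqty < 10;
```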