WL#6510: Print cost and amount of data processed in EXPLAIN JSON
Status: Complete
To give the user a better view of query plans, we can enhance JSON EXPLAIN by printing the following info:
.) total query cost
.) cost per table
.) amount of data processed - (# of rows) * (record width) in G/M/K
This will make it easier for a user to see the difference between good and bad plans.

User Documentation
==================
http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-2.html
Functional requirements:
------------------------
EXPLAIN JSON should print cost for the following:
*) "query_cost" - total cost of a query block, whether a top-level query or a subquery.
*) "sort_cost" - cost of the first sorting operation (GROUP BY or ORDER BY) where and if filesort is used. The cost of a second sorting operation isn't printed, since the optimizer doesn't calculate it.
*) "read_cost" - cost of reading, for each table used in the query block.
*) "eval_cost" - cost of condition evaluation, for each table in the query block.
*) "prefix_cost" - cost of executing the prefix join in the query block.
*) "data_read_per_join" - estimated amount of data from the table processed during a single query block execution.
*) "rows_produced_per_join"/"rows_examined_per_scan" - estimated number of records from the table (for each table in the query block) produced/examined per single query block execution.
*) "used_columns" - list of columns of the table (for each table in the query block) used for either read or write in the query.
All new fields, except "used_columns", are wrapped in the "cost_info" JSON object (see HLS for an example).

Non-functional requirements:
----------------------------
Costs are printed with up to 2 digits after the decimal point, since this is the planner's precision. If the cost value is bigger than 1E14, it's printed in exponential notation as '%.15g', which gives e.g. 1.12345678901234E14. When the server provides a value for the 'last_query_cost' variable, it should be the same as the 'query_cost' value provided by EXPLAIN JSON.
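The precision rule above can be sketched as a small helper. This is illustrative only: `format_cost` is a hypothetical function and the exact branching is an assumption based on the stated rule, not the server's actual printing code.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Sketch of the stated rule: values below 1e14 get two digits after
// the decimal point; larger values fall back to '%.15g'.
// format_cost() is a hypothetical helper, not a server function.
void format_cost(double cost, char *buf, size_t len) {
  if (cost < 1e14)
    std::snprintf(buf, len, "%.2f", cost);
  else
    std::snprintf(buf, len, "%.15g", cost);
}
```

For example, 48.423 renders as "48.42", while 2e15 exceeds the threshold and renders in exponential notation as "2e+15".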
High-Level Specification
========================
The goal of this WL is to pick the low-hanging fruit of printing cost values we _already_ have, without introducing any new cost calculations. Since cost estimation is done only for SELECT-based queries, cost info isn't printed for single-table INSERT/UPDATE/DELETE.

New info to be printed:

1) "query_cost": total [sub]query cost per one execution. Wrapped in the "cost_info" node, child of the "query_block" node. For the top-level select it should be equal to the 'last_query_cost' status variable.

2) "sort_cost": cost of filesort, regardless of algorithm. Wrapped in the "cost_info" node, child of the "ordering_operation" node.

3) "read_cost": cost of retrieving data from the table, i.e. the access method cost.

4) "eval_cost": cost of condition evaluation.

5) "prefix_cost": cost of executing the prefix join, i.e. the cost of joining the tables of the query block from the first one up to (and including) the one for which the value is given.

All costs are printed with up to 2 digits after the decimal point, since this is the planner's precision. If the cost value is bigger than 1E14, it's printed in exponential notation as '%.15g', which gives e.g. 1.12345678901234E14.

6) "data_read_per_join": amount of data pumped through the handler interface per one [sub]query execution. Essentially this is record width * number of read records. 3-6 are wrapped in the "cost_info" node, which is a child of the "table" node.

7) "rows_produced_per_join": number of records returned by the prefix join. This value takes into account how many records are expected to be filtered out. "rows_examined_per_scan": ex-"rows"; the renaming is done only for the JSON format. It shows how many rows need to be processed prior to filtering. Both are placed inside the "table" node.

8) "used_columns": the list of the table's columns used; a child of the "table" node.

1-5 are the exact representation of our current cost model; they are calculated by the optimizer. 6 is the upper bound of how much data is processed by the server.
The real value would be lower, since not all fields from the tables are used in the query. 7 is the estimate the optimizer uses to calculate cost. 8 gives an idea of how much of the data processed by the server is actually used. Due to an EXPLAIN limitation, cost info isn't printed for INFORMATION_SCHEMA tables.

Examples:

*) explain format=json SELECT f1 as a,f21 as b FROM t1,t2 WHERE f1=f21 ORDER BY f21;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": 48.423,
      "data_read": "xxx"   << reason to have "cost_info", to be added (and discussed) in another WL
    },
    "ordering_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": 36,
        "data_read": "xxx"   << reason to have "cost_info", to be added (and discussed) in another WL
      },
      "nested_loop": [
        {
          "table": {
            "table_name": "t1",
            "access_type": "ALL",
            "rows_examined_per_scan": 6,
            "rows_produced_per_join": 6,
            "filtered": 100,
            "cost_info": {
              "read_cost": 2.010,
              "eval_cost": 1.200,
              "prefix_cost": 3.210,
              "data_read_query": "48"
            },
            "used_columns": [
              "f1"
            ]
          }
        },
        {
          "table": {
            "table_name": "t2",
            "access_type": "ALL",
            "rows_examined_per_scan": 6,
            "rows_produced_per_join": 36,
            "filtered": 100,
            "using_join_buffer": "Block Nested Loop",
            "cost_info": {
              "read_cost": 2.013,
              "eval_cost": 7.200,
              "prefix_cost": 12.424,
              "data_read_query": "576"
            },
            "used_columns": [
              "f21"
            ],
            "attached_condition": "(`test`.`t2`.`f21` = `test`.`t1`.`f1`)"
          }
        }
      ]
    }
  }
}

*) explain format=json SELECT f31 FROM ( SELECT f31,f32 FROM t3 WHERE f33 > 1) AS v3 WHERE f32 = 1;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": 1.199
    },
    "table": {
      "table_name": "v3",
      "access_type": "ref",
      "possible_keys": [
        ""
      ],
      "key": " ",
      "used_key_parts": [
        "f32"
      ],
      "key_length": "5",
      "ref": [
        "const"
      ],
      "rows_examined_per_scan": 1,
      "rows_produced_per_join": 1,
      "filtered": 100,
      "cost_info": {
        "read_cost": 1.000,
        "eval_cost": 0.200,
        "prefix_cost": 1.200,
        "data_read_query": "16"
      },
      "used_columns": [
        "f31",
        "f32"
      ],
      "materialized_from_subquery": {
        "using_temporary_table": true,
        "dependent": false,
        "cacheable": true,
        "query_block": {
          "select_id": 2,
          "cost_info": {
            "query_cost": 11.751
          },
          "table": {
            "table_name": "t3",
            "access_type": "ALL",
            "rows_examined_per_scan": 48,
            "rows_produced_per_join": 48,
            "filtered": 100,
            "cost_info": {
              "read_cost": 2.152,
              "eval_cost": 9.600,
              "prefix_cost": 11.752,
              "data_read_query": "768"
            },
            "used_columns": [
              "f31",
              "f32",
              "f33"
            ],
            "attached_condition": "(`test`.`t3`.`f33` > 1)"
          }
        }
      }
    }
  }
}
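As a back-of-the-envelope check of the data-read figures in these examples: t2 produces 36 rows at a 16-byte record width, i.e. 36 * 16 = 576 bytes. A sketch of that rows * record-width estimate, with the G/M/K rendering the description mentions (the unit rendering and both helper names are this sketch's assumptions; the examples above print raw bytes):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helpers, not server code.

// Estimate of "data_read_per_join": rows produced * record width.
unsigned long long data_read_estimate(double rows_produced,
                                      unsigned record_width) {
  return (unsigned long long)(rows_produced * record_width);
}

// Render a byte count in G/M/K units; small values stay in bytes.
std::string render_size(unsigned long long bytes) {
  char buf[32];
  if (bytes >= 1ULL << 30)
    std::snprintf(buf, sizeof buf, "%.1fG", bytes / double(1ULL << 30));
  else if (bytes >= 1ULL << 20)
    std::snprintf(buf, sizeof buf, "%.1fM", bytes / double(1ULL << 20));
  else if (bytes >= 1024)
    std::snprintf(buf, sizeof buf, "%.1fK", bytes / 1024.0);
  else
    std::snprintf(buf, sizeof buf, "%llu", bytes);
  return buf;
}
```

With these, `render_size(data_read_estimate(36, 16))` yields "576", matching t2's "data_read_query" value above; a multi-megabyte estimate would render as e.g. "3.0M".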
Low-Level Design
================
The WL is essentially about printing values we already have:

1) "query_cost": the value of JOIN::best_read. Printed by Explain_join::shallow_explain().

2) "sort_cost": Optimize_table_order::consider_plan() now saves a cost estimate for filesort into the new JOIN::sort_cost variable. Printed by Explain_join::shallow_explain().

3) "read_cost": the value of JOIN_TAB::position->read_time.

4) "eval_cost": JOIN_TAB::position->prefix_record_count * ROW_EVALUATE_COST.

5) "prefix_cost": the value of JOIN_TAB::position->prefix_cost.get_io_cost().

6) "data_read_per_join": JOIN_TAB::position->prefix_record_count * JOIN_TAB->table->s->rec_buff_length.

7) "rows_produced_per_join": the value of JOIN_TAB::position->prefix_record_count. "rows_examined_per_scan": the key is only renamed; its value is calculated as before.

3-7 are printed by Explain_join::explain_rows_and_filtered().

8) "used_columns": an additional loop in Explain_join::explain_extra() goes through all fields that have the appropriate bit set in either TABLE::read_set or TABLE::write_set and adds the field's name to the "used_columns" array.
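The "used_columns" loop of point 8 can be sketched outside the server with mock structures. MockTable/MockField below are stand-ins for the server's TABLE/Field, and std::vector<bool> replaces the MY_BITMAP that the real code tests with bitmap_is_set(); everything here is illustrative, not the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mock stand-ins for the server's TABLE/Field structures.
struct MockField { std::string field_name; };
struct MockTable {
  std::vector<MockField> field;   // stand-in for TABLE::field[]
  std::vector<bool> read_set;     // stand-in for TABLE::read_set
  std::vector<bool> write_set;    // stand-in for TABLE::write_set
};

// Mirrors the loop described for Explain_join::explain_extra():
// collect every column whose bit is set in either bitmap.
std::vector<std::string> used_columns(const MockTable &t) {
  std::vector<std::string> cols;
  for (size_t i = 0; i < t.field.size(); ++i)
    if (t.read_set[i] || t.write_set[i])
      cols.push_back(t.field[i].field_name);
  return cols;
}
```

For a table with columns f1, f2, f3 where only f1 is read and only f3 is written, this produces the list ["f1", "f3"], matching how only referenced columns appear in the "used_columns" arrays of the HLS examples.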
Copyright (c) 2000, 2024, Oracle and/or its affiliates. All rights reserved.