The Data Structures and Algorithms Behind MySQL Indexes


Preface

There is a well-known saying in programming: "program = data structures + algorithms". I don't fully agree with it (I think a program is more than data structures and algorithms), but in daily study and work I have come to feel deeply how important data structures and algorithms are. Dig a little deeper into almost anything and you soon run into data structure and algorithm knowledge. For example, almost every programmer deals with databases. If you only store data, create tables, build indexes and do CRUD, you may think data structures have little to do with it. But if one day, on a sudden impulse, you want to learn how to optimize a database, you cannot avoid how indexes work; and if you want to truly understand what indexes are, how they work and how to use them sensibly to optimize the database, you will inevitably get tangled up in a pile of data structures and algorithms. So I do strongly agree that "the core of a program = data structures + algorithms", and anyone who wants to become a master programmer needs to learn this core foundation.

Having said all that, my point is simply this: if you want a thoroughly clear understanding of database indexes, you have to start from data structures and algorithms. Unfortunately, I have not yet found material on the Internet that explains database indexes from first principles (I mean popular material only; academic papers are not included). That is not because skilled programmers are lacking; our company alone has plenty of people who can explain databases thoroughly. These experts simply have no time or no interest in writing such an article, whether because they are busy or because of personal preference. Out of necessity, I, a half-baked programmer, recently did some hurried research on MySQL indexing. My understanding is still far from that of the real experts, but I summarize my shallow knowledge here anyway.

Abstract

Data structure and algorithm fundamentals

The essence of the index

B-Tree and B+Tree

Why use B-Tree(B+Tree)

MySQL index

MyISAM index

InnoDB index

Index usage strategy and optimization

The sample database

The leftmost prefix principle and related optimization

Index selectivity and prefix index

InnoDB primary key selection and insert optimization

Postscript

Reference

Abstract

This article takes MySQL as its subject and discusses several topics related to database indexes. Note that MySQL supports many storage engines, and each storage engine supports indexes differently, so MySQL supports several index types, such as BTree indexes, hash indexes and full-text indexes. To avoid confusion, this article focuses only on the BTree index, because it is the index most commonly dealt with when using MySQL; hash indexes and full-text indexes are not discussed.

The main content of the article is divided into three parts.

The first part discusses the mathematical basis of MySQL indexes, mainly from the perspective of data structure and algorithm theory.

The second part looks at the architecture of the MyISAM and InnoDB storage engines in MySQL and discusses topics such as clustered indexes, non-clustered indexes and covering indexes.

The third part builds on this theoretical basis to discuss strategies for using indexes in MySQL with high performance.

Data structure and algorithm fundamentals

The essence of the index

The official MySQL definition of an index is: an index is a data structure that helps MySQL retrieve data efficiently. Stripping the sentence down to its core gives the essence of an index: an index is a data structure.

We know that querying is one of the most important functions of a database. Take the following SQL statement, for example:

 

 
SELECT * FROM my_table WHERE col2 = '77'

It retrieves from the table "my_table" the records whose "col2" is "77".

We all want queries to be as fast as possible, so database designers optimize from the point of view of the search algorithm. The most basic search algorithm is of course sequential search (linear search): traverse "my_table" and compare "col2" with "77" row by row. This O(n) algorithm is clearly unacceptable when the data volume is large. Fortunately, computer science offers much better search algorithms, such as binary search and binary tree search. A little analysis shows that each search algorithm only works on a specific data structure: binary search requires the data to be sorted, while binary tree search only works on a binary search tree. The data itself cannot be organized to satisfy every such structure at once (for example, it is theoretically impossible to keep a table physically sorted on two columns at the same time). So, alongside the data, the database system maintains data structures that satisfy particular search algorithms and that reference (point to) the data in some way, so that advanced search algorithms can be run on them. Such a data structure is an index.

Look at an example:

image

Figure 1

Figure 1 shows one possible index. On the left is the data table, with two columns and seven records; the leftmost value is the physical address of each record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up lookups on Col2, we can maintain the binary search tree shown on the right, in which each node contains an index key and a pointer to the physical address of the corresponding record. A binary search can then find the corresponding data in O(log2n) time.

Although this is a genuine index, real database systems almost never implement indexes with a binary search tree or its evolved variant, the red-black tree. The reasons are described below.
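
To make the idea of Figure 1 concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the addresses and column values are made up rather than taken from the figure, and a sorted list searched with the standard bisect module stands in for the binary search tree. The names TABLE, sequential_search, build_index and index_search are hypothetical.

```python
import bisect

# Hypothetical table: physical address -> (col1, col2). Values are made up.
TABLE = {
    0x07: (1, 34),
    0x56: (2, 77),
    0x6A: (3, 5),
    0xF3: (4, 91),
    0x90: (5, 22),
    0x77: (6, 89),
    0xD1: (7, 23),
}


def sequential_search(table, col2):
    """O(n): compare col2 of every record in turn."""
    for address, (_, value) in table.items():
        if value == col2:
            return address
    return None


def build_index(table):
    """Sorted (col2, address) pairs; stands in for the search tree of Figure 1."""
    return sorted((value, address) for address, (_, value) in table.items())


def index_search(index, col2):
    """O(log n): binary search on the sorted keys, then return the 'pointer'."""
    pos = bisect.bisect_left(index, (col2,))
    if pos < len(index) and index[pos][0] == col2:
        return index[pos][1]
    return None


index = build_index(TABLE)
address = index_search(index, 77)
print(hex(address), TABLE[address])
```

Both functions return the same address. The difference is that the indexed lookup inspects about log2(n) keys instead of n, at the cost of maintaining a second structure next to the data.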

B-Tree and B+Tree

At present most database systems and file systems use the B-Tree or its variant the B+Tree as their index structure. The next section combines the principles of memory and disk access to discuss why the B-Tree and B+Tree are so widely used for indexes; this section first describes them purely as data structures.

B-Tree

To describe a B-Tree, first define a data record as a two-tuple [key, data], where key is the key of the record (different records have different keys) and data is the content of the record other than the key. A B-Tree is then a data structure that satisfies the following conditions:

  1. d is a positive integer greater than 1, called the degree of the B-Tree.
  2. h is a positive integer, called the height of the B-Tree.
  3. Each non-leaf node consists of n-1 keys and n pointers, where d<=n<=2d.
  4. Each leaf node contains at least one key and two pointers, and at most 2d-1 keys and 2d pointers; the pointers of leaf nodes are all null.
  5. All leaf nodes have the same depth, equal to the tree height h.
  6. Keys and pointers alternate, and both ends of a node are pointers.
  7. The keys in a node are arranged in non-decreasing order from left to right.
  8. All nodes form a tree structure.
  9. Each pointer is either null or points to another node.
  10. If the leftmost pointer of a node is not null, then all keys in the node it points to are less than v(key1), where v(key1) is the value of the node's first key.
  11. If the rightmost pointer of a node is not null, then all keys in the node it points to are greater than v(keym), where v(keym) is the value of the node's last key.
  12. If a pointer of a node lies between adjacent keys keyi and keyi+1 and is not null, then all keys in the node it points to are less than v(keyi+1) and greater than v(keyi).

Figure 2 is a schematic diagram of a d=2 B-Tree.

image

Figure 2

Because of these properties, retrieving data by key in a B-Tree is very straightforward: start with a binary search within the root node. If the key is found, return the data of the corresponding entry; otherwise recurse into the node pointed to by the pointer for the corresponding interval, until either the key is found or a null pointer is reached. The former means the search succeeded, the latter that it failed. The B-Tree search algorithm in pseudocode is as follows:

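BTree_Search(node, key)
{
    if(node == null) return null;
    // scan the keys of this node from left to right (keyCount = number of keys in the node)
    for(i = 0; i < node.keyCount; i++)
    {
        if(node.key[i] == key) return node.data[i];
        // key falls in the interval to the left of key[i]
        if(node.key[i] > key) return BTree_Search(node.point[i], key);
    }
    // key is greater than every key in this node: follow the rightmost pointer
    return BTree_Search(node.point[node.keyCount], key);
}
data = BTree_Search(root, my_key);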
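
For readers who want something runnable, the search described above can be sketched in Python as follows. This is an illustrative sketch, not how MySQL implements it: the BTreeNode class, the leaf helper and the sample keys are all assumptions made for the example, and an empty children list plays the role of the null pointers in leaf nodes.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class BTreeNode:
    keys: List[int]                       # sorted keys of this node
    data: List[Any]                       # data[i] belongs to keys[i]
    children: List["BTreeNode"] = field(default_factory=list)  # empty for leaves


def btree_search(node: Optional[BTreeNode], key: int) -> Any:
    """Return the data stored under key, or None if the key is absent."""
    if node is None:
        return None
    # Binary search inside the node, as the text describes.
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.data[i]
    if not node.children:                 # leaf reached: all pointers are null
        return None
    # children[i] covers the interval between keys[i-1] and keys[i].
    return btree_search(node.children[i], key)


def leaf(*keys: int) -> BTreeNode:
    """Hypothetical helper: build a leaf whose data are placeholder strings."""
    return BTreeNode(list(keys), [f"row-{k}" for k in keys])


# A small tree with d=2 and made-up keys.
root = BTreeNode([15, 56], ["row-15", "row-56"],
                 [leaf(3, 9), leaf(20, 33), leaf(77, 89)])

print(btree_search(root, 33))
print(btree_search(root, 40))
```

The only change from the pseudocode is that the scan inside a node uses a binary search instead of a left-to-right loop; both visit the same child.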

The B-Tree has a series of interesting properties. For example, for a B-Tree of degree d that indexes N keys, the upper bound on the tree height h is logd((N+1)/2), and the asymptotic complexity of the number of nodes visited when searching for a key is O(logdN). This shows that the B-Tree is a very efficient index data structure.

In addition, inserting or deleting records can break the B-Tree properties, so insertions and deletions require splitting, merging and transferring nodes to preserve them. This article does not intend to discuss these B-Tree topics in full, because plenty of material already describes the mathematical properties of the B-Tree and its insertion and deletion algorithms in detail. Interested readers can find the corresponding material in the reference list at the end of this article.

B+Tree

The B-Tree has many variants, the most common of which is the B+Tree; MySQL, for example, generally uses the B+Tree to implement its index structures.

Compared with B-Tree, B+Tree has the following differences:

  1. The upper limit on pointers per node is 2d rather than 2d+1.
  2. Internal nodes do not store data, only keys; leaf nodes do not store pointers.

Figure 3 is a simple B+Tree.

image

Figure 3

Because not all nodes have the same fields, the leaf nodes and internal nodes of a B+Tree generally differ in size. This is unlike the B-Tree: although different B-Tree nodes may hold different numbers of keys and pointers, every node has the same fields and the same upper limits, so B-Tree implementations often allocate the same amount of space for every node.

In general, the B+Tree is better suited than the B-Tree to implementing an index on external storage. The specific reasons have to do with the principles of external storage and computer access, which are discussed below.

B+Tree with sequential access pointers

The B+Tree structures generally used in database systems and file systems are optimized versions of the classic B+Tree, with sequential access pointers added.

image

Figure 4

As shown in Figure 4, adding to each leaf node of the B+Tree a pointer to the adjacent leaf node yields a B+Tree with sequential access pointers. The purpose of this optimization is to improve the performance of range access. For example, in Figure 4, to query all records with keys from 18 to 49, once 18 has been found we only need to traverse the leaf nodes in order along these pointers to visit all the data nodes in one pass, which greatly improves the efficiency of range queries.
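
The range access just described can be sketched in a few lines of Python. This is a simplified illustration, not an actual B+Tree: only the chained leaf level is modeled, the keys are made up rather than copied from Figure 4, and the search from the root that locates the first leaf is assumed to have happened already. The names Leaf, link and range_scan are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    next_leaf: Optional["Leaf"] = None    # the sequential access pointer


def link(leaves: List[Leaf]) -> None:
    """Chain each leaf to its right-hand neighbour."""
    for left, right in zip(leaves, leaves[1:]):
        left.next_leaf = right


def range_scan(start: Optional[Leaf], low: int, high: int) -> List[int]:
    """Collect keys in [low, high] by walking the leaf chain from `start`."""
    result = []
    leaf = start
    while leaf is not None:
        for key in leaf.keys:
            if key > high:                # past the range: stop immediately
                return result
            if key >= low:
                result.append(key)
        leaf = leaf.next_leaf             # no need to go back up the tree
    return result


# Made-up leaf level; keys are sorted across the whole chain.
leaves = [Leaf([5, 8, 9]), Leaf([10, 15, 18]), Leaf([20, 26, 27]),
          Leaf([28, 30, 33, 35]), Leaf([38, 50, 56, 60])]
link(leaves)

# Assume the tree search for key 18 ended in leaves[1].
print(range_scan(leaves[1], 18, 49))
```

Without the leaf pointers, every move from one leaf to the next would have to go back through the parent nodes.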

This section gave a brief introduction to the B-Tree and the B+Tree. The next section uses the principles of memory and disk access to explain why the B+Tree is currently the preferred data structure for database indexes.

Why use B-Tree(B+Tree)

As mentioned above, data structures such as the red-black tree can also be used to implement indexes, but file systems and database systems generally use the B-/+Tree as the index structure. This section draws on computer organization principles to discuss the theoretical basis for using the B-/+Tree as an index.

Generally speaking, the index itself is also very large and cannot be held entirely in memory, so indexes are usually stored on disk in the form of index files. As a result, an index search incurs disk I/O. Disk I/O is several orders of magnitude more expensive than memory access, so the most important measure of the quality of an index data structure is the asymptotic complexity of the number of disk I/O operations during a search. In other words, the index should be structured to minimize the number of disk I/O accesses during a search. Below we first introduce the principles of memory and disk access, and then use them to analyze the efficiency of the B-/+Tree as an index.

Memory access principle

At present, the main memory used in computers is basically random access memory (RAM). The structure and access principles of modern RAM are fairly complex; this article sets the specific differences aside and abstracts a very simple access model to illustrate how RAM works.

image

Figure 5

From an abstract point of view, main memory is a matrix made up of a series of storage units, each storing a fixed amount of data. Each storage unit has a unique address. The addressing rules of modern main memory are fairly complex; here they are simplified to a two-dimensional address: a row address and a column address together uniquely locate a storage unit. Figure 5 shows a 4 x 4 main memory model.

The main memory access process is as follows:

When the system needs to read main memory, it puts the address signal on the address bus and sends it to main memory. Main memory reads the address signal, decodes it and locates the specified storage unit, then puts the data of that unit on the data bus for other components to read.

Writing to main memory is similar: the system puts the unit address and the data to be written on the address bus and the data bus respectively, and main memory reads both buses and performs the corresponding write.

As can be seen here, main memory access time depends only on the number of accesses and grows linearly with it. Because no mechanical operation is involved, the "distance" between two accessed locations has no effect on the time; for example, fetching A0 then A1 takes the same time as fetching A0 then D3.

Disk access principle

As mentioned above, indexes are generally stored on disk as files, so index retrieval requires disk I/O. Unlike main memory, disk I/O involves the cost of mechanical movement, so the time consumed by disk I/O is huge.

Figure 6 is a schematic diagram of the overall structure of the disk.

image

Figure 6

A disk consists of coaxial circular platters of the same size, which can rotate (all platters must rotate synchronously). On one side of the disk is a head bracket holding a group of heads, each of which is responsible for accessing the contents of one platter. The heads cannot rotate, but they can move along the radius of the platter (in practice the motion is oblique). All heads must also stay coaxial at any moment, that is, viewed from directly above, all heads overlap at all times (although multi-head independent technology now exists that is not subject to this restriction).

Figure 7 is a schematic diagram of the disk structure.

image

Figure 7

The platter is divided into a series of concentric rings centered on the center of the platter. Each ring is called a track, and all tracks of the same radius form a cylinder. Each track is divided along radial lines into small segments; each segment is called a sector, and the sector is the smallest storage unit of the disk. For simplicity, we assume below that the disk has only one platter and one head.

When data needs to be read from the disk, the system passes the logical address of the data to the disk, and the control circuitry of the disk translates the logical address into a physical address according to its addressing logic, that is, it determines which track and which sector hold the data. To read the data in that sector, the head must be positioned above it. To do so, the head first moves to the right track; this process is called seeking, and the time it takes is the seek time. The disk then rotates the target sector under the head; the time this takes is the rotation time.

Principle of locality and disk read ahead

Because of the characteristics of the storage medium, disk access is much slower than main memory access; together with the cost of mechanical movement, disk access speed is often only a few hundredths of that of main memory, so to improve efficiency disk I/O must be minimized. To this end, the disk usually does not read strictly on demand but reads ahead every time: even if only one byte is needed, the disk starts at that position and sequentially reads a certain length of data onward into memory. The theoretical basis for this is the well-known principle of locality in computer science:

When a piece of data is used, nearby data is usually used soon afterwards.

The data needed during the execution of a program is usually fairly concentrated.

Because sequential disk reads are very efficient (no seek time and very little rotation time), read-ahead can improve I/O efficiency for programs with locality.

The read-ahead length is generally an integer multiple of the page size. A page is the logical block used in computer memory management: hardware and operating systems often divide main memory and disk storage into consecutive blocks of equal size, each called a page (in many operating systems the page size is usually 4K), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read signal to the disk, the disk finds the starting position of the data and sequentially reads one or more pages onward into memory, the fault handler returns, and the program continues to run.

Performance analysis of B-/+Tree indexes

We can now finally analyze the performance of B-/+Tree indexes.

As said above, the number of disk I/Os is generally used to evaluate the quality of an index structure. Start with the B-Tree: by definition, a search needs to visit at most h nodes. Database designers cleverly exploit disk read-ahead by setting the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, practical B-Tree implementations also use the following technique:

Each time a new node is created, a full page of space is requested directly. This guarantees that a node is physically stored within one page, and since computer storage allocation is page-aligned, a node requires only one I/O.

A search in a B-Tree requires at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h)=O(logdN). In practice the degree d is a very large number, usually more than 100, so h is very small (usually no more than 3).

In summary, using the B-Tree as an index structure is very efficient.

With a structure such as the red-black tree, h is clearly much larger. Because nodes that are logically close (parent and child) may be physically far apart, locality cannot be exploited, so although the I/O asymptotic complexity of the red-black tree is also O(h), its efficiency is clearly much worse than that of the B-Tree.
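
To put rough numbers on this comparison, the following Python sketch evaluates the height bound logd((N+1)/2) quoted earlier for the B-Tree and compares it with the minimum possible height of a binary tree holding the same number of keys. The values N = 1,000,000 and d = 100 are assumptions chosen for illustration, and both function names are hypothetical.

```python
import math


def btree_height_bound(n_keys: int, d: int) -> float:
    """Upper bound on B-Tree height from the text: log_d((N + 1) / 2)."""
    return math.log((n_keys + 1) / 2, d)


def binary_tree_min_height(n_keys: int) -> int:
    """A binary tree with N keys needs at least ceil(log2(N + 1)) levels."""
    return math.ceil(math.log2(n_keys + 1))


N = 1_000_000   # assumed number of indexed keys
D = 100         # assumed degree, in line with "usually more than 100"

print("B-Tree height bound:      ", round(btree_height_bound(N, D), 2))
print("binary tree minimum height:", binary_tree_min_height(N))
```

Under these assumptions the B-Tree bound stays below 3, while even a perfectly balanced binary tree needs 20 levels, and a red-black tree can be deeper than that. Since each level can cost one disk I/O, the gap in the number of I/Os per lookup is large.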

As mentioned above, the B+Tree is better suited to indexes on external storage, and the reason has to do with the degree d of the internal nodes. The analysis above shows that the larger d is, the better the index performs, and the upper bound on d depends on the sizes of the key and the data within a node:

dmax = floor(pagesize / (keysize + datasize + pointsize)) (pagesize – dmax >= pointsize)

Or

dmax = floor(pagesize / (keysize + datasize + pointsize)) - 1 (pagesize – dmax <pointsize)

floor means rounding down. Because B+Tree internal nodes have no data field, they can have a larger degree and therefore better performance.
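
As a rough worked example of the basic form floor(pagesize / (keysize + datasize + pointsize)), the following Python sketch compares a node that carries a data field with one that does not. All sizes are assumptions chosen for illustration (a 4K page as mentioned above, 8-byte keys and pointers, 100 bytes of data per entry), the function name max_degree is hypothetical, and the side conditions attached to the two formulas above are ignored.

```python
def max_degree(page_size: int, key_size: int, data_size: int, pointer_size: int) -> int:
    """dmax = floor(pagesize / (keysize + datasize + pointsize))."""
    return page_size // (key_size + data_size + pointer_size)


PAGE_SIZE = 4096      # assumed 4K page
KEY_SIZE = 8          # assumed key size in bytes
POINTER_SIZE = 8      # assumed pointer size in bytes
DATA_SIZE = 100       # assumed size of the data stored with each key

# B-Tree: every node stores data next to its keys.
btree_dmax = max_degree(PAGE_SIZE, KEY_SIZE, DATA_SIZE, POINTER_SIZE)

# B+Tree internal node: no data field, so datasize is 0.
bplus_dmax = max_degree(PAGE_SIZE, KEY_SIZE, 0, POINTER_SIZE)

print("B-Tree dmax:", btree_dmax)
print("B+Tree internal node dmax:", bplus_dmax)
```

With these assumed sizes the B-Tree node fits 35 entries per page while the B+Tree internal node fits 256, which is why dropping the data field from internal nodes yields a flatter tree.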

This chapter discussed, from a theoretical angle, the data structures and algorithms related to indexes. The next chapter discusses how the B+Tree is concretely implemented as an index in MySQL, and uses the MyISAM and InnoDB storage engines to introduce two different forms of index: the non-clustered index and the clustered index.

MySQL index

In MySQL, indexes are a storage-engine-level concept, and different storage engines implement indexes differently. This article mainly discusses the index implementations of the MyISAM and InnoDB storage engines.

MyISAM index

The MyISAM engine uses the B+Tree as its index structure, and the data field of a leaf node stores the address of the data record. Below is a schematic diagram of a MyISAM index:

image

Figure 8

Here the table has three columns in total. Assuming Col1 is the primary key, Figure 8 shows the primary index (Primary key) of a MyISAM table. You can see that a MyISAM index file stores only the addresses of data records. In MyISAM, the primary index and secondary indexes (Secondary key) have exactly the same structure; the only difference is that keys in the primary index must be unique, while keys in a secondary index may repeat. If we build a secondary index on Col2, the structure of that index is as shown below:

image

Figure 9

It is also a B+Tree whose data field stores the address of the data record. Therefore, index retrieval in MyISAM first searches the index using the B+Tree search algorithm; if the specified key exists, it takes the value of its data field and then uses that value as an address to read the corresponding data record.

The MyISAM way of indexing is also called "non-clustered"; the name distinguishes it from the clustered index of InnoDB.

InnoDB index

Although InnoDB also uses the B+Tree as its index structure, its implementation is quite different from that of MyISAM.

The first major difference is that in InnoDB the data file itself is the index file. As described above, the MyISAM index file and data file are separate, and the index file stores only the addresses of data records. In InnoDB, the table data file itself is an index structure organized as a B+Tree, and the data field of the leaf nodes of this tree stores the complete data record. The key of this index is the primary key of the table, so the InnoDB table data file itself is the primary index.

image

Figure 10

Figure 10 is a schematic diagram of an InnoDB primary index (which is also the data file); you can see that the leaf nodes contain the complete data records. This kind of index is called a clustered index. Because InnoDB data files are themselves clustered by primary key, InnoDB requires a table to have a primary key (MyISAM does not). If none is explicitly specified, MySQL automatically selects a column that can uniquely identify data records as the primary key. If no such column exists, MySQL automatically generates a hidden field as the primary key of the InnoDB table; this field is 6 bytes long and its type is long integer.

The second difference from MyISAM is that the data field of an InnoDB secondary index stores the value of the primary key of the corresponding record rather than its address. In other words, all InnoDB secondary indexes reference the primary key as their data field. For example, Figure 11 shows a secondary index defined on Col3:

image

Figure 11

Here the ASCII code of English characters is used as the comparison criterion. This implementation of the clustered index makes searching by primary key very efficient, but a secondary index search requires two index lookups: first search the secondary index to obtain the primary key, then use the primary key to retrieve the record from the primary index.
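
The difference between the two lookup paths can be sketched in Python. This is a deliberately simplified model and not how either engine is implemented: plain dictionaries stand in for the B+Trees, the three-column rows and their addresses are made up, and the secondary key is assumed to be unique so that one dictionary entry per key is enough. All names are hypothetical.

```python
# Hypothetical rows (Col1, Col2, Col3); Col1 is the primary key.
ROWS = [(15, 34, "Bob"), (18, 77, "Alice"), (20, 5, "Jim")]

# MyISAM-style: records live in a separate data file, addressed by position.
data_file = {0x10 + 0x10 * i: row for i, row in enumerate(ROWS)}
myisam_primary = {row[0]: addr for addr, row in data_file.items()}   # Col1 -> address
myisam_col3 = {row[2]: addr for addr, row in data_file.items()}      # Col3 -> address


def myisam_lookup(index, key):
    """One index search, then read the record at the stored address."""
    addr = index.get(key)
    return None if addr is None else data_file[addr]


# InnoDB-style: the primary index holds the full record in its leaves,
# and a secondary index stores the primary key instead of an address.
innodb_primary = {row[0]: row for row in ROWS}       # Col1 -> full record
innodb_col3 = {row[2]: row[0] for row in ROWS}       # Col3 -> Col1


def innodb_secondary_lookup(key):
    """Two index searches: secondary index first, then the primary index."""
    primary_key = innodb_col3.get(key)
    return None if primary_key is None else innodb_primary[primary_key]


print(myisam_lookup(myisam_col3, "Alice"))
print(innodb_secondary_lookup("Alice"))
print(innodb_primary[18])    # primary key lookup needs a single search
```

Both paths return the same record. In the MyISAM-style model the primary and secondary indexes behave identically, whereas in the InnoDB-style model only a secondary lookup pays for the second search.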

Understanding how different storage engines implement indexes is very helpful for using and optimizing indexes correctly. For example, once you know how InnoDB indexes are implemented, it is easy to see why an overly long field is not recommended as the primary key: every secondary index references the primary index, so an overly long primary key makes the secondary indexes too large. Likewise, using a non-monotonic field as the primary key in InnoDB is not a good idea: the InnoDB data file itself is a B+Tree, and a non-monotonic primary key causes frequent splits and adjustments when new records are inserted in order to maintain the B+Tree properties, which is very inefficient. An auto-increment field is a good choice of primary key.

The next chapter discusses these index-related optimization strategies in detail.

Index usage strategy and optimization

MySQL optimization mainly consists of schema optimization and query optimization. The high-performance indexing strategies discussed in this chapter mainly belong to schema optimization. The content of this chapter rests entirely on the theory above; in fact, once you understand the mechanism behind indexes, choosing a high-performance strategy becomes pure reasoning, and you can understand the logic behind these strategies.

The sample database

To discuss indexing strategies, we need a database with a reasonable amount of data as an example. This article uses one of the sample databases provided in the official MySQL documentation: employees. It is moderately complex and contains a large amount of data. Below is the E-R diagram of this database (quoted from the official MySQL manual):

image

Figure 12

The page for this database in the official MySQL documentation is http://dev.mysql.com/doc/employee/en/employee.html. It describes the database in detail and provides the download address and import instructions; if you are interested in importing the database into your own MySQL, refer to its content.

The left prefix principle and related optimization

The first condition for using indexes efficiently is knowing what kinds of queries use an index. This question is tied to the "leftmost prefix principle" of the B+Tree, which is illustrated below through examples.

First, a word about composite indexes. Above we assumed that an index references only a single column; in fact, a MySQL index can reference multiple columns in a certain order. Such an index is called a composite index. In general, a composite index is an ordered tuple <a1, a2, …, an> whose elements are columns of a table. A strict definition of an index requires relational algebra, but I don't want to go too far into relational algebra here because that would be very boring, so no strict definition is given. A single-column index can also be regarded as the special case of a composite index with one element.

Taking the employees.titles table as an example, let us first check what indexes it has:

SHOW INDEX FROM employees.titles;
+--------+------------+----------+--------------+-------------+-----------+-------------+------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Null | Index_type |
+--------+------------+----------+--------------+-------------+-----------+-------------+------+------------+
| titles | 0 | PRIMARY | 1 | emp_no | A | NULL | | BTREE |
| titles | 0 | PRIMARY | 2 | title | A | NULL | | BTREE |
| titles | 0 | PRIMARY | 3 | from_date | A | 443308 | | BTREE |
| titles | 1 | emp_no | 1 | emp_no | A | 443308 | | BTREE |
+--------+------------+----------+--------------+-------------+-----------+-------------+------+------------+

The results show that the primary index of the titles table is <emp_no, title, from_date>, and that there is also a secondary index <emp_no>. To keep multiple indexes from complicating things (the MySQL SQL optimizer behaves in more complex ways when there are multiple indexes), we drop the secondary index here:

ALTER TABLE employees.titles DROP INDEX emp_no;

This lets us concentrate on analyzing the behavior of the PRIMARY index.

Case one: matching all columns.
EXPLAIN SELECT * FROM employees.titles WHERE emp_no='10001' AND title='Senior Engineer' AND from_date='1986-06-26';
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+
| 1 | SIMPLE | titles | const | PRIMARY | PRIMARY | 59 | const,const,const | 1 | |
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+

Clearly, when all columns of the index are matched exactly (exact matching here means "=" or "IN"), the index can be used. One thing to note: in theory the index is sensitive to column order, but because the MySQL query optimizer automatically reorders the conditions in the WHERE clause to use a suitable index, we can, for example, reverse the order of the conditions:

EXPLAIN SELECT * FROM employees.titles WHERE from_date='1986-06-26' AND emp_no='10001' AND title='Senior Engineer';
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+
| 1 | SIMPLE | titles | const | PRIMARY | PRIMARY | 59 | const,const,const | 1 | |
+----+-------------+--------+-------+---------------+---------+---------+-------------------+------+-------+

The effect is the same.

Case two: leftmost prefix matching.
EXPLAIN SELECT * FROM employees.titles WHERE emp_no='10001';
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------+

When the query conditions exactly match one or more consecutive columns from the left of the index, such as <emp_no> or <emp_no, title>, the index can be used, but only partially, namely the leftmost prefix made up of those columns. The analysis of the query above shows that the PRIMARY index was used, but key_len is 4, indicating that only the first column of the index was used.

Case three: the query conditions exactly match columns of the index, but a condition on a middle column is missing.
EXPLAIN SELECT * FROM employees.titles WHERE emp_no='10001' AND from_date='1986-06-26';
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 4 | const | 1 | Using where |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+

The index usage is the same as in case two: because title is not provided, the query uses only the first column of the index. Although from_date is also in the index, it cannot be joined to the leftmost prefix because title is missing, so the results must be scanned and filtered by from_date (here emp_no is unique, so there is effectively no scan). If you want from_date to use the index as well instead of WHERE filtering, you can add a secondary index <emp_no, from_date>, and the query above will then use it. Alternatively, you can use an optimization sometimes called "column isolation" to fill the "pit" between emp_no and from_date.

First, let us see how many distinct values title has:

SELECT DISTINCT(title) FROM employees.titles;
+--------------------+
| title |
+--------------------+
| Senior Engineer |
| Staff |
| Engineer |
| Senior Staff |
| Assistant Engineer |
| Technique Leader |
| Manager |
+--------------------+

There are only 7. When the "pit" column has relatively few distinct values like this, you can consider using "IN" to fill the "pit" and form the leftmost prefix:

EXPLAIN SELECT * FROM employees.titles
WHERE emp_no='10001'
AND title IN ('Senior Engineer', 'Staff', 'Engineer', 'Senior Staff', 'Assistant Engineer', 'Technique Leader', 'Manager')
AND from_date='1986-06-26';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 59 | NULL | 7 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

This time key_len is 59, which shows that the full index was used, but type and rows show that IN actually performs a range query that checks 7 keys. Compare the performance of the two queries:

SHOW PROFILES;
+----------+------------+-------------------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+-------------------------------------------------------------------------------+
| 10 | 0.00058000 | SELECT * FROM employees.titles WHERE emp_no='10001' AND from_date='1986-06-26'|
| 11 | 0.00052500 | SELECT * FROM employees.titles WHERE emp_no='10001' AND title IN ... |
+----------+------------+-------------------------------------------------------------------------------+

"Filling the pit" gives a small performance boost. If many rows remained after filtering on emp_no, the advantage of the second query would be more obvious. Of course, if title had many distinct values, filling the pit would not be appropriate and a secondary index would have to be created instead.

Case four: the query does not specify the first column of the index.
EXPLAIN SELECT * FROM employees.titles WHERE from_date='1986-06-26';
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | titles | ALL | NULL | NULL | NULL | NULL | 443308 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+

Because the condition is not a leftmost prefix of the index, a query like this cannot use the index.
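The leftmost-prefix behaviour in cases three and four can be pictured with a sorted list standing in for the composite index. This is only a minimal Python sketch of the idea, not how MySQL is implemented; the `index` list, the helper names, and every row except the first are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for the <emp_no, title, from_date> index: key
# tuples kept in sorted order, as they would be in B+Tree leaf nodes.
# Only the first row comes from the article; the rest are invented.
index = sorted([
    (10001, "Senior Engineer", "1986-06-26"),
    (10002, "Staff", "1996-08-03"),
    (10004, "Engineer", "1986-12-01"),
    (10004, "Senior Engineer", "1995-12-01"),
    (10005, "Senior Staff", "1996-09-12"),
    (10005, "Staff", "1989-09-12"),
])

TITLES = ["Senior Engineer", "Staff", "Engineer", "Senior Staff",
          "Assistant Engineer", "Technique Leader", "Manager"]


def probe(key):
    """Binary-search the sorted index for one complete key."""
    return index[bisect_left(index, key):bisect_right(index, key)]


def fill_the_pit(emp_no, from_date):
    """Emulate title IN (...): one exact probe per candidate title."""
    hits = []
    for title in TITLES:
        hits.extend(probe((emp_no, title, from_date)))
    return hits


def by_from_date_only(from_date):
    """No leftmost prefix: from_date is not globally sorted, so scan all keys."""
    return [key for key in index if key[2] == from_date]


print(fill_the_pit(10001, "1986-06-26"))
print(by_from_date_only("1986-06-26"))
```

Both calls find the same row, but the first does seven binary searches while the second has to look at every key, which is the difference between the range plan in case three and the ALL plan in case four.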

Case five: matching a prefix of a column's string value.
EXPLAIN SELECT * FROM employees.titles WHERE emp_no='10001' AND title LIKE 'Senior%';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 56 | NULL | 1 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

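Here the index can be used. However, if the pattern starts with a wildcard (for example '%Senior'), the index cannot be used for that column; only the literal part before the first wildcard can narrow the search.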
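Why a trailing wildcard can use the index while a leading one cannot can be sketched in Python with a sorted list of strings. This is a simplified model under the assumption that index keys are compared as plain strings; the helper names `like_prefix` and `like_suffix` are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical sorted title values, standing in for one index column.
titles = sorted(["Assistant Engineer", "Engineer", "Manager",
                 "Senior Engineer", "Senior Staff", "Staff",
                 "Technique Leader"])


def like_prefix(prefix):
    """Emulate title LIKE 'prefix%' as a range scan over sorted keys."""
    start = bisect_left(titles, prefix)
    matches = []
    for value in titles[start:]:
        if not value.startswith(prefix):
            break  # sorted order: no later value can match
        matches.append(value)
    return matches


def like_suffix(suffix):
    """Emulate title LIKE '%suffix': every key has to be examined."""
    return [value for value in titles if value.endswith(suffix)]


print(like_prefix("Senior"))
print(like_suffix("Engineer"))
```

Values sharing a prefix sit next to each other in sorted order, so the scan can start at the first candidate and stop at the first mismatch; values sharing a suffix are scattered, so nothing short of a full pass finds them.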

Case six: range queries.
EXPLAIN SELECT * FROM employees.titles WHERE emp_no<'10010' and title='Senior Engineer';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 4 | NULL | 16 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

A range column can use the index (it must be part of a leftmost prefix), but the columns after the range column cannot. In addition, an index can be used for at most one range column, so if the query has range conditions on two columns, the index cannot serve both of them.
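A rough way to picture why columns after a range column cannot narrow the search is the Python sketch below. It models the index as a sorted list of tuples; the data and the function `range_then_filter` are invented for illustration and do not reflect MySQL internals.

```python
from bisect import bisect_left

# Hypothetical <emp_no, title> keys in index order (values invented).
index = sorted([
    (10001, "Senior Engineer"),
    (10002, "Staff"),
    (10004, "Engineer"),
    (10004, "Senior Engineer"),
    (10009, "Senior Engineer"),
    (10012, "Senior Engineer"),
])


def range_then_filter(max_emp_no, title):
    """Emulate emp_no < max_emp_no AND title = title.

    The range on emp_no bounds the scan; title is checked entry by entry,
    because titles are only sorted within a single emp_no value.
    """
    end = bisect_left(index, (max_emp_no,))
    scanned = index[:end]
    matched = [key for key in scanned if key[1] == title]
    return len(scanned), matched


scanned_count, matched = range_then_filter(10010, "Senior Engineer")
print(scanned_count, matched)
```

Every key below the emp_no bound is visited, and the title condition only filters what the range has already selected.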

EXPLAIN SELECT * FROM employees.titles
WHERE emp_no<'10010'
AND title='Senior Engineer'
AND from_date BETWEEN '1986-01-01' AND '1986-12-31';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 4 | NULL | 16 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

You can see that the index can do nothing for the second range condition. One interesting point about MySQL deserves special note here: EXPLAIN alone may not let you distinguish a range scan from multi-value matching, because both are shown as range in the type column. Also, using BETWEEN does not necessarily mean a range query, as the following query shows:

EXPLAIN SELECT * FROM employees.titles
WHERE emp_no BETWEEN '10001' AND '10010'
AND title='Senior Engineer'
AND from_date BETWEEN '1986-01-01' AND '1986-12-31';
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 59 | NULL | 16 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

It looks as though this query uses two range conditions, but the BETWEEN on emp_no is in effect equivalent to IN; that is, emp_no is really matched against multiple exact values. You can see that the query uses all three columns of the index (key_len is 59). So in MySQL you need to distinguish carefully between multi-value matching and range matching, otherwise MySQL's behavior will seem confusing.

Case seven: the query condition contains a function or expression.

Unfortunately, if a query condition contains a function or expression, MySQL does not use the index for that column (even though in some cases it mathematically could). For example:

EXPLAIN SELECT * FROM employees.titles WHERE emp_no='10001' AND left(title, 6)='Senior';
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 4 | const | 1 | Using where |
+----+-------------+--------+------+---------------+---------+---------+-------+------+-------------+

Although this query does the same thing as the one in case five, the use of the left function means the index cannot be applied to the title column, whereas the LIKE version in case five can use it. Another example:

EXPLAIN SELECT * FROM employees.titles WHERE emp_no - 1='10000';
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | titles | ALL | NULL | NULL | NULL | NULL | 443308 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+--------+-------------+

Obviously this query is equivalent to looking up emp_no 10001, but because the condition is an expression, MySQL cannot use the index. It seems MySQL is not smart enough to simplify such expressions automatically, so when writing queries, avoid expressions on indexed columns: do the algebra by hand first and rewrite the condition so that the bare column stands alone.
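The effect of moving the arithmetic off the column can be sketched in Python. This is an illustrative model only, assuming a sorted list of emp_no values in place of the primary index; both helper functions are invented names.

```python
from bisect import bisect_left

# Hypothetical sorted emp_no values standing in for the primary index.
emp_nos = list(range(10001, 10021))


def scan_with_expression(target):
    """Emulate WHERE emp_no - 1 = target: the expression is evaluated per row."""
    matches = [emp_no for emp_no in emp_nos if emp_no - 1 == target]
    return matches, len(emp_nos)  # every row was examined


def seek_bare_column(target):
    """Emulate WHERE emp_no = target + 1: algebra first, then a key search."""
    key = target + 1
    pos = bisect_left(emp_nos, key)
    if pos < len(emp_nos) and emp_nos[pos] == key:
        return emp_nos[pos]
    return None


print(scan_with_expression(10000))
print(seek_bare_column(10000))
```

Both forms return employee 10001, but only the second can use the sort order of the keys, which is why the rewritten condition is the one to put in the query.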

Index selectivity and prefix index

Since an index can speed up queries, should we build an index whenever a query needs one? The answer is no. Indexes speed up queries, but they come at a price: the index file itself consumes storage space, indexes add overhead to inserting, deleting and modifying records, and MySQL also spends resources at run time maintaining them. So more indexes is not always better. In general, building an index is not recommended in two cases.

The first is when the table has relatively few records, for example one or two thousand rows, or even only a few hundred. There is no need to index such a table; a full table scan is fine. As for how many records count as many, opinions differ. My personal rule of thumb is 2000 rows: below that, consider not indexing; above that, consider indexes as appropriate.

The second case where an index is not recommended is when its selectivity is low. The selectivity of an index is the ratio of the number of distinct index values (also called the cardinality) to the number of records in the table (#T):

Index Selectivity = Cardinality / #T

Selectivity obviously lies in the range (0, 1], and the higher the selectivity, the more valuable the index, which follows from the nature of the B+Tree. For example, for the employees.titles table used above, if the title column is often queried on its own, should it be indexed? Let's look at its selectivity:

SELECT count(DISTINCT(title))/count(*) AS Selectivity FROM employees.titles;
+-------------+
| Selectivity |
+-------------+
| 0.0000 |
+-------------+

The selectivity of title is less than 0.0001 (the exact value is 0.00001579), so there is really no need to build a separate index on it.
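The selectivity formula is simple enough to check by hand. The Python sketch below is a minimal illustration; the function name `selectivity` is invented, and 443308 is taken from the rows column of the EXPLAIN output above on the assumption that it is the table's record count.

```python
def selectivity(values):
    """Index selectivity = number of distinct values / number of records."""
    if not values:
        raise ValueError("selectivity is undefined for an empty table")
    return len(set(values)) / len(values)


# Figures from the article: 7 distinct titles, 443308 rows (assumed #T).
print(f"{7 / 443308:.8f}")

# A tiny made-up column: 3 distinct values in 4 records.
print(selectivity(["a", "b", "a", "c"]))
```

Printed to eight decimal places the first figure shows the exact value quoted in the text, which the four-decimal output of the SQL query rounds down to 0.0000.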

One index optimization strategy related to selectivity is the prefix index: a prefix of the column, rather than the whole column, is used as the index key. When the prefix length is chosen well, the selectivity of the prefix index is close to that of the full-column index, while the shorter key reduces the size and maintenance cost of the index file. The following uses the employees.employees table as an example to show how to choose and use a prefix index.

As Figure 12 shows, the employees table has only one index, <emp_no>, so searching for a person by name requires a full table scan:

EXPLAIN SELECT * FROM employees.employees WHERE first_name='Eric' AND last_name='Anido';
+----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | employees | ALL | NULL | NULL | NULL | NULL | 300024 | Using where |
+----+-------------+-----------+------+---------------+------+---------+------+--------+-------------+

If employees are frequently searched by name, this is obviously very inefficient, so we can consider building an index. There are two options, <first_name> or <first_name, last_name>. Let's look at the selectivity of each:

SELECT count(DISTINCT(first_name))/count(*) AS Selectivity FROM employees.employees;
+-------------+
| Selectivity |
+-------------+
| 0.0042 |
+-------------+
 
SELECT count(DISTINCT(concat(first_name, last_name)))/count(*) AS Selectivity FROM employees.employees;
+-------------+
| Selectivity |
+-------------+
| 0.9313 |
+-------------+

The selectivity of <first_name> is clearly too low, while <first_name, last_name> has good selectivity, but first_name and last_name have a combined length of 30. Is there a way to balance length and selectivity? We can consider indexing first_name together with the first few characters of last_name, for example <first_name, left(last_name, 3)>. Let's look at its selectivity:

SELECT count(DISTINCT(concat(first_name, left(last_name, 3))))/count(*) AS Selectivity FROM employees.employees;
+-------------+
| Selectivity |
+-------------+
| 0.7879 |
+-------------+

The selectivity is decent, but still some distance from 0.9313, so let's increase the last_name prefix to 4:

SELECT count(DISTINCT(concat(first_name, left(last_name, 4))))/count(*) AS Selectivity FROM employees.employees;
+-------------+
| Selectivity |
+-------------+
| 0.9007 |
+-------------+

This selectivity is now quite satisfactory, and the index length is only 18, which is 40% shorter than the 30 of <first_name, last_name>. Let's build this prefix index:

ALTER TABLE employees.employees
ADD INDEX `first_name_last_name4` (first_name, last_name(4));

Then run the query by name again and compare the results before and after building the index:

SHOW PROFILES;
+----------+------------+---------------------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+---------------------------------------------------------------------------------+
| 87 | 0.11941700 | SELECT * FROM employees.employees WHERE first_name='Eric' AND last_name='Anido' |
| 90 | 0.00092400 | SELECT * FROM employees.employees WHERE first_name='Eric' AND last_name='Anido' |
+----------+------------+---------------------------------------------------------------------------------+

The performance improvement is significant: the query is roughly 129 times faster (0.1194 s before versus 0.0009 s after).

A prefix index balances index size against query speed, but its drawback is that it cannot be used for ORDER BY or GROUP BY operations, nor as a covering index (that is, an index that itself contains all the data the query needs, so the data file does not have to be accessed).
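The trial-and-error search for a prefix length shown above can be sketched as a small Python routine. This is an illustration under stated assumptions: the sample names are invented rather than taken from the employees database, and `prefix_selectivity` and `shortest_prefix` are hypothetical helper names.

```python
def prefix_selectivity(rows, prefix_len):
    """Selectivity of <first_name, left(last_name, prefix_len)>."""
    keys = {(first, last[:prefix_len]) for first, last in rows}
    return len(keys) / len(rows)


def shortest_prefix(rows, target):
    """Smallest last_name prefix whose selectivity reaches target, or None if none does."""
    longest = max(len(last) for _, last in rows)
    for n in range(1, longest + 1):
        if prefix_selectivity(rows, n) >= target:
            return n
    return None


# Invented sample names, not taken from the employees database.
rows = [
    ("Eric", "Anido"), ("Eric", "Andreotta"), ("Eric", "Anger"),
    ("Mary", "Sluis"), ("Mary", "Slutz"), ("Mary", "Smith"),
]

for n in (1, 2, 3, 4):
    print(n, round(prefix_selectivity(rows, n), 2))
print(shortest_prefix(rows, 0.8))
```

On real data the target would be chosen relative to the full-column selectivity, in the same way the text compares 0.7879 and 0.9007 against 0.9313.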

InnoDB primary key selection and insert optimization

When using the InnoDB storage engine, unless there is a special need, always use an auto-increment column unrelated to the business data as the primary key.

Posts and blogs often debate the choice of primary key. Some recommend a business-independent auto-increment primary key; others feel this is unnecessary and that a unique field such as a student number or ID card number can serve as the primary key. Whichever side is argued, most of the arguments are at the business level. From the standpoint of database index optimization, using the InnoDB engine without an auto-increment primary key is a bad idea.

As discussed in the section on InnoDB's index implementation, InnoDB uses a clustered index: the data records themselves are stored in the leaf nodes of the primary index (a B+Tree). This requires the records within one leaf node (the size of a memory page or disk page) to be stored in primary key order, so whenever a new record is inserted, MySQL places it in the appropriate node and position according to its primary key. If the page reaches the load factor (15/16 by default in InnoDB), a new page (node) is opened.

If the table uses an auto-increment primary key, each newly inserted record is appended in order right after the last record of the current index node, and when a page is full, a new page is opened automatically, as shown below:

image

Figure 13

This forms a compact index structure that is filled approximately sequentially. Because no existing data has to be moved on each insert, it is very efficient and adds little overhead to index maintenance.

If a non-auto-increment primary key is used (such as an ID card number or student number), each inserted primary key value is approximately random, so each new record has to be inserted somewhere in the middle of an existing index page:

image

Figure 14

MySQL then has to move data in order to insert the new record at the right position. The target page may even have already been written back to disk and evicted from the cache, in which case it has to be read back from disk. All of this adds a lot of overhead. The frequent moving and page-splitting operations also produce a great deal of fragmentation and leave the index structure less compact, so OPTIMIZE TABLE has to be used later to rebuild the table and optimize how the pages are filled.
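The contrast between sequential and random primary keys can be reproduced with a toy simulation in Python. This is a deliberately simplified model, not InnoDB's actual algorithm: the page capacity of 16 records is an arbitrary assumption, the 15/16 load factor is not modelled, a full page simply splits in half, and the function `simulate` is an invented name.

```python
import random
from bisect import bisect_left, insort

PAGE_CAPACITY = 16  # hypothetical records per leaf page, chosen for illustration


def simulate(keys):
    """Insert keys into sorted leaf pages; return (pages, splits, fill ratio)."""
    pages = [[]]  # leaf pages in key order, each one a sorted list
    splits = 0
    for key in keys:
        # Pick the first page whose largest key is >= key, else the last page.
        last_keys = [page[-1] for page in pages if page]
        i = min(bisect_left(last_keys, key), len(pages) - 1)
        page = pages[i]
        if len(page) == PAGE_CAPACITY:
            if i == len(pages) - 1 and key > page[-1]:
                pages.append([key])  # append past the end: open a new page
                continue
            # Landing inside a full page: split it in half, then insert.
            mid = PAGE_CAPACITY // 2
            pages[i:i + 1] = [page[:mid], page[mid:]]
            splits += 1
            page = pages[i] if key <= pages[i][-1] else pages[i + 1]
        insort(page, key)  # shifts later records within the page
    fill = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
    return len(pages), splits, fill


sequential = list(range(1, 1601))    # auto-increment style keys
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)  # random-looking keys, fixed seed

print("sequential:", simulate(sequential))
print("random:    ", simulate(shuffled))
```

With sequential keys every page is filled completely and no page is ever split; with shuffled keys the same records need more pages, each only partly full, which is the fragmentation described above.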

Therefore, whenever you can, use an auto-increment column as the primary key on InnoDB.

Postscript

This article was written on and off over half a month, and the above is its main content. Admittedly, it is to some extent armchair theorizing, because my use of MySQL is at a beginner's level and I do not have much database tuning experience, so talking about index tuning here is a little presumptuous. Please treat it as my personal study notes.

In fact, database index tuning is a craft that cannot rely on theory alone, because real situations vary endlessly and MySQL's own mechanisms are quite complex; the query optimization strategy and the implementation differences between engines, for example, complicate matters further. At the same time, theory is the foundation of index tuning: only on the basis of theory can one reason sensibly about tuning strategies and understand the mechanisms behind them, and then, combined with experimentation and exploration in practice, achieve the goal of using MySQL indexes efficiently.

In addition, MySQL indexes and their optimization cover a very wide scope, and this article touches on only part of it. Topics such as index optimization for sorting (ORDER BY) and covering indexes are not covered here. Besides B-Tree indexes, MySQL also supports hash indexes, full-text indexes and so on, depending on the engine, and these are not covered either. If there is a chance, I hope to fill in the parts that were left out.

Reference

[1] Baron Schwartz et al., translated by Wang Xiaodong et al.; High Performance MySQL; Electronic Industry Press, 2010

[2] Michael Kofler, translated by Yang Xiaoyun et al.; The Definitive Guide to MySQL 5; Posts and Telecommunications Press, 2006

[3] Jiang Chengyao; MySQL Technology Insider: InnoDB Storage Engine; Machinery Industry Press, 2011

[4] D. Comer; Ubiquitous B-tree; ACM Computing Surveys (CSUR), 1979

[5] Codd, E. F. (1970). "A relational model of data for large shared data banks". Communications of the ACM, Vol. 13, No. 6, pp. 377-387

[6] The MySQL 5.1 reference manual: http://dev.mysql.com/doc/refman/5.1/zh/index.html

This article is released under the Attribution-NonCommercial 3.0 license. Reprinting and adaptation are welcome, but the author's name, Zhang Yang, must be retained (with a link), and the article may not be used for commercial purposes. If you have any questions or would like to discuss licensing, please contact me.


Posted by Elijah at May 10, 2014 - 10:55 AM