Skip to content

Latest commit

 

History

History
375 lines (358 loc) · 24.3 KB

index.md

File metadata and controls

375 lines (358 loc) · 24.3 KB

索引

基本原理

索引类别

索引用于加快数据查询,是一种数据结构。对于InnoDB引擎,索引的实现是B+树。mysql中支持多种类型的索引, 本文只讨论主键索引、普通单列索引、联合索引。

  • 主键索引:主键索引又叫聚簇索引,叶子节点存储的是数据表的某一行数据。表中的primary key会用于创建主键索引。 假设表中id列表示主键,下面是主键索引简单样例说明:

    其中 Px 表示指向下一个磁盘块的指针;数字表示主键值;叶子节点最后两行表示当前行实际存储数据。
                                        磁盘1
                                        +----+----+----+-----+
                                        | P1 | 44 | P2 | 100 |
                                        +----+----+----+-----+
                                           |         |
                      +--------------------+         +----------------------+
           磁盘2      v                                           磁盘3      v
           +----+---+----+----+----+                             +----+----+----+
           | P1 | 8 | P2 | 37 | P3 |                             | P1 | 65 | P2 |
           +----+---+----+----+----+                             +----+----+----+
             |        |          |                                 |          |
          +--+        +-------+   +---------------+                |          +--------+
          v                   v                   v                v                   v
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  2  |  6  | <---> |  8  |  18 | <---> |  37 |  38 | <---> |  44 | <---> |  65 |  78 |  85 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  a  |  b  |       |  c  |  d  |       |  e  |  f  |       |  g  |       |  h  |  i  |  j  |
    |  22 |  26 |       |  28 |  38 |       |  47 |  48 |       |  54 |       |  78 |  88 |  95 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
       磁盘4                磁盘5               磁盘6            磁盘7             磁盘8

    查询语句如下:

    select * from student where id = 38;

    查询过程如下:

    • 第一次磁盘 IO:从根节点检索,将数据块1加载到内存,比较38 < 44,走左边 P1;
    • 第二次磁盘 IO:将左边数据块2加载到内存,比较8<37<38,走右边 P3;
    • 第三次磁盘 IO:将右边数据块6加载到内存,比较37<38 38=38,查询完毕。
  • 普通单列索引:对某一列创建普通索引,B+树普通索引叶节点不存储数据,只存储数据的主键值。例如对普通列age创建索引, 下面是普通但列索引的样例:

    其中 Px 表示指向下一个磁盘块的指针;数字表示 age 值;叶子节点最后一行表示存储的主键值。
                                        磁盘1
                                        +----+----+----+-----+
                                        | P1 | 54 | P2 | 100 |
                                        +----+----+----+-----+
                                           |         |
                      +--------------------+         +----------------------+
           磁盘2      v                                           磁盘3      v
           +----+----+----+----+----+                             +----+----+----+
           | P1 | 28 | P2 | 47 | P3 |                             | P1 | 78 | P2 |
           +----+----+----+----+----+                             +----+----+----+
             |        |          |                                 |          |
          +--+        +-------+   +---------------+                |          +--------+
          v                   v                   v                v                   v
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  22 |  26 | <---> |  28 |  38 | <---> |  47 |  48 | <---> |  54 | <---> |  78 |  88 |  95 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  2  |  6  |       |  8  |  18 |       |  37 |  38 |       |  44 |       |  65 |  78 |  85 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
       磁盘4                磁盘5               磁盘6            磁盘7             磁盘8

    查询语句如下:

    select * from student where age = 48;

    使用普通索引需要查询两次索引,第一次查询普通age列索引,找到age=48对应的主键值 38;第二次根据主键值 38 查询主键索引, 获取实际的一行数据,查询主键索引过程称为回表。

    查询过程如下:

    • 第一次磁盘 IO:从age列索引的根节点开始检索,将磁盘块1加载到内存,比较48<54,走左边P1指向的磁盘块;
    • 第二次磁盘 IO:将磁盘块2加载到内存,比较28<47<48,走右边P3指向的磁盘块;
    • 第三次磁盘 IO:将磁盘块6加载到内存,找到age=48对应的主键值38
    • 第四次磁盘 IO:从主键索引根节点检索,将数据块1加载到内存,比较38 < 44,走左边P1
    • 第五次磁盘 IO:将左边数据块2加载到内存,比较8<37<38,走右边P3
    • 第六次磁盘 IO:将右边数据块6加载到内存,比较37<38 38=38,查询完毕。
  • 联合索引:对多个普通列创建普通索引,B+树普通索引叶节点不存储数据,只存储数据的主键值。例如对普通列nameage设置索引, 索引样例如下:

    其中 Px 表示指向下一个磁盘块的指针;字母和数字表示 name 和 age;叶子节点最后一行表示存储的主键值。
                                        磁盘1
                                        +----+----+----+
                                        |    | g  |    |
                                        | P1 | 54 | P2 |
                                        +----+----+----+
                                           |         |
                      +--------------------+         +----------------------+
           磁盘2      v                                           磁盘3      v
           +----+---+----+----+----+                             +----+----+----+
           |    | c |    | e  |    |                             |    | h  |    |
           | P1 | 28| P2 | 47 | P3 |                             | P1 | 78 | P2 |
           +----+---+----+----+----+                             +----+----+----+
             |        |          |                                 |          |
          +--+        +-------+   +---------------+                |          +--------+
          v                   v                   v                v                   v
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  a  |  b  |       |  c  |  d  |       |  e  |  f  |       |  g  |       |  h  |  i  |  j  |
    |  22 |  26 |       |  28 |  38 |       |  47 |  48 |       |  54 |       |  78 |  88 |  95 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
    |  2  |  6  | <---> |  8  |  18 | <---> |  37 |  38 | <---> |  44 | <---> |  65 |  78 |  85 |
    +-----+-----+       +-----+-----+       +-----+-----+       +-----+       +-----+-----+-----+
       磁盘4                磁盘5               磁盘6            磁盘7             磁盘8

    查询语句如下:

    select * from student where name = 'f' and age = 48;

    查询过程如下:

    • 第一次磁盘 IO:从name age两列索引的根节点开始检索,将磁盘块1加载到内存,比较f<g,走左边P1指向的磁盘块;
    • 第二次磁盘 IO:将磁盘块2加载到内存,比较c<e<f,走右边P3指向的磁盘块;
    • 第三次磁盘 IO:将磁盘块6加载到内存,找到name=f,同时判断age=48,如果满足,则获取主键值38
    • 第四次磁盘 IO:从主键索引根节点检索,将数据块1加载到内存,比较38 < 44,走左边P1
    • 第五次磁盘 IO:将左边数据块2加载到内存,比较8<37<38,走右边P3
    • 第六次磁盘 IO:将右边数据块6加载到内存,比较37<38 38=38,查询完毕。

原则或优化

  • 最左匹配:在组合索引中,首先按创建组合索引最左边列排序。例如上面name age组合索引,会先以name排序,在name相同的情况下, 以age排序,也就是说name是全局有序的,age是局部有序,全局无序。查询的时候,B+树以name确定搜索方向,如果没有name列, 则B+树无法确定搜索方向,这就是最左匹配原则;

    创建组合索引idx_name_age (name, age),逻辑上可以认为创建列idx_name (name)idx_name_age (name, age)两个索引。

    使用组合索引查询时,mysql会一直向右匹配直至遇到范围查询(>、<、between、like)就停止匹配。

  • 覆盖索引:对于普通索引查询,查询结果所需要的数据只在主键索引上有,所以不得不回表。但如果查询字段在普通索引的叶节点有, 则直接返回,不需要回表,这种叫覆盖索引。例如:

    select age from student where age = 48;

    age字段就是索引的key

执行计划

执行计划样例如下:

mysql> explain select * from task_info where model_num < 200 and cube_num=100;
+----+-------------+-----------+------------+------+----------------------+------+---------+------+---------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys        | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------+------------+------+----------------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | task_info | NULL       | ALL  | index_model_cube_num | NULL | NULL    | NULL | 9435886 |     5.00 | Using where |
+----+-------------+-----------+------------+------+----------------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

主要字段含义说明如下:

  • id:一组数字,表示一个查询中各个子查询执行顺序。如果id相同,则从上往下执行,如果id不同,值越大,越先被执行;
  • select_type:每个子查询的类型,取值如下:
    • SIMPLE:不包含任何子查询或union等查询;
    • PRIMARY:包含子查询最外层查询就显示为PRIMARY
    • SUBQUERY:在selectwhere字句中包含的查询;
    • DERIVEDfrom字句中包含的查询;
    • UNION:出现在union后的查询语句中;
    • UNION RESULT:从UNION中获取结果集;
  • type:查询类型,部分取值如下:
    • ALL:扫描全表数据;
    • index:遍历索引;
    • range:索引范围查找;
    • index_subquery:子查询中使用ref
    • unique_subquery:子查询中使用eq_ref
    • fulltext:使用全文索引;
    • ref:使用非唯一索引查询;
    • const:使用主键索引或唯一索引查询,且匹配结果只有一条;
    • systemconst的特例;
  • possible_keys:可能使用的索引,但不一定使用,当改列为NULL,需要考虑SQL优化;
  • key:实际使用的索引;
  • Extra:附加信息,取值如下:
    • Using index:使用了覆盖索引,不用回表;
    • Using where:使用where子句对结果集进行筛选;
    • Using temporary:需要创建临时表来存储查询的结果,常见于ORDER BYGROUP BY
    • Using filesort:表示MySQL使用了文件排序算法来排序结果集,而不是通过索引顺序快速获取结果。 这通常是因为没有合适的索引或者查询需要进行排序操作;
    • Using index condition:查询优化器选择使用了索引条件下推这个特性;

测试环境准备

创建测试表task_info

mysql> create table task_info (id bigint(20) unsigned not null auto_increment, task_id varchar(32) not null default "", usetime int(4), source char(8), model_num int(4), cube_num int(4), primary key(id)) engine=InnoDB CHARSET=utf8mb4;

其中列id是自增的primary key。对列task_idusetime添加单列索引:

mysql> alter table task_info add index index_task_id(task_id);

mysql> alter table task_info add index index_usetime(usetime);

对列model_numcube_num添加联合索引:

mysql> alter table task_info add index index_model_cube_num(model_num, cube_num);

测试表task_info的结构如下:

mysql> show columns from task_info;
+-----------+-----------------+------+-----+---------+----------------+
| Field     | Type            | Null | Key | Default | Extra          |
+-----------+-----------------+------+-----+---------+----------------+
| id        | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| task_id   | varchar(32)     | NO   | MUL |         |                |
| usetime   | int             | YES  | MUL | NULL    |                |
| source    | char(8)         | YES  |     | NULL    |                |
| model_num | int             | YES  | MUL | NULL    |                |
| cube_num  | int             | YES  |     | NULL    |                |
+-----------+-----------------+------+-----+---------+----------------+

索引的信息如下:

mysql> show index from task_info;
+-----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table     | Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| task_info |          0 | PRIMARY              |            1 | id          | A         |           0 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| task_info |          1 | index_task_id        |            1 | task_id     | A         |           0 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| task_info |          1 | index_usetime        |            1 | usetime     | A         |           0 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| task_info |          1 | index_model_cube_num |            1 | model_num   | A         |           0 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| task_info |          1 | index_model_cube_num |            2 | cube_num    | A         |           0 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

插入一千万数据到测试表,批量插入脚本如下:

import pymysql
import random
import uuid


sources = ["1", "2", "3", "4", "5" ,"6"]

def get_one_task_info():
    task_id = uuid.uuid1().hex
    usetime = random.randint(1, 3000)
    source  = sources[random.randint(0, 5)]
    model_num = random.randint(20, 400)
    cube_num = random.randint(1, 200)
    return task_id, usetime, source, model_num, cube_num

def main():
    db = pymysql.connect(host="localhost", port=3306, user="root", db="testdb", password="xxxx")
    cursor = db.cursor()
    sql = "INSERT INTO task_info(task_id, usetime, source, model_num, cube_num) VALUES (%s, %s, %s, %s, %s)"
    count = 8000000
    V = list()
    for _ in range(count):
        V.append(get_one_task_info())
    cursor.executemany(sql, tuple(V))
    db.commit()
    cursor.close()
    db.close()


if __name__ == "__main__":
    main()

执行完,查看表的总数目如下:

mysql> select count(1) from task_info;
+----------+
| count(1) |
+----------+
| 10000001 |
+----------+
1 row in set (1.54 sec)

最终表中id列是主键索引,task_idusetime两列是两个普通索引,source没有添加索引,model_numcube_num是一个联合索引。

测试结果

普通索引 vs 没有索引

使用有普通索引的列usetime查询,执行计划如下:

mysql> explain select SQL_NO_CACHE count(1) from task_info where usetime=100;
+----+-------------+-----------+------------+------+---------------+---------------+---------+-------+------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key           | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+---------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | task_info | NULL       | ref  | index_usetime | index_usetime | 5       | const | 3296 |   100.00 | Using index |
+----+-------------+-----------+------------+------+---------------+---------------+---------+-------+------+----------+-------------+
1 row in set, 2 warnings (0.00 sec)

查询使用索引index_usetime,查询类型为ref。查询结果及耗时如下:

mysql> select SQL_NO_CACHE count(1) from task_info where usetime=100;
+----------+
| count(1) |
+----------+
|     3296 |
+----------+
1 row in set, 1 warning (0.00 sec)

不使用索引的列source查询,执行计划如下:

mysql> explain select SQL_NO_CACHE count(1) from task_info where source=3;
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | task_info | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 9435886 |    10.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 2 warnings (0.00 sec)

查询不使用索引,查询类型为ALL,查询结果及耗时如下:

mysql> select SQL_NO_CACHE count(1) from task_info where source=3;
+----------+
| count(1) |
+----------+
|  1666975 |
+----------+
1 row in set, 1 warning (5.06 sec)

结论:使用索引可以有效提高查询效率。

联合索引

联合索引需要满足最左匹配原则,例如对于联合索引index_model_cube_num (model_num, cube_num),查询执行计划如下:

mysql> explain select SQL_NO_CACHE * from task_info where model_num=184 and cube_num=141;
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------------+------+----------+-------+
| id | select_type | table     | partitions | type | possible_keys        | key                  | key_len | ref         | rows | filtered | Extra |
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------------+------+----------+-------+
|  1 | SIMPLE      | task_info | NULL       | ref  | index_model_cube_num | index_model_cube_num | 10      | const,const |  148 |   100.00 | NULL  |
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------------+------+----------+-------+
1 row in set, 2 warnings (0.00 sec)

查询条件为model_num and cube_num时,使用联合索引,查询类型为ref。查询条件为model_num的执行计划如下:

mysql> explain select SQL_NO_CACHE count(*) from task_info where model_num=184;
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------+-------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys        | key                  | key_len | ref   | rows  | filtered | Extra       |
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------+-------+----------+-------------+
|  1 | SIMPLE      | task_info | NULL       | ref  | index_model_cube_num | index_model_cube_num | 5       | const | 57264 |   100.00 | Using index |
+----+-------------+-----------+------------+------+----------------------+----------------------+---------+-------+-------+----------+-------------+
1 row in set, 2 warnings (0.01 sec)

查询条件model_num时,使用联合索引,查询类型为ref。当查询条件为cube_num,执行计划如下:

mysql> explain select SQL_NO_CACHE count(*) from task_info where cube_num=141;
+----+-------------+-----------+------------+-------+----------------------+----------------------+---------+------+---------+----------+----------------------------------------+
| id | select_type | table     | partitions | type  | possible_keys        | key                  | key_len | ref  | rows    | filtered | Extra                                  |
+----+-------------+-----------+------------+-------+----------------------+----------------------+---------+------+---------+----------+----------------------------------------+
|  1 | SIMPLE      | task_info | NULL       | range | index_model_cube_num | index_model_cube_num | 10      | NULL | 2358971 |   100.00 | Using where; Using index for skip scan |
+----+-------------+-----------+------------+-------+----------------------+----------------------+---------+------+---------+----------+----------------------------------------+
1 row in set, 2 warnings (0.00 sec)

查询条件为cube_num时,在新版的MySQL中使用联合索引,但查询类型为rangeExtra包含Using index for skip scan, 这是新版MySQL优化;对于老版本的MySQL,则不会使用联合索引,查询类型为ALL。查询效率对比如下:

# 使用 model_num 查询,满足最左匹配
mysql> select SQL_NO_CACHE count(*) from task_info where model_num=184;
+----------+
| count(*) |
+----------+
|    26432 |
+----------+
1 row in set, 1 warning (0.02 sec)

# 使用 cube_num 查询,不满足最左匹配
mysql> select SQL_NO_CACHE count(*) from task_info where cube_num=141;
+----------+
| count(*) |
+----------+
|    49772 |
+----------+
1 row in set, 1 warning (0.42 sec)

索引范围查找

对于索引范围查找,MySQL优化器会自动根据查询范围大小选择使用索引还是不使用索引。例如:

# 范围小,使用索引
mysql> explain select SQL_NO_CACHE * from task_info where usetime>4000;
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
| id | select_type | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows | filtered | Extra                 |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
|  1 | SIMPLE      | task_info | NULL       | range | index_usetime | index_usetime | 5       | NULL |    1 |   100.00 | Using index condition |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+-----------------------+
1 row in set, 2 warnings (0.00 sec)

# 范围大,不使用索引
mysql> explain select SQL_NO_CACHE * from task_info where usetime>1000;
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | task_info | NULL       | ALL  | index_usetime | NULL | NULL    | NULL | 9435886 |    50.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 2 warnings (0.01 sec)