Monday, October 24, 2016

MySQL fulltext indexing

MySQL fulltext indexing

Introduction

disclaimer: The introduction part is not my writing, but got it from here . It is well explained that i thought i should use there explanation to share my study and test. So please respect/acknowledge copyright. 

When you have to search records on a MySQL table that contains a cartain word or phrase, you use the LIKE statement to filter your search. However, when the text column is large and the number of rows in a table is increased, using these methods has some limitations [1]:
  •  Performance: MySQL has to scan the whole table to find the exact text based on a pattern in the LIKE  statement or pattern in the regular expressions.
  • Flexible search: with the LIKE statement and regular expression search, it is difficult to have a flexible search query e.g., to find product whose description contains car  but not classic.
  • Relevance ranking: there is no way to specify which row in the result set is more relevant.
Because of these limitations, MySQL extended a very nice feature so-called full-text search. Technically, MySQL creates an index from the words of the enabled full-text search column and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.
Listed below are some important features of MySQL full-text search [1]:

  • Native SQL-like interface: you use the SQL-like statement to use the full-text search.
  • Fully dynamic index: MySQL automatically updates the index of text column whenever the data of that column changes.
  • Moderate index size: it doesn’t take much memory to store the index.
  • Last but not least, it is fast to search based on complex search queries.
Notice that not all storage engines support the full-text search feature. In MySQL version 5.6 or later, only MyISAM and InnoDB storage engines support full-text search.

Diving into the action

Lets start by examing the table we are going to be using today for testing and for our fulltext index study. I will not explain everything below as some of the commands are self explanatory and hope you can follow through with ease on what i am trying to achieve:

First, lets check the file structure and size of files in the test database, focusing on the product table as the test in this blog will done on it. Later, we will compare this file structure with one after creating a flutist index on one of the fields on the product table.

george.chilumbu@nangang-vm-db-pxc-test-1:~$ sudo ls -lh /srv/mysql/test
total 557M
-rw-rw---- 1 mysql mysql   61 Oct 12 11:30 db.opt
-rw-rw---- 1 mysql mysql 8.5K Oct 24 16:13 product.frm
-rw-rw---- 1 mysql mysql  96K Oct 24 16:13 product.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 15:38 sbtest1.frm
-rw-rw---- 1 mysql mysql 556M Oct 12 15:41 sbtest1.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 14:18 testr.frm
-rw-rw---- 1 mysql mysql 112K Oct 12 14:18 testr.ibd
george.chilumbu@nangang-vm-db-pxc-test-1:~$


mysql> show create table product\G
*************************** 1. row ***************************
       Table: product
Create Table: CREATE TABLE `product` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(100) NOT NULL,
  `isbn` varchar(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `btree__name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=45 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql>

mysql> select name from product;
+--------------------------------------------------------------------+
| name                                                                                   |
+--------------------------------------------------------------------+
| Samsung smart TV HD with Wifi 55 inches screen curve |
| Sharp smart TV HD with Wifi 55 inches screen curve      |
| BenQ smart TV HD with Wifi 55 inches screen curve      |
| Sharp easy TV HD with Wifi 55 inches screen curve        |
| Sharp eden TV HD with Wifi 55 inches screen curve        |
| Sharp edge TV HD with Wifi 55 inches screen curve        |
| Sony edge TV HD with Wifi 55 inches screen curve         |
| Panasonic edge TV HD with Wifi 55 inches screen curve |
| Fiji edge TV HD with Wifi 55 inches screen curve            |
| LG edge TV HD with Wifi 55 inches screen curve            |
| Lenovo edge TV HD with Wifi 55 inches screen curve     |
| Toshiba edge TV HD with Wifi 55 inches screen curve     |
| iview edge TV HD with Wifi 55 inches screen curve        |
| Reca edge TV HD with Wifi 55 inches screen curve         |
| Insimia edge TV HD with Wifi 55 inches screen curve     |
+--------------------------------------------------------------------+
15 rows in set (0.00 sec)

mysql>

As seen below, the index data lengh is zero with B-Tree index.

mysql> show table status like 'product'\G
*************************** 1. row ***************************
           Name: product
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 15
 Avg_row_length: 1092
    Data_length: 16384
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: 45
    Create_time: 2016-10-24 16:13:09
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql>
mysql> alter table product drop index btree__name;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql>
mysql> alter table product add fulltext fulltext__name(name);
Query OK, 0 rows affected, 1 warning (0.23 sec)
Records: 0  Duplicates: 0  Warnings: 1

NOTE: In production, never use the alter table statement directly as it places a lock on the table. Large tables can take hours when you alter them, meaning no writes during the process. Instead, use the online-pt-schema-change from the percona toolkit.

mysql> show create table product\G
*************************** 1. row ***************************
       Table: product
Create Table: CREATE TABLE `product` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(100) NOT NULL,
  `isbn` varchar(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `fulltext__name` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=45 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql>
mysql> show table status like 'product'\G
*************************** 1. row ***************************
           Name: product
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 15
 Avg_row_length: 1092
    Data_length: 16384
Max_data_length: 0
   Index_length: 16384
      Data_free: 0
 Auto_increment: 45
    Create_time: 2016-10-24 16:15:45
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql>

As you can see above, the index length has now increased, utilizing the same amount of disk space as the original data. In other words, disk usage has doubled following creation of the fulltext index.

Now lets try to search for a record containing the word 'BenQ' using the where clause as we would with B-tree indexes.

mysql> select * from product where name like 'BenQ%'\G
*************************** 1. row ***************************
  id: 8
name: BenQ smart TV HD with Wifi 55 inches screen curve
isbn: george.valdpolo@nangang-vm-db-pxc-test-3:~$
1 row in set (0.00 sec)

mysql>
mysql> explain extended select * from product where name like 'BenQ%'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: product
         type: ALL
possible_keys: fulltext__name
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 15
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql>

As you can see above, the fulltext index does not use the where cluase to search a keyword like the Btree index does. Instead, fulltext uses the MATCH()  and AGAINST() functions to seach a keyword as demonstrated below. By default, fulltext uses natural language search where it looks for records that are relevant to the free-text natural human language query.

Using our product table, lets try to search for products whose names contains 'BenQ'

mysql> select * from product where match(name) against('BenQ')\G
*************************** 1. row ***************************
  id: 8
name: BenQ smart TV HD with Wifi 55 inches screen curve
isbn: george.valdpolo@nangang-vm-db-pxc-test-3:~$
1 row in set (0.00 sec)

mysql>

mysql> explain extended select * from product where match(name) against('BenQ');
mysql> explain extended select * from product where match(name) against('BenQ')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: product
         type: fulltext
possible_keys: fulltext__name
          key: fulltext__name
      key_len: 0
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)


Sort the result set by relevance


A very important feature of full-text search is how MySQL ranks the rows in the result set based on their relevance. When the MATCH() function is used in the WHERE clause, MySQL returns the rows that are more relevant first.

Lets try to search for products whose names contains 'Sharp' and 'edge'

mysql> select name from product where match(name) against('Sharp, edge');
+---------------------------------------------------------------------+
| name                                                                                    |
+---------------------------------------------------------------------+
| Sharp edge TV HD with Wifi 55 inches screen curve        |
| Sharp smart TV HD with Wifi 55 inches screen curve       |
| Sharp easy TV HD with Wifi 55 inches screen curve         |
| Sharp eden TV HD with Wifi 55 inches screen curve        |
| Sony edge TV HD with Wifi 55 inches screen curve         |
| Panasonic edge TV HD with Wifi 55 inches screen curve |
| Fiji edge TV HD with Wifi 55 inches screen curve            |
| LG edge TV HD with Wifi 55 inches screen curve            |
| Lenovo edge TV HD with Wifi 55 inches screen curve     |
| Toshiba edge TV HD with Wifi 55 inches screen curve     |
| iview edge TV HD with Wifi 55 inches screen curve         |
| Reca edge TV HD with Wifi 55 inches screen curve          |
| Insimia edge TV HD with Wifi 55 inches screen curve      |
+---------------------------------------------------------------------+
13 rows in set (0.00 sec)

mysql>

mysql>  explain extended select * from product where match(name) against('Sharp, edge')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: product
         type: fulltext
possible_keys: fulltext__name
          key: fulltext__name
      key_len: 0
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql>

As you can see from the result set, MySQl first searches for products which contains sharp and edge first, then finnaly all products containing edge. This shows how fulltext indexes sort by relevance.


Word length

The minimum length of the keyword to be searched using full-text is 4. Therefore, if your keyword contains characters less than 4, the query will not return any results as demonstrated below:

mysql> select name from product where match(name) against('Sharp, edg');
+----------------------------------------------------+
| name                                               |
+----------------------------------------------------+
| Sharp smart TV HD with Wifi 55 inches screen curve |
| Sharp easy TV HD with Wifi 55 inches screen curve  |
| Sharp eden TV HD with Wifi 55 inches screen curve  |
| Sharp edge TV HD with Wifi 55 inches screen curve  |
+----------------------------------------------------+
4 rows in set (0.00 sec)

mysql>

The above query only return results on 'Sharp' since it satisfied the requirement of characters more than 4. But the query below does not return anything.

mysql> select * from product where match(name) against('Sha, edg');
Empty set (0.00 sec)

mysql>

Comparing the file structure after creating a flutist index

Now, i would like to go back to talk about the file structure when it comes to fulltext. Lets compare the file stucture of the test database before (when it used B-tree index) and after the fulltext indexing:

Using B-Tree indexing

george.chilumbu@nangang-vm-db-pxc-test-1:~$ sudo ls -lh /srv/mysql/test
total 557M
-rw-rw---- 1 mysql mysql   61 Oct 12 11:30 db.opt
-rw-rw---- 1 mysql mysql 8.5K Oct 24 16:13 product.frm
-rw-rw---- 1 mysql mysql  96K Oct 24 16:13 product.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 15:38 sbtest1.frm
-rw-rw---- 1 mysql mysql 556M Oct 12 15:41 sbtest1.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 14:18 testr.frm
-rw-rw---- 1 mysql mysql 112K Oct 12 14:18 testr.ibd
george.chilumbu@nangang-vm-db-pxc-test-1:~$

Using fulltext indexing

george.chilumbu@nangang-vm-db-pxc-test-1:~$ sudo ls -lh /srv/mysql/test/
total 558M
-rw-rw---- 1 mysql mysql   61 Oct 12 11:30 db.opt
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_1.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_2.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_3.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_4.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_5.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_0000000000000122_INDEX_6.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_BEING_DELETED_CACHE.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_BEING_DELETED.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_CONFIG.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_DELETED_CACHE.ibd
-rw-rw---- 1 mysql mysql  96K Oct 24 16:15 FTS_0000000000000099_DELETED.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 24 16:15 product.frm
-rw-rw---- 1 mysql mysql 112K Oct 24 16:15 product.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 15:38 sbtest1.frm
-rw-rw---- 1 mysql mysql 556M Oct 12 15:41 sbtest1.ibd
-rw-rw---- 1 mysql mysql 8.5K Oct 12 14:18 testr.frm
-rw-rw---- 1 mysql mysql 112K Oct 12 14:18 testr.ibd
george.chilumbu@nangang-vm-db-pxc-test-1:~$

As we can see, fulltext add extra files and uses more disk space than Btree. This is something to be aware of especially when the table is super large.

Besides the natural language search, fulltext also supports an additional form of indexing search that is called Boolean full-text search. But i will not discuss that here.

REFERENCES
[1] http://www.mysqltutorial.org/introduction-to-mysql-full-text-search.aspx

No comments:

Post a Comment