Advanced text searching using full-text indexes

Creating an index of large text fields is not practical in its default state - text fields have certain characteristics that warrant their own form of indexing. For example, many simple words ("at", "the", etc) are unimportant, and therefore aren't stored in FULLTEXT indexes. Furthermore, MySQL has an infamous "50% rule", whereby it will not bother indexing words that appear in the more than 50% of records simply because it will not be of use.

Creating a full text index is done after creating your table, using the following command: ALTER TABLE some_table ADD FULLTEXT (some_field). Here is an example:

ALTER TABLE usertable ADD FULLTEXT(FirstName);

Once we have a FULLTEXT index on our table, it opens up a whole new world of pattern matching, which, as of MySQL 4, allows us to do boolean matching with little fuss. FULLTEXT queries look a little more complicated that normal SELECT queries, but are much more powerful - they can even be used to return "match quality", as you will soon see. To properly demonstrate FULLTEXT indexes, we need a table with lots of text in:

CREATE TABLE opinions (Opinion CHAR(100));
INSERT INTO opinions VALUES ('MySQL is a very fast database');
INSERT INTO opinions VALUES ('Green is everyone\'s favourite colour');
INSERT INTO opinions VALUES ('Databases are helpful for storing data');
INSERT INTO opinions VALUES ('PHP is a very nice language');
INSERT INTO opinions VALUES ('Spain is a nice country to visit');
INSERT INTO opinions VALUES ('Perl isn\'t as nice a language as PHP');
INSERT INTO opinions VALUES ('This is a blank row to avoid the 50% rule');
ALTER TABLE opinions ADD FULLTEXT (Opinion);

That gives us a very basic table with a FULLTEXT index on the opinion text. From there, we can jump in with a very basic SELECT query to match all rows with "nice" in:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice');

That query generates the following output:

+---------------------------------------+
| Opinion                               |
+---------------------------------------+
| PHP is a very nice language           |
| Spain is a nice country to visit      |
| Perl is not as nice a language as PHP |
+---------------------------------------+

As you can see, the query has indeed matched the three rows that contain the string "nice". One important thing to note is that the minimum size of word that MySQL will index is, by default, four characters - anything smaller than that is rarely worthwhile.

Now consider this second query:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice language');

What do you think it will output? If you thought it would output just one row, you would be wrong - MySQL uses OR by default when matching words, meaning it will return any row that matches "nice" or matches "language" - this is where boolean mode searches come in.

Boolean mode searching has long been popular with internet search engines - they allow you to proceed words with a + or a - to force it to either be present (+) or not present (-). You can switch to boolean mode searching using FULLTEXT indexes by adding "IN BOOLEAN MODE" to the query, like this:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice -language' IN BOOLEAN MODE);

Note that the "IN BOOLEAN MODE" modifier comes inside the AGAINST brackets. This time the query returns just one result - "Spain is a nice country to visit". This is because "-language" means that MySQL will not return any rows that match "language", even if they match "nice". Searching in MySQL has a number of other ways it can be used - for example, putting double quotes around groups of words allow phrase searching. That is, to match precisely "nice language", we would use the following query:

SELECT * FROM opinions WHERE Match(Opinion) AGAINST ('"nice language"' IN BOOLEAN MODE);

That matches rows that have "nice language" just like that - no words in between, not one or the other. As such, only "PHP is a very nice language" is matched.

There are other boolean operators, although they get more complicated. For example, > and < mark that the following word has a higher or lower relevance respectively than other words. The tilde symbol, ~, means that the following word should contribute negatively to the relevance of the row - this is not the same as <, which marks a word as being less relevant but still relevant nonetheless, and -, which excludes a word altogether. The asterisk symbol, *, allows wildcard matching. Finally, you can use parentheses, ( and ), to group words into subexpressions.

So, all of these examples are possible boolean queries:

nice language

Match either nice, language, or both

+nice +language

Match both nice and language

+nice -language

Match nice but not language

+nice ~language

Match nice, but mark down as less relevant rows that contain language

+nice*

Match nice, nicely, nicety, nice language, etc

"nice language"

Match the exact term "nice language"

+nice +(language country)

Match either "nice language" or "nice country"

+nice +(>language <country)

Match either "nice language" or "nice country", with rows matching "nice language" being considered more relevant

Before we take a look at how these effect queries in practical use, I'd first like to introduce the topic of relevance. Result relevant, often known as match quality or score, is a number returned by MySQL that allows you to rank how much of a query a row matched. To get relevance from a query, simply copy the full-text WHERE clause into the fields you want to select, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('nice language' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('nice language' IN BOOLEAN MODE);

Do not worry about using MATCH() twice - the two matches are identical, and the MySQL optimizer will pick up on this and only execute it once. However, it does give us a new field, Score, to play with. So, the query above says "Select all opinions and their relevance where the row has either "nice" or "language". Here is what the above query outputs:

+---------------------------------------+-------+
| Opinion                               | Score |
+---------------------------------------+-------+
| PHP is a very nice language           |     2 |
| Spain is a nice country to visit      |     1 |
| Perl is not as nice a language as PHP |     2 |
+---------------------------------------+-------+

Naturally it would be best to sort by Score descending to get the matches in order of most relevant first, however note that the Score of the two fields with "nice" and language in is 2, whereas the field with just "nice" in is 1. Now we get to play with the extended boolean mode operators - for example, if we wanted to match "nice language" or "nice country", we could use this query:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(language country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(language country)' IN BOOLEAN MODE);

The pluses are in there to make sure that both nice and either language or country are included in the result. This time Score is 2 for all three fields, as all three match nice and all three have either language or country. We can give more relevancy information to MySQL by saying that country is more important to match than language by using >, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(language >country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(language >country)' IN BOOLEAN MODE);

This time we get 1 for the two "nice language" matches and 1.25 for "nice country" - MySQL has increased the score for "country" by 0.25. We can push this further by decreasing the score for "language" at the same time, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(<language >country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(<language >country)' IN BOOLEAN MODE);

The output for that query is shown below - as you can see, the nice language matches have moved down another notch.

+---------------------------------------+------------------+
| Opinion                               |            Score |
+---------------------------------------+------------------+
| PHP is a very nice language           | 0.83333337306976 |
| Spain is a nice country to visit      |             1.25 |
| Perl is not as nice a language as PHP | 0.83333337306976 |
+---------------------------------------+------------------+

The exact numbers in the score are usually irrelevant - it is the ranking that counts. Therefore, we could have achieved the same effect by using the negation operator, ~, which would subtract points from the word language if we used this query:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice ~language)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice ~language)' IN BOOLEAN MODE);

This time MySQL will match all rows that contain nice or language, but will "penalise" rows that contain language. This is helpful for marking known irrelevant words without specifically removing rows that contain them.

 

Want to learn PHP 7?

Hacking with PHP has been fully updated for PHP 7, and is now available as a downloadable PDF. Get over 1200 pages of hands-on PHP learning today!

If this was helpful, please take a moment to tell others about Hacking with PHP by tweeting about it!

Next chapter: Range matching >>

Previous chapter: Simple text searching using LIKE

Jump to:

 

Home: Table of Contents

Copyright ©2015 Paul Hudson. Follow me: @twostraws.