Hudzilla.org - the homepage of Paul Hudson
Contents > Databases > SQL Wish List | Report Bug | About Me ]

9.3.18     Advanced text searching using full-text indexes

This is NOT the latest copy of this book; click here for the latest version.

Creating an index of large text fields is not practical in its default state - text fields have certain characteristics that warrant their own form of indexing. For example, many simple words ("at", "the", etc) are unimportant, and therefore aren't stored in FULLTEXT indexes. Furthermore, MySQL has an infamous "50% rule", whereby it will not bother indexing words that appear in the more than 50% of records simply because it will not be of use.

Creating a full text index is done after creating your table, using the following command: ALTER TABLE some_table ADD FULLTEXT (some_field). Here is an example:

ALTER TABLE usertable ADD FULLTEXT(FirstName);

Once we have a FULLTEXT index on our table, it opens up a whole new world of pattern matching, which, as of MySQL 4, allows us to do boolean matching with little fuss. FULLTEXT queries look a little more complicated that normal SELECT queries, but are much more powerful - they can even be used to return "match quality", as you will soon see. To properly demonstrate FULLTEXT indexes, we need a table with lots of text in:

CREATE TABLE opinions (Opinion CHAR(100));
INSERT INTO opinions VALUES ('MySQL is a very fast database');
INSERT INTO opinions VALUES ('Green is everyone\'s favourite colour');
INSERT INTO opinions VALUES ('Databases are helpful for storing data');
INSERT INTO opinions VALUES ('PHP is a very nice language');
INSERT INTO opinions VALUES ('Spain is a nice country to visit');
INSERT INTO opinions VALUES ('Perl isn\'t as nice a language as PHP');
INSERT INTO opinions VALUES ('This is a blank row to avoid the 50% rule');
ALTER TABLE opinions ADD FULLTEXT (Opinion);

That gives us a very basic table with a FULLTEXT index on the opinion text. From there, we can jump in with a very basic SELECT query to match all rows with "nice" in:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice');

That query generates the following output:

+---------------------------------------+
| Opinion                               |
+---------------------------------------+
| PHP is a very nice language           |
| Spain is a nice country to visit      |
| Perl is not as nice a language as PHP |
+---------------------------------------+

As you can see, the query has indeed matched the three rows that contain the string "nice". One important thing to note is that the minimum size of word that MySQL will index is, by default, four characters - anything smaller than that is rarely worthwhile.

Now consider this second query:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice language');

What do you think it will output? If you thought it would output just one row, you would be wrong - MySQL uses OR by default when matching words, meaning it will return any row that matches "nice" or matches "language" - this is where boolean mode searches come in.

Boolean mode searching has long been popular with Internet search engines - they allow you to proceed words with a + or a - to force it to either be present (+) or not present (-). You can switch to boolean mode searching using FULLTEXT indexes by adding "IN BOOLEAN MODE" to the query, like this:

SELECT * FROM opinions WHERE MATCH(Opinion) AGAINST ('nice -language' IN BOOLEAN MODE);

Note that the "IN BOOLEAN MODE" modifier comes inside the AGAINST brackets. This time the query returns just one result - "Spain is a nice country to visit". This is because "-language" means that MySQL will not return any rows that match "language", even if they match "nice". Searching in MySQL has a number of other ways it can be used - for example, putting double quotes around groups of words allow phrase searching. That is, to match precisely "nice language", we would use the following query:

SELECT * FROM opinions WHERE Match(Opinion) AGAINST ('"nice language"' IN BOOLEAN MODE);

That matches rows that have "nice language" just like that - no words in between, not one or the other. As such, only "PHP is a very nice language" is matched.

There are other boolean operators, although they get more complicated. For example, > and < mark that the following word has a higher or lower relevance respectively than other words. The tilde symbol, ~, means that the following word should contribute negatively to the relevance of the row - this is not the same as <, which marks a word as being less relevant but still relevant nonetheless, and -, which excludes a word altogether. The asterisk symbol, *, allows wildcard matching. Finally, you can use brackets, ( and ), to group words into subexpressions.

So, all of these examples are possible boolean queries:

nice language

Match either nice, language, or both

+nice +language

Match both nice and language

+nice -language

Match nice but not language

+nice ~language

Match nice, but mark down as less relevant rows that contain language

+nice*

Match nice, nicely, nicety, nice language, etc

"nice language"

Match the exact term "nice language"

+nice +(language country)

Match either "nice language" or "nice country"

+nice +(>language <country)

Match either "nice language" or "nice country", with rows matching "nice language" being considered more relevant

Before we take a look at how these effect queries in practical use, I'd first like to introduce the topic of relevance. Result relevant, often known as match quality or score, is a number returned by MySQL that allows you to rank how much of a query a row matched. To get relevance from a query, simply copy the full-text WHERE clause into the fields you want to select, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('nice language' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('nice language' IN BOOLEAN MODE);

Do not worry about using MATCH() twice - the two matches are identical, and the MySQL optimizer will pick up on this and only execute it once. However, it does give us a new field, Score, to play with. So, the query above says "Select all opinions and their relevance where the row has either "nice" or "language". Here is what the above query outputs:

+---------------------------------------+-------+
| Opinion                               | Score |
+---------------------------------------+-------+
| PHP is a very nice language           |     2 |
| Spain is a nice country to visit      |     1 |
| Perl is not as nice a language as PHP |     2 |
+---------------------------------------+-------+

Naturally it would be best to sort by Score descending to get the matches in order of most relevant first, however note that the Score of the two fields with "nice" and language in is 2, whereas the field with just "nice" in is 1. Now we get to play with the extended boolean mode operators - for example, if we wanted to match "nice language" or "nice country", we could use this query:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(language country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(language country)' IN BOOLEAN MODE);

The pluses are in there to make sure that both nice and either language or country are included in the result. This time Score is 2 for all three fields, as all three match nice and all three have either language or country. We can give more relevancy information to MySQL by saying that country is more important to match than language by using >, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(language >country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(language >country)' IN BOOLEAN MODE);

This time we get 1 for the two "nice language" matches and 1.25 for "nice country" - MySQL has increased the score for "country" by 0.25. We can push this further by decreasing the score for "language" at the same time, like this:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice +(<language >country)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice +(<language >country)' IN BOOLEAN MODE);

The output for that query is shown below - as you can see, the nice language matches have moved down another notch.

+---------------------------------------+------------------+
| Opinion                               |            Score |
+---------------------------------------+------------------+
| PHP is a very nice language           | 0.83333337306976 |
| Spain is a nice country to visit      |             1.25 |
| Perl is not as nice a language as PHP | 0.83333337306976 |
+---------------------------------------+------------------+

The exact numbers in the score are usually irrelevant - it is the ranking that counts. Therefore, we could have achieved the same effect by using the negation operator, ~, which would subtract points from the word language if we used this query:

SELECT Opinion, MATCH(Opinion) AGAINST('+nice ~language)' IN BOOLEAN MODE) AS Score FROM opinions WHERE MATCH(Opinion) AGAINST ('+nice ~language)' IN BOOLEAN MODE);

This time MySQL will match all rows that contain nice or language, but will "penalise" rows that contain language. This is helpful for marking known irrelevant words without specifically removing rows that contain them.





<< 9.3.17 Simple text searching using LIKE   9.3.19 Range matching: between() and in() >>
Table of Contents
Want to see this stuff in print? PHP in a Nutshell takes the core topics covered here, adds in thousands of edits from the editorial team and myself, and combines them to make an unbeatable reference for PHP programmers at all levels.



My latest book has hundreds more tips on how to use PHP, Apache, and MySQL, plus Perl, Python, shell scripts, performance tuning, and more!



Top-right shadow
 
Bottom-left shadow Bottom shadow

Comments from other readers
A PHP User - 06 Sep 2008

Piotr:

Looks like the serial number repeats itself every four records - I am assuming the serial number is tied to the sensor in question. So, try something like this:

select * from digitemp where SerialNumber='desiredSensorSerialHere' order by time asc limit 288

Piotr - 06 Sep 2008

How to extract only the relevant rows from the database and use them with WHERE statement? My database looks like this:

| dtkey | time | SerialNumber | Celsius |
+-------+---------------------+------------------+---------+
| 1 | 2007-09-24 16:18:33 | 10781D4C0108007D | 20.62 |
| 2 | 2007-09-24 16:18:33 | 100470EE00080060 | 21.06 |
| 3 | 2007-09-24 16:18:33 | 1043AE4B010800EF | 21.06 |
| 4 | 2007-09-24 16:18:33 | 10A7EAED000800E7 | 20.88 |
| 5 | 2007-09-24 16:25:06 | 10781D4C0108007D | 20.56 |
| 6 | 2007-09-24 16:25:06 | 100470EE00080060 | 21.12 |
| 7 | 2007-09-24 16:25:06 | 1043AE4B010800EF | 21.06 |
| 8 | 2007-09-24 16:25:06 | 10A7EAED000800E7 | 20.88 |

Every 5 min the temperature of 4 sensors is writen... I need to show the data of the only one sensor... over the last 24 hours... and do so in ascending order...

If there would be only one sensor, this would work:

select * from (select * from digitemp order by time DESC limit 288) tmp WHERE dtkey = '1' or dtkey % 12 = 0 order by time asc;

Please, help

Piotr - 06 Sep 2008

How to extract only the relevant rows from the database and use them with WHERE statement? My database looks like this:

| dtkey | time | SerialNumber | Celsius |
+-------+---------------------+------------------+---------+
| 1 | 2007-09-24 16:18:33 | 10781D4C0108007D | 20.62 |
| 2 | 2007-09-24 16:18:33 | 100470EE00080060 | 21.06 |
| 3 | 2007-09-24 16:18:33 | 1043AE4B010800EF | 21.06 |
| 4 | 2007-09-24 16:18:33 | 10A7EAED000800E7 | 20.88 |
| 5 | 2007-09-24 16:25:06 | 10781D4C0108007D | 20.56 |
| 6 | 2007-09-24 16:25:06 | 100470EE00080060 | 21.12 |
| 7 | 2007-09-24 16:25:06 | 1043AE4B010800EF | 21.06 |
| 8 | 2007-09-24 16:25:06 | 10A7EAED000800E7 | 20.88 |

Every 5 min the temperature of 4 sensors is writen... I need to show the data of the only one sensor... over the last 24 hours... and do so in ascending order...

If there would be only one sensor, this would work:

select * from (select * from digitemp order by time DESC limit 288) tmp WHERE dtkey = '1' or dtkey % 12 = 0 order by time asc;

Please, help

liez - 06 Sep 2008

how to use the boolean search in multiple tables? is it possible to join those tables? or any other possible options?

vivek_guptaglaitm81@rediffmail.com - 06 Sep 2008

hi.
i am working in php. my problem is that i want t insert long data through insert query. but only one page size of data are submitted to the data base. plz tell me how to insert long data in to data base.

Gus Jones - 06 Sep 2008

Thomas,

You are correct - less than 4 char string it will ignore.

But you can tweak that value!

See:

"fulltext-fine-tuning" in the MYSQL manual

It will also ignore common words see:

“Full-Text Stopwords” in the manual

Thomas - 06 Sep 2008

Why does this not search all words??

It seems that words with 3 or less letters are ignored?

rajni/rajinikanth_4u@yahoo.com - 06 Sep 2008

i want to use the search option in developing the site for
30 pages .so, the database mysql is enough and list some of the functions and API's for searching the indexes and the contents for the result pages
thanks in advance..

hussain/hussainak17(at)hotmail(dot)com - 06 Sep 2008

Yup change the table type to MyISAM!!!!

A PHP User - 06 Sep 2008

HELP! I get a message indicating that the table type does not support FULLTEXT indexing?!?!



Add comment
Please note that by posting a comment here you are committing it to the public domain. This is important so that others can make use of your code themselves, and also so that I can incorporate helpful notes directly into the main text. Comments are limited to 2000 characters in length.

If you are reporting an error in the content, please tell me directly.

Your name/email address:
Your comment:
 
Now, in order to verify that you're a real person, please answer this simple question: what is three plus three?
The answer is:
(please write in
numbers, eg 19)


Top-right shadow
 
Bottom-left shadow Bottom shadow