A short focus on text search

Posted 6 years ago.

brute force approach

Postgresql proposes several ways for searching text. Of course there is the good ol’ LIKE operator (note that Postgres also offers a ILIKE operator for case insensitive search)

Postgresql newcomers may not know it is possible to use regular expressions to search for text patterns using the ~ and ~* (case insensitive) operators:

  blog.post p
  p.content ~* '[[:<:]]postgres(ql)?[[:>:]]'

The example above looks for blog posts containing the words postgres or postgresql case insensitive.

Both LIKE and regular expression operators can cause performance problems as these kinds of queries may not be able use indexes so it means a full table scan every time. This is possible for small fields on a limited number of records but it is unacceptable if the table is huge and the text field is large. Furthermore, it is hard to match approaching words which is something often desired when looking for wide text fields.

To circumvent these problems, Postgres comes with a full text search engine. By default it supports 15 languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
  • Turkish

Of course, it is possible to install more dictionaries if needed. Let’s see how it works. When the text is cast as text search vecrtors (tsvector type), it is stripped from stop words and turned into searchable lexemes:

select to_tsvector(
    'This software does automatically enhance the colors, especially RGB components.'
'automat':4 'color':7 'compon':10 'enhanc':5 'especi':8 'rgb':9 'softwar':2
(1 row)

Searches in tsvectors can be made using the @@ operator and a ts_query type. If the above text is searched for the words enhanced colors, even though it does not contain the exact words, it still matches the criteria:

    'This software does automatically enhance the colors, especially RGB components.'
  ) @@ to_tsquery('english', 'enhanced & color') as "does it match ?"
does it match ?
(1 row)

So a query against the blog.post table would be

  to_tsvector(p.title || ' ' || p.content) @@ to_tsquery($*)

It is possible to use indices on that kind of searches by using a GIN index:

create index post_content_tsvector_idx
on blog.post
using gin(to_tsvector('english', title || ' ' || content))

This works because the query planner sees the query and the index both use the same immutable function with the same arguments and it brings back full text search queries in the realm of the milliseconds. But it has a major flaw: if a new column has to be used in the query, all SQL queries have to be rewritten to use the new index. This is acceptable for simple applications with limited ways of searching content but it can be a real problem with more complex search systems where content must be weighted by example. In this case, a column can be added to the table containing the pre-calculated tsvectors so queries will always be column @@ to_tsquery($*). This requires a trigger to catch write operations on the table to calculate the tsvector column every time. The good side of this method is to have the to_tsvector function call only in one place. In Postgres 9.4, it is possible to use the functions tsvector_update_trigger and tsvector_update_trigger_column as triggers (see the documentation).

text search vectors and Pomm model manager

Adding an extra column containing the vectors is not a big deal since it is easy not to hydrate entities with it:

    // PostModel class
    public function createProjection()
        return parent::createProjection()

Query the database as normal:

$posts = $session
        "precalculated_ts_vector @@ plainto_tsquery('english', $*)",
        ['my list of searched words']

The plainto_tsquery function does the simple job of stemming the input and adding a & between the remaining words. The function to_tsquery can be used to issue more complex queries. (see the documentation about it).

Of course, it is possible to tune the search, to weight some of the columns or modify results ranking to cross full text search with other criteria. There will be another article to explain how to index LIKE statements using the pg_trgm extension and make reactive autocomplete input fields.