Posted 6 years ago.
Postgresql proposes several ways for searching text. Of course there is the good ol’ LIKE
operator (note that Postgres also offers a ILIKE
operator for case insensitive search)
Postgresql newcomers may not know it is possible to use regular expressions to search for text patterns using the ~
and ~*
(case insensitive) operators:
select
p.slug,
p.title,
p.author_id,
p.content
from
blog.post p
where
p.content ~* '[[:<:]]postgres(ql)?[[:>:]]'
The example above looks for blog posts containing the words postgres
or postgresql
case insensitive.
Both LIKE and regular expression operators can cause performance problems as these kinds of queries may not be able use indexes so it means a full table scan every time. This is possible for small fields on a limited number of records but it is unacceptable if the table is huge and the text field is large. Furthermore, it is hard to match approaching words which is something often desired when looking for wide text fields.
To circumvent these problems, Postgres comes with a full text search engine. By default it supports 15 languages:
Of course, it is possible to install more dictionaries if needed. Let’s see how it works. When the text is cast as text search vecrtors (tsvector
type), it is stripped from stop words and turned into searchable lexemes:
select to_tsvector(
'english',
'This software does automatically enhance the colors, especially RGB components.'
);
to_tsvector
───────────────────────────────────────────────────────────────────────────
'automat':4 'color':7 'compon':10 'enhanc':5 'especi':8 'rgb':9 'softwar':2
(1 row)
Searches in tsvectors can be made using the @@
operator and a ts_query
type. If the above text is searched for the words enhanced colors
, even though it does not contain the exact words, it still matches the criteria:
select
to_tsvector(
'english',
'This software does automatically enhance the colors, especially RGB components.'
) @@ to_tsquery('english', 'enhanced & color') as "does it match ?"
;
does it match ?
───────────────
t
(1 row)
So a query against the blog.post
table would be
select
p.slug,
p.title,
p.author_id,
p.content
where
to_tsvector(p.title || ' ' || p.content) @@ to_tsquery($*)
;
It is possible to use indices on that kind of searches by using a GIN
index:
create index post_content_tsvector_idx
on blog.post
using gin(to_tsvector('english', title || ' ' || content))
;
This works because the query planner sees the query and the index both use the same immutable function with the same arguments and it brings back full text search queries in the realm of the milliseconds. But it has a major flaw: if a new column has to be used in the query, all SQL queries have to be rewritten to use the new index. This is acceptable for simple applications with limited ways of searching content but it can be a real problem with more complex search systems where content must be weighted by example. In this case, a column can be added to the table containing the pre-calculated tsvectors so queries will always be column @@ to_tsquery($*)
. This requires a trigger to catch write operations on the table to calculate the tsvector column every time. The good side of this method is to have the to_tsvector
function call only in one place. In Postgres 9.4, it is possible to use the functions tsvector_update_trigger
and tsvector_update_trigger_column
as triggers (see the documentation).
Adding an extra column containing the vectors is not a big deal since it is easy not to hydrate entities with it:
// PostModel class
public function createProjection()
{
return parent::createProjection()
->unsetField('precalculated_ts_vector')
;
}
Query the database as normal:
$posts = $session
->getModel(PostModel::class)
->findWhere(
"precalculated_ts_vector @@ plainto_tsquery('english', $*)",
['my list of searched words']
);
The plainto_tsquery
function does the simple job of stemming the input and adding a &
between the remaining words. The function to_tsquery
can be used to issue more complex queries. (see the documentation about it).
Of course, it is possible to tune the search, to weight some of the columns or modify results ranking to cross full text search with other criteria. There will be another article to explain how to index LIKE
statements using the pg_trgm
extension and make reactive autocomplete input fields.