Learn·Search·Postgres
Full Text Search + Vector Search with Postgres
December 16, 2023 · 11 min read
Search is a fundamental need for many applications. Consider a jobs platform like Y Combinator's Work at a Startup. Candidates use Work at a Startup to find for their next job. With thousands of jobs on the platform, candidates need an efficient way to sift through all of them to find jobs relevant to them. They may want to filter jobs based on location and salary, search for specific keywords, see most recent jobs first, or even receive personalized recommendations.
If you're using a database like Postgres, all of these search features are fully supported with Postgres and Postgres extensions. In this article, we'll explain how to leverage Postgres and the Lantern Postgres extension to support search, using job search as an example.
Setting up the database
In this article, we will assume the following simple setup. First, we create a table named jobs
to store the job postings, with each job having a unique id
, a title
, a description
, the country
where the job is located, and the time the job was created created_at
.
sql
CREATE TABLE jobs (
id SERIAL PRIMARY KEY,
title TEXT,
description TEXT,
country TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Next, we populate the jobs
table with some data:
sql
INSERT INTO jobs (title, description, country)
VALUES ('Front End Developer', 'We are looking for a React Developer to join our team. They should have experience with React or other frontend frameworks.', 'US'),
('Software Engineer', 'We are looking for a generalist Software Engineer to join our team. Must have minimum three years of experience.', 'CA'),
('Software Engineer', 'Our company builds mobile apps using Expo. The ideal candidate has experience with Expo or similar frameworks and is proficient in JavaScript or TypeScript.', 'UK'),
('Data Scientist', 'Seeking a Data Scientist with expertise in machine learning and big data technologies. Familiarity with Python and R is required.', 'AU'),
('Project Manager', 'We require a Project Manager with experience in software development projects. Excellent communication and leadership skills are essential.', 'US');
Using columns for search
You can use the inherent structure of your tables for search. On our case, we have a jobs
table with title
, description
, country
, and created_at
columns. These columns can be used to filter results based on specific criteria. For example, if a candidate wants to find jobs located in the US posted in the last month, we can use the country
and created_at
columns to filter the results:
sql
SELECT * FROM jobs WHERE country = 'US' AND created_at > NOW() - INTERVAL '1 month';
To improve the performance of such queries, especially when dealing with large datasets, we can create an index. Indexing can significantly speed up data retrieval as it allows the database to quickly locate the data without having to scan every row in the table. For example, below we create an index on the country
column to speed up queries like SELECT * FROM jobs WHERE country = 'US'
.
sql
CREATE INDEX idx_jobs_country ON jobs (country);
Regex search for simple word queries
Postgres supports word search using regex matching, which can be useful for simple keyword queries. For instance, if a user wants to find jobs where the description contains the word "React", we can use the following query:
sql
SELECT * FROM jobs WHERE description ~* 'react';
However, regex matching involves sequential search and can be slow, especially for large datasets. It's also case-sensitive and doesn't support linguistic features such as stemming (e.g., including "program" results when the query is "programming") or stop word removal (e.g., removing "the" and "a" from the query). For more complex queries, full text search is a better option as it supports these features and is generally faster and more efficient.
Overview of Full Text Search API
Postgres provides a comprehensive set of functions for full text search. Detailed documentation can be found in the official Postgres docuemntation. Here, we provide a brief overview of the key functions.
Parsing documents
Postgres uses the tsvector
datatype to store text in a format that facilitates search. The ts_vector
function is used to convert text into this format.
Function | Description | Function Definition | Example Usage |
---|---|---|---|
| Convert text to a |
|
|
Parsing Queries
Postgres provides several functions to convert text input into tsquery
data types, which are used to search tsvector
data types. The most basic function is to_tsquery
, which expects query text format. Query text format consists of single tokens separated by the tsquery operators &
(and), |
(or), !
(not), and <->
(followed by). These can be grouped together using parentheses. The other functions plainto_tsquery
, phraseto_tsquery
, and websearch_to_tsquery
convert plain, unformatted text into tsquery
datatypes, joining words with different operators or applying web search conventions.
Function | Description | Function Definition | Example Usage |
---|---|---|---|
| Convert text that is formatted with query text rules to a |
|
|
| Convert plain text to a |
|
|
| Convert plain text to a |
|
|
| Convert text to a |
|
|
Ranking Search Results
Postgres provides functions to rank search results, such as ts_rank
and ts_rank_cd
.
Function | Description | Function Definition | Example Usage |
---|---|---|---|
| Rank search results using frequency |
|
|
| Rank search results using Cover Density Ranking |
|
|
Highlighting results
The ts_headline
function is used to highlight search results, providing context for the relevance of the search results.
Function | Description | Function Definition | Example Usage |
---|---|---|---|
| Highlights search results |
|
|
Example of Full Text Search
PostgreSQL's full-text search functionality allows us to perform complex text searches. This is achieved by converting documents, such as job descriptions, into vectors of the tsvector
data type. The search query is also converted into a tsquery
data type. The tsquery
data type is then used to search over the tsvector
data type for matches.
Searching for a single word
For instance, if we want to search for jobs where the description contains the word "React", we can use the following query. This query converts the job description
column to a tsvector
data type, converts the search term 'React' to a tsquery
data type, and uses the @@
operator to search for matches.
sql
SELECT * FROM jobs WHERE to_tsvector('english', description) @@ to_tsquery('react');
Searching for an Exact Phrase and an Additional Word
To search for jobs that contain the exact phrase "React Native" and also the word "Expo", we can use the websearch_to_tsquery
function. This function converts the search term to a tsquery
data type, ensuring that "React Native" is treated as a single entity and "Expo" is also included in the search.
sql
SELECT * FROM jobs WHERE to_tsvector('english', job_description) @@ websearch_to_tsquery('english', '"React Native" Expo');
Improving Search Performance with Indexing
To improve the performance of the above queries, we can add a generated column to the table and create an index on it. This allows the queries to use an index, making them faster.
sql
ALTER TABLE jobs ADD COLUMN description_tsvector tsvector GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;
CREATE INDEX idx_description_tsvector ON jobs USING GIN (description_tsvector);
SELECT * FROM jobs WHERE description_tsvector @@ to_tsquery('react');
SELECT * FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', '"React Native" Expo');
Ordering Results by Relevance
We may want to order the results by relevance. One way to approximate relevance is by frequency, which can be achieved using the ts_rank
function, which returns the number of matching lexemes (words or phrases) in the document.
sql
SELECT * FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'expo')
ORDER BY description_tsvector @@ websearch_to_tsquery('english', 'expo') DESC;
Finally, to render the results with the matching section highlighted, we can use the ts_headline
function. This function highlights the words in the results that match the search term.
sql
postgres=# SELECT id, ts_headline('english', description, websearch_to_tsquery('english', 'expo')) AS description
FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'expo');
id | description
----+------------------------------------------------------------------------------------------------------------------------
3 | <b>Expo</b>. The ideal candidate has experience with <b>Expo</b> or similar frameworks and is proficient in JavaScript
(1 row)
Vector search for sentence queries
Vector search is a powerful tool for improving the relevance of search results in PostgreSQL, particularly for sentence queries. It allows for ordering results based on semantic similarity, which can be more nuanced than keyword matching.
Ordering by vector distance
To order search results by vector distance, assuming there is a match, you can use the following query. This assumes that description_embedding
is a column storing vector embeddings of the job descriptions, and text_embedding
is a function that generates a vector embedding for the search query. Read more on how to generate an embedding column here, and how to generate one-off embeddings here.
sql
SELECT
id
FROM
jobs
WHERE
description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
description_embedding <-> text_embedding('BAAI/bge-small-en', 'react native')
Using vector search as a tiebreaker
Vector search can also be used as a tiebreaker for other ranking functions, such as the number of occurrences indicated by ts_rank
. This can be particularly useful when multiple results have similar keyword relevance.
sql
SELECT id FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
ts_rank(description_tsvector, to_tsquery('react native')),
description_embedding <-> text_embedding('BAAI/bge-small-en', 'react native'),
Personalized Search Results
Vector search is well-suited for personalized search results. By generating an embedding from a candidate's profile and comparing it against job embeddings, you can find jobs that are most similar to the candidate's interests and experiences.
Here's how you might sort results purely by personalized match, assuming there is a match:
sql
SELECT
id
FROM
jobs
WHERE
description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
description_embedding <-> (SELECT profile_embedding FROM candidates WHERE id = 1)
In this query, profile_embedding
would be a column in the candidates table that stores the vector embedding of the candidate's profile.
By leveraging vector search, Postgres can perform searches that consider the semantic meaning of sentences, providing more relevant and personalized search results. This is particularly useful for applications like job search platforms, where matching candidates with suitable job listings is crucial.
Putting it all together
In summary, Postgres, enhanced by the Lantern extension, provides a powerful toolkit for implementing advanced search features in applications. The use of tsvector
and tsquery
data types from Postgres full-text search, vector search from Lantern, and indexes to improve performance enables Postgres to meet a wide range of search requirements. It allows developers to create efficient, user-friendly search experiences, whether it's sifting through job listings or providing personalized recommendations.