Company

Full Text Search + Vector Search with Postgres

December 16, 2023 ยท 11 min read

Search is a fundamental need for many applications. Consider a jobs platform like Y Combinator's Work at a Startup. Candidates use Work at a Startup to find for their next job. With thousands of jobs on the platform, candidates need an efficient way to sift through all of them to find jobs relevant to them. They may want to filter jobs based on location and salary, search for specific keywords, see most recent jobs first, or even receive personalized recommendations.

If you're using a database like Postgres, all of these search features are fully supported with Postgres and Postgres extensions. In this article, we'll explain how to leverage Postgres and the Lantern Postgres extension to support search, using job search as an example.

Setting up the database

In this article, we will assume the following simple setup. First, we create a table named jobs to store the job postings, with each job having a unique id, a title, a description, the country where the job is located, and the time the job was created created_at.

sql

CREATE TABLE jobs (
    id SERIAL PRIMARY KEY,
    title TEXT,
    description TEXT,
    country TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

Next, we populate the jobs table with some data:

sql

INSERT INTO jobs (title, description, country)
VALUES ('Front End Developer', 'We are looking for a React Developer to join our team. They should have experience with React or other frontend frameworks.', 'US'),
       ('Software Engineer', 'We are looking for a generalist Software Engineer to join our team. Must have minimum three years of experience.', 'CA'),
       ('Software Engineer', 'Our company builds mobile apps using Expo. The ideal candidate has experience with Expo or similar frameworks and is proficient in JavaScript or TypeScript.', 'UK'),
       ('Data Scientist', 'Seeking a Data Scientist with expertise in machine learning and big data technologies. Familiarity with Python and R is required.', 'AU'),
       ('Project Manager', 'We require a Project Manager with experience in software development projects. Excellent communication and leadership skills are essential.', 'US');

Using columns for search

You can use the inherent structure of your tables for search. On our case, we have a jobs table with title, description, country, and created_at columns. These columns can be used to filter results based on specific criteria. For example, if a candidate wants to find jobs located in the US posted in the last month, we can use the country and created_at columns to filter the results:

sql

SELECT * FROM jobs WHERE country = 'US' AND created_at > NOW() - INTERVAL '1 month';

To improve the performance of such queries, especially when dealing with large datasets, we can create an index. Indexing can significantly speed up data retrieval as it allows the database to quickly locate the data without having to scan every row in the table. For example, below we create an index on the country column to speed up queries like SELECT * FROM jobs WHERE country = 'US'.

sql

CREATE INDEX idx_jobs_country ON jobs (country);

Regex search for simple word queries

Postgres supports word search using regex matching, which can be useful for simple keyword queries. For instance, if a user wants to find jobs where the description contains the word "React", we can use the following query:

sql

SELECT * FROM jobs WHERE description ~* 'react';

However, regex matching involves sequential search and can be slow, especially for large datasets. It's also case-sensitive and doesn't support linguistic features such as stemming (e.g., including "program" results when the query is "programming") or stop word removal (e.g., removing "the" and "a" from the query). For more complex queries, full text search is a better option as it supports these features and is generally faster and more efficient.

Overview of Full Text Search API

Postgres provides a comprehensive set of functions for full text search. Detailed documentation can be found in the official Postgres docuemntation. Here, we provide a brief overview of the key functions.

Parsing documents

Postgres uses the tsvector datatype to store text in a format that facilitates search. The ts_vector function is used to convert text into this format.

Function

Description

Function Definition

Example Usage

to_tsvector

Convert text to a tsvector data type

to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector('english', 'The quick brown fox jumps over the lazy dog.')

Parsing Queries

Postgres provides several functions to convert text input into tsquery data types, which are used to search tsvector data types. The most basic function is to_tsquery, which expects query text format. Query text format consists of single tokens separated by the tsquery operators & (and), | (or), ! (not), and <-> (followed by). These can be grouped together using parentheses. The other functions plainto_tsquery, phraseto_tsquery, and websearch_to_tsquery convert plain, unformatted text into tsquery datatypes, joining words with different operators or applying web search conventions.

Function

Description

Function Definition

Example Usage

to_tsquery

Convert text that is formatted with query text rules to a tsquery data type

to_tsquery([ config regconfig, ] querytext text) returns tsquery

to_tsquery('english', 'react')

plainto_tsquery

Convert plain text to a tsquery data type, joining words with &

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

plainto_tsquery('react software engineer')

phraseto_tsquery

Convert plain text to a tsquery data type, joining words with <->

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery

phraseto_tsquery('react native')

websearch_to_tsquery

Convert text to a tsquery data type, converting quoted text to <->, "or" to \|, "-" to !, and joining other text with &

websearch_to_tsquery([ config regconfig, ] query text) returns tsquery

websearch_to_tsquery('english', '"react native" -expo')

Ranking Search Results

Postgres provides functions to rank search results, such as ts_rank and ts_rank_cd.

Function

Description

Function Definition

Example Usage

ts_rank

Rank search results using frequency

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

ts_rank(to_tsvector('english', 'The quick brown fox jumps over the lazy dog.'), to_tsquery('foxes'))

ts_rank_cd

Rank search results using Cover Density Ranking

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

ts_rank_cd(to_tsvector('english', 'The quick brown fox jumps over the lazy dog.'), to_tsquery('foxes'))

Highlighting results

The ts_headline function is used to highlight search results, providing context for the relevance of the search results.

Function

Description

Function Definition

Example Usage

ts_headline

Highlights search results

ts_headline([ config regconfig, ] document text, query tsquery [, start_offset integer, stop_offset integer ]) returns text

ts_headline('The quick brown fox jumps over the lazy dog.', to_tsquery('foxes'))

Example of Full Text Search

PostgreSQL's full-text search functionality allows us to perform complex text searches. This is achieved by converting documents, such as job descriptions, into vectors of the tsvector data type. The search query is also converted into a tsquery data type. The tsquery data type is then used to search over the tsvector data type for matches.

Searching for a single word

For instance, if we want to search for jobs where the description contains the word "React", we can use the following query. This query converts the job description column to a tsvector data type, converts the search term 'React' to a tsquery data type, and uses the @@ operator to search for matches.

sql

SELECT * FROM jobs WHERE to_tsvector('english', description) @@ to_tsquery('react');

Searching for an Exact Phrase and an Additional Word

To search for jobs that contain the exact phrase "React Native" and also the word "Expo", we can use the websearch_to_tsquery function. This function converts the search term to a tsquery data type, ensuring that "React Native" is treated as a single entity and "Expo" is also included in the search.

sql

SELECT * FROM jobs WHERE to_tsvector('english', job_description) @@ websearch_to_tsquery('english', '"React Native" Expo');

Improving Search Performance with Indexing

To improve the performance of the above queries, we can add a generated column to the table and create an index on it. This allows the queries to use an index, making them faster.

sql

ALTER TABLE jobs ADD COLUMN description_tsvector tsvector GENERATED ALWAYS AS (to_tsvector('english', description)) STORED;
CREATE INDEX idx_description_tsvector ON jobs USING GIN (description_tsvector);

SELECT * FROM jobs WHERE description_tsvector @@ to_tsquery('react');
SELECT * FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', '"React Native" Expo');

Ordering Results by Relevance

We may want to order the results by relevance. One way to approximate relevance is by frequency, which can be achieved using the ts_rank function, which returns the number of matching lexemes (words or phrases) in the document.

sql

SELECT * FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'expo')
  ORDER BY description_tsvector @@ websearch_to_tsquery('english', 'expo') DESC;

Finally, to render the results with the matching section highlighted, we can use the ts_headline function. This function highlights the words in the results that match the search term.

sql

postgres=# SELECT id, ts_headline('english', description, websearch_to_tsquery('english', 'expo')) AS description
  FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'expo');
 id |                                                      description
----+------------------------------------------------------------------------------------------------------------------------
  3 | <b>Expo</b>. The ideal candidate has experience with <b>Expo</b> or similar frameworks and is proficient in JavaScript
(1 row)

Vector search for sentence queries

Vector search is a powerful tool for improving the relevance of search results in PostgreSQL, particularly for sentence queries. It allows for ordering results based on semantic similarity, which can be more nuanced than keyword matching.

Ordering by vector distance

To order search results by vector distance, assuming there is a match, you can use the following query. This assumes that description_embedding is a column storing vector embeddings of the job descriptions, and text_embedding is a function that generates a vector embedding for the search query. Read more on how to generate an embedding column here, and how to generate one-off embeddings here.

sql

SELECT
  id
FROM
  jobs
WHERE
  description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
  description_embedding <-> text_embedding('BAAI/bge-small-en', 'react native')

Using vector search as a tiebreaker

Vector search can also be used as a tiebreaker for other ranking functions, such as the number of occurrences indicated by ts_rank. This can be particularly useful when multiple results have similar keyword relevance.

sql

SELECT id FROM jobs WHERE description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
  ts_rank(description_tsvector, to_tsquery('react native')),
  description_embedding <-> text_embedding('BAAI/bge-small-en', 'react native'),

Personalized Search Results

Vector search is well-suited for personalized search results. By generating an embedding from a candidate's profile and comparing it against job embeddings, you can find jobs that are most similar to the candidate's interests and experiences.

Here's how you might sort results purely by personalized match, assuming there is a match:

sql

SELECT
  id
FROM
  jobs
WHERE
  description_tsvector @@ websearch_to_tsquery('english', 'react native')
ORDER BY
  description_embedding <-> (SELECT profile_embedding FROM candidates WHERE id = 1)

In this query, profile_embedding would be a column in the candidates table that stores the vector embedding of the candidate's profile.

By leveraging vector search, Postgres can perform searches that consider the semantic meaning of sentences, providing more relevant and personalized search results. This is particularly useful for applications like job search platforms, where matching candidates with suitable job listings is crucial.

Putting it all together

In summary, Postgres, enhanced by the Lantern extension, provides a powerful toolkit for implementing advanced search features in applications. The use of tsvector and tsquery data types from Postgres full-text search, vector search from Lantern, and indexes to improve performance enables Postgres to meet a wide range of search requirements. It allows developers to create efficient, user-friendly search experiences, whether it's sifting through job listings or providing personalized recommendations.

Authors

Di Qi

Di Qi

Cofounder

Share this post