Benchmarking

90x faster than pgvector — Lantern’s HNSW Index Creation Time

October 20, 2023 · 6 min read

Hi, I'm Varik. I'm a core database engineer at Lantern sharing some of my recent work on database internals.

Today, we dig into the very exciting world of index creation times — a critical database metric — and how Lantern has ~90x faster index creation times than key competitors on the market!

Here’s an overview of this post:

  • Why fast index creation matters to developers
  • Lantern’s single-core index creation performance
  • How Lantern enables 90x faster than pgvector index creation via external indexing

Fast index creation time leads to better applications

Lantern is a PostgreSQL vector database extension to help developers build AI applications. Lantern is built on Usearch, a state-of-the-art implementation of HNSW (Hierarchical Navigable Small World), which is the most scalable and performant algorithm for handling vector search.

HNSW relies on two parameters for index construction.

  • The M parameter controls the number of neighbors that each node in the graph will maintain.
  • The ef_construction parameter determines how many nodes will be traversed during index construction. Higher values lead to longer construction times but result in a higher-quality index.

Developers today generate multiple indexes to find the set of parameters that best match the size of data and the application’s needs, and index creation speeds matter because long index creation times lead to slow experimentation and slow application development.

Without parallelism, Lantern already outperforms Pinecone and Pgvector

For this demonstration, we use sift_base_1m dataset and wiki dataset generated with text-embedding-ada-002 model. The sift dataset has 128 dimensional vectors and wiki dataset has 1536 dimensional vectors.

First, create a table:

sql

copy code

CREATE TABLE wiki1m (id SERIAL, v REAL[]);
COPY wiki1m (v) FROM '/tmp/wiki1m.csv' WITH CSV;

Next, create an HNSW index on this data using the Lantern extension:

sql

copy code

CREATE EXTENSION IF NOT EXISTS lantern;
CREATE INDEX ON wiki1m USING hnsw (v) WITH (dim=128, m=8, ef_construction=128, ef=128);

On a Linode instance with 32 Cores and 64GB of RAM, the index creation process take the following time:

Tool

Sift

Vec/s (Sift)

Wiki

Vec/s (Wiki)

Lantern

8m 30s

~1960 vec/s

44m

~382 vec/s

Pinecone (p2x2 - 1 pod)

9m

~1818 vec/s

30m

~555 vec/s

Pgvector

46m

~361 vec/s

2h

~140 vec/s

Index Creation Speed (Single Core)

Using multi-core index creation, Lantern is 90x faster than pgvector

  • 17x performance improvement compared to creating the index on a single thread.
  • 90x performance improvement compared to pgvector, and a 6x improvement over Pinecone for sift dataset
  • 48x performance improvement over pgvector, as well as a 3x improvement over Pinecone with 32 pods
  • Pinecone index on 32 p2 pods costs $3,889.44 / month. 32 CPU Linode costs $576 / month. Lantern is over 6x cheaper and 6x faster!

Index Creation Speed (Multi Core)

Fine-tuning the recall using parameters m=16, ef_construction=128, and ef=128 for Lantern, we can achieve a 99% recall@5 for sift dataset, with the index creation taking only 50 seconds.

Notice also that a 2 Core Linode Server running Lantern outperforms Pinecone’s 32 pod cluster - this means that Lantern can be 60x cheaper than Pinecone for the same performance.

Index creation speed with Lantern Multi-Core Index

How external index creation enables parallelism

Lantern is the only vector database on the market that enables developers to create an index externally and then import the index as a file into the PostgreSQL database. This approach allows us to have the best index creation performance on the market.

External index creation is exciting because the core database remains unburdened during index creation, and performance improvements unlock faster optimization and a better developer experience.

Here's how our external index creation works:

  1. Create an index from a regular postgres table via the lantern-create-index utility
  2. Load the index into the database using the _experimental_index_path option.

Here is a walk through:

bash

copy code

wget https://github.com/lanterndata/lantern_extras/releases/download/0.0.1/lantern-create-index-0.0.1-amd64.tar
tar xf lantern-create-index-0.0.1-amd64.tar
./lantern-create-index --help

You can also build it from source following the docs. We will soon publish the binary to crates.io, so you can just use cargo install lantern-create-index to install it

Next, use the lantern-create-index binary to create our index:

bash

copy code

lantern-create-index \
    --uri postgresql://postgres@localhost:5432/testlive \
    --table wiki \
    --column v \
    -m 8 \
    --efc 128 \
    --ef 64 \
    -d 128 \
    --metric-kind cos \
    --out index.usearch

The index parameters remain the same, except for the --uri and --out arguments, which specify the database connection URI and output file name. The --metric-kind parameter determines the distance function used when calculating distances for vectors. In Lantern, you can define this using operator classes when creating the index. lantern-create-index creates the index in parallel, harnessing all the available cores.

Once we have the index file, index.usearch, we can import it into our Postgres database, obtaining a fully functional HNSW index for our data:

bash

copy code

CREATE INDEX ON sift1m USING hnsw (v) WITH (_experimental_index_path='/tmp/index.usearch');

Make sure that the user running postgres database has access to this file. If it is postgres user you can do chown postgres:postgres /tmp/index.usearch

You might be curious, how do we match the index with Postgres tuples under the hood? Here’s our secret: We use the row's ctid as a label for our nodes during index creation. Later, this label is used to retrieve the actual row from Postgres, as it represents the physical location of the row.

In an upcoming release, we’re planning to address the following concerns with this implementation:

  • Additional rows may be added to the table while creating and importing the index, careful handling is important
  • Row ctid's may change, causing our index to point to incorrect rows within its graphs

To test Lantern yourself, visit our repository here.

Hope you enjoyed this post, we are excited to keep releasing performance upgrades and tools to support developers.

If you’d like to test this code for your production application, we can offer 1:1 technical support, please email [support@lantern.dev](mailto:support@lantern.dev)

Authors

Varik Matevosyan

Varik Matevosyan

Software Engineer

Share this post