Migrating from Pinecone to Lantern

January 6, 2024 · 6 min read

Di Qi

Cofounder

Pinecone is a closed-source, cloud-native vector database. It allows you to efficiently search over vectors to find the closest matches to a query vector. For example, if you store vector representations of jobs, you can use Pinecone to quickly find the most similar jobs to a query job vector.

Pinecone is only a vector database, meaning you have to use it in addition to a traditional database like Postgres. You would need to store your jobs data in Postgres, and duplicate your data in Pinecone.

In contrast, Postgres is a general-purpose database that can store any type of data, including vectors. Lantern extends Postgres to support vector search, vector generation, and efficient indexing. With Lantern, you can store and query all of your data, including your vector data, in one place.

We built a Python library, lantern-pinecone, to support migrating your data from Pinecone to Lantern in just a few lines of code. In this article, we'll describe how we built it, and how you can use it.

How data works in Pinecone vs. Lantern

Postgres

With a traditional database, data is stored in tables. Each table stores a specific type of data and contains various columns. For example, a jobs platform might have a jobs table with columns for id, title, company, country, and embedding, where embedding is a vector column that stores the vector representation of the jobs. Similarly, the database might contain a users table with columns for id, name, country, and embedding.
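
To make this concrete, here is a minimal sketch of what creating such a jobs table might look like from Python with psycopg2. The schema and connection string are hypothetical, and we assume three-dimensional embeddings for brevity:

```python
import psycopg2

# Hypothetical connection string; replace with your own database URL.
conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
cur = conn.cursor()

# A jobs table where `embedding` holds the vector representation of each job.
# Lantern works with Postgres real[] columns; we use 3 dimensions for brevity.
cur.execute("""
    CREATE TABLE jobs (
        id        SERIAL PRIMARY KEY,
        title     TEXT,
        company   TEXT,
        country   TEXT,
        embedding REAL[3]
    );
""")
conn.commit()
```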

An index is a data structure that enables more efficient data querying. In Postgres, you can add indexes to a single column to more efficiently query over that column. For example, by indexing the embedding column for the jobs table, you can efficiently query for jobs that are similar to a given vector.

You can also add indexes over multiple columns, allowing for efficient queries over rows with a specific combination of values. For example, you can add an index over the country and embedding columns, which allows you to efficiently query for jobs in a specific country that are similar to a given vector.
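
As a rough sketch, and assuming Lantern's hnsw index access method and dist_cos_ops operator class (the exact names may vary by version; check the Lantern docs), creating and querying such an index might look like this, continuing from the snippet above:

```python
# Continuing with the cursor from the previous snippet.

# HNSW index over the embedding column (access method and operator class
# names are assumed here; consult the Lantern docs for your version).
cur.execute("CREATE INDEX ON jobs USING hnsw (embedding dist_cos_ops);")
conn.commit()

# Nearest-neighbor query restricted to one country: the 10 jobs most
# similar to a hypothetical query vector. The <-> distance operator is
# the one used in Lantern's examples.
cur.execute(
    """
    SELECT id, title
    FROM jobs
    WHERE country = %s
    ORDER BY embedding <-> %s::real[]
    LIMIT 10;
    """,
    ("US", [0.1, 0.2, 0.3]),
)
print(cur.fetchall())
```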

Pinecone

Pinecone is a vector database: it focuses on efficient querying over vectors. It does not allow you to query for other types of data, such as jobs in a specific country, or jobs that a user has applied to. To build most applications, then, you would also need to store your data in a traditional database like Postgres.

With Pinecone, data is stored in indexes, where each index contains vectors of the same dimensionality, typically generated from the same embedding model.

An index in Pinecone can be partitioned into namespaces. If the same embedding model is used for different types of data, such as jobs and users, these can be stored in separate namespaces within the same index. If different embedding models are used, jobs and users can be stored in separate indexes.

The primary fields that Pinecone stores are ID and vector. However, Pinecone also supports storing other types of data in a generic metadata field. This field can store additional information, such as country in our jobs example, which can be used for filtering when querying the index.
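
For example, with Pinecone's Python client, attaching metadata at upsert time and filtering on it at query time might look like the following sketch (pinecone-client v2-style API assumed; the index name, namespace, and values are hypothetical):

```python
import pinecone

pinecone.init(api_key="<your_pinecone_api_key>", environment="us-west1-gcp")
index = pinecone.Index("workatastartup")

# Each record is (id, vector, metadata); `country` lives in the metadata field.
index.upsert(
    vectors=[("job-1", [0.1, 0.2, 0.3], {"country": "US"})],
    namespace="jobs",
)

# Filter on metadata when querying the index.
results = index.query(
    vector=[0.1, 0.2, 0.3],
    top_k=5,
    filter={"country": {"$eq": "US"}},
    namespace="jobs",
)
```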

Our Pinecone export tool

Our goal was to create a seamless way to migrate indexes and namespaced indexes in Pinecone to indexed tables in Lantern. This lets users keep the efficient vector querying they had in Pinecone while also benefiting from the flexibility and versatility of a traditional database like Postgres.

We built a Python client, lantern-pinecone, to automatically migrate your data from Pinecone to Postgres with just a few lines of code.

For example, if you have an index called workatastartup with namespaces jobs and users, you can follow the instructions below to automatically generate Postgres tables called workatastartup_jobs and workatastartup_users with the same data.

Challenges

Pinecone does not have built-in support for exporting an index, and it advises companies to keep a copy of their source data outside of Pinecone.

Using Pinecone's Python client, pinecone-client, it's possible to retrieve your data using the fetch function if you have your data IDs. However, there is no API to retrieve the IDs themselves from Pinecone. The current workaround is to randomly generate IDs, 10,000 at a time, until you have retrieved all of your data.
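
As a rough sketch of that workaround, assuming (as in the example later in this post) that the IDs are stringified sequential integers:

```python
import pinecone

# pinecone-client v2-style API assumed; names are hypothetical.
pinecone.init(api_key="<your_pinecone_api_key>", environment="us-west1-gcp")
index = pinecone.Index("workatastartup")

# Fetch vectors in batches of candidate IDs; IDs that don't exist are simply
# absent from the response. Here we assume the IDs are "0", "1", "2", ...
vectors = {}
batch_size = 1_000
for start in range(0, 100_000, batch_size):
    candidate_ids = [str(i) for i in range(start, start + batch_size)]
    response = index.fetch(ids=candidate_ids, namespace="jobs")
    vectors.update(response.vectors)
```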

We abstract this step away using our client. If you don't have your IDs, simply leave the pinecone_ids parameter blank, and we will automatically fetch your IDs from Pinecone using the workaround.

If you are currently storing fewer than 10,000 rows in Pinecone, this should execute quickly. Unfortunately, if you have more than 10,000 rows, it can take longer, depending on how many rows you have.

Getting Started

To get started, install the lantern-pinecone client:

```bash
pip install lantern-pinecone
```

Then, you can use the create_from_pinecone function to automatically migrate your data from Pinecone to Postgres by providing your Pinecone API key, environment, index name, and namespace. If you have the IDs, you can provide them here; if not, we will automatically fetch them from Pinecone.

```python
import lantern_pinecone

# IDs to migrate; here we assume stringified sequential integers 0..99999.
pinecone_ids = [str(i) for i in range(100_000)]

index = lantern_pinecone.create_from_pinecone(
    api_key="<your_pinecone_api_key>",
    environment="<your_pinecone_environment>",  # e.g., 'us-west1-gcp'
    index_name="<pinecone_index_name>",         # e.g., 'my-index'
    namespace="<pinecone_namespace>",           # e.g., 'jobs'
    pinecone_ids=pinecone_ids                   # e.g., ['1', '2', '3', ...]
)
```

This function initializes a Pinecone client with the given API key and environment, and copies the data from the index namespace into a new table in Postgres. You can optionally specify the HNSW index parameters m, ef, and ef_construction; by default, we use m=12, ef=64, and ef_construction=64. If you have already migrated your data, you can pass recreate to redo the migration. You can also set create_lantern_index to False if you only want to migrate your data without creating an HNSW index.
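
For example, a call that tunes these parameters might look like the following sketch (keyword names as described above; the values are illustrative):

```python
# Redo a previous migration with custom HNSW index parameters.
index = lantern_pinecone.create_from_pinecone(
    api_key="<your_pinecone_api_key>",
    environment="<your_pinecone_environment>",
    index_name="<pinecone_index_name>",
    namespace="<pinecone_namespace>",
    pinecone_ids=pinecone_ids,
    m=16,                 # HNSW graph connectivity (default 12)
    ef=128,               # search-time candidate list size (default 64)
    ef_construction=128,  # build-time candidate list size (default 64)
    recreate=True,        # redo a migration that has already run
)

# Or copy the data only, without building an HNSW index:
# index = lantern_pinecone.create_from_pinecone(..., create_lantern_index=False)
```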

Interacting with your data

For people who are currently using Pinecone, we extended this client to support the same API as Pinecone. Hopefully, this makes the process of migrating from Pinecone to Lantern easier.

For example:

```python
import os
import lantern_pinecone

DB_URL = os.environ.get("DB_URL")
lantern_pinecone.init(DB_URL)

# Initialize the index
index_name = "test_index"
lantern_pinecone.create_index(name=index_name, dimension=3, metric="cosine")
index = lantern_pinecone.Index(index_name=index_name)

# Insert data into the index
data = [
    ("1", [1, 0, 0]),
    ("2", [0, 1, 0]),
    ("3", [0, 0, 1])
]
index.upsert(vectors=data)

# Check index stats
stats = index.describe_index_stats()
assert (stats['total_count'] == 3)

# Query the index
results = index.query(vector=[0, 1, 0], top_k=2, include_values=True)
assert (len(results['matches']) == 2)
assert (results['matches'][0]['id'] == "2")

# Delete the index
lantern_pinecone.delete_index(index_name)
assert (index_name not in lantern_pinecone.list_indexes())
```

Conclusion

Let us know what you think of our Pinecone to Postgres migration tool! If you're migrating from Pinecone to Postgres, or just evaluating Postgres, we hope it makes your life easier. Let us know if you have any questions or feedback at support@lantern.dev.
