Pinecone is a closed-source, cloud-native vector database. It allows you to efficiently search over vectors to find the closest matches to a query vector. For example, if you store vector representations of jobs, you can use Pinecone to quickly find the most similar jobs to a query job vector.
Pinecone is only a vector database, meaning you have to use it in addition to a traditional database like Postgres: you would store your jobs data in Postgres and duplicate it in Pinecone.
In contrast, Postgres is a general-purpose database that can store any type of data, including vectors. Lantern extends Postgres to support vector search, vector generation, and efficient indexing. With Lantern, you can store and query all of your data, including your vector data, in one place.
We built a Python library, `lantern-pinecone`, to support migrating your data from Pinecone to Lantern in just a few lines of code. In this article, we'll describe how we built it, and how you can use it.
How data works in Pinecone vs. Lantern
Postgres
With a traditional database, data is stored in tables. Each table stores a specific type of data and contains various columns. For example, a jobs platform might have a `jobs` table with columns for `id`, `title`, `company`, `country`, and `embedding`, where `embedding` is a vector column that stores the vector representation of each job. Similarly, the database might contain a `users` table with columns for `id`, `name`, `country`, and `embedding`.
An index is a data structure that enables more efficient data querying. In Postgres, you can add indexes to a single column to more efficiently query over that column. For example, by indexing the `embedding` column of the `jobs` table, you can efficiently query for jobs that are similar to a given vector.
You can also add indexes over multiple columns, allowing for efficient queries over rows with a specific combination of values. For example, you can add an index over the `country` and `embedding` columns, which allows you to efficiently query for jobs in a specific country that are similar to a given vector.
Pinecone
Pinecone is a vector database, which means it focuses on efficient querying over vectors. It does not support general queries over other kinds of data, such as finding the jobs that a given user has applied to. As a result, to build most applications, you would still need to store your data in a traditional database like Postgres.
With Pinecone, data is stored in indexes, where each index contains vectors of the same dimensionality, typically generated from the same embedding model.
An index in Pinecone can be partitioned into namespaces. If the same embedding model is used for different types of data, such as jobs and users, these can be stored in separate namespaces within the same index. If different embedding models are used, jobs and users can be stored in separate indexes.
The primary fields that Pinecone stores are ID and vector. However, Pinecone also supports storing other types of data in a generic metadata field. This field can store additional information, such as `country` in our jobs example, which can be used for filtering when querying the index.
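For example, a metadata-filtered query looks roughly like this with `pinecone-client` (the query vector and filter values here are hypothetical; the index and namespace names reuse the example below):

```python
import pinecone

pinecone.init(api_key="your-pinecone-api-key", environment="us-west1-gcp")
index = pinecone.Index("workatastartup")

query_vector = [0.1, 0.2, 0.3]  # an embedding of the query job

# Return the 5 most similar jobs, restricted via metadata to one country.
results = index.query(
    vector=query_vector,
    namespace="jobs",
    top_k=5,
    filter={"country": {"$eq": "US"}},
)
```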
Our Pinecone export tool
Our goal was to create a seamless way to migrate indexes and namespaced indexes in Pinecone to indexed tables in Lantern. This lets users keep the efficient vector querying they had in Pinecone while also gaining the flexibility and versatility of a traditional database like Postgres.
We built a Python client, `lantern-pinecone`, to automatically migrate your data from Pinecone to Postgres with just a few lines of code.
For example, if you have an index called `workatastartup` with a namespace called `jobs` and a namespace called `users`, you can follow the instructions below to automatically generate tables in Postgres called `workatastartup_jobs` and `workatastartup_users` with the same data.
Challenges
Pinecone does not have built-in support for exporting an index, and it advises companies to keep a copy of their source data outside of Pinecone.
Using Pinecone's Python client, `pinecone-client`, it's possible to retrieve your data using the `fetch` function if you have your data IDs. However, there is no API to retrieve your data IDs from Pinecone itself. The current workaround is to randomly generate 10,000 IDs at a time until you have retrieved all of your data.
We abstract this step away using our client. If you don't have your IDs, simply leave the `pinecone_ids` parameter blank, and we will automatically fetch your IDs from Pinecone using the workaround.
If you are currently storing under 10,000 rows in Pinecone, this should execute quickly. Unfortunately, if you have over 10,000 rows, it can take longer, depending on the number of rows you have.
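For the curious, the sketch below shows the general shape of this workaround using `pinecone-client`. It assumes IDs are stringified integers, as in the snippet in the next section; our client's actual implementation may differ:

```python
import pinecone

pinecone.init(api_key="your-pinecone-api-key", environment="us-west1-gcp")
index = pinecone.Index("workatastartup")

recovered = {}
# Generate candidate IDs in batches and fetch them; fetch() simply omits
# IDs that do not exist, so only real rows come back. This is only
# practical when IDs follow a guessable pattern such as integers.
for start in range(0, 100_000, 1_000):
    candidates = [str(i) for i in range(start, start + 1_000)]
    response = index.fetch(ids=candidates, namespace="jobs")
    recovered.update(response.vectors)

print(f"recovered {len(recovered)} vectors")
```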
Getting Started
To get started, install the `lantern-pinecone` client:
```bash
pip install lantern-pinecone
```
Then, you can use the `create_from_pinecone` function to automatically migrate your data from Pinecone to Postgres by providing your Pinecone API key, environment, index name, and namespace. If you have the IDs, you can provide them here; if not, we will automatically fetch them from Pinecone.
```python
import os
import lantern_pinecone

# Connect to your Postgres database (see the example further below).
lantern_pinecone.init(os.environ.get("DB_URL"))

# IDs "0" through "99999"; omit pinecone_ids to have them fetched for you.
pinecone_ids = [str(i) for i in range(100_000)]

index = lantern_pinecone.create_from_pinecone(
    api_key="your-pinecone-api-key",
    environment="us-west1-gcp",   # your Pinecone environment
    index_name="workatastartup",  # your Pinecone index name
    namespace="jobs",             # your Pinecone namespace
    pinecone_ids=pinecone_ids,    # e.g., ['1', '2', '3', ...]
)
```
This function initializes a Pinecone client with the given API key and environment, and copies the data from the index namespace to a new table in Postgres. You can optionally specify the HNSW index parameters `m`, `ef`, and `ef_construction` to use for the index. By default, we use `m=12`, `ef=64`, and `ef_construction=64`. You can use `recreate` to redo the migration if you have already migrated your data. You can also set `create_lantern_index` to `False` if you only want to migrate your data and do not want to create an HNSW index.
Interacting with your data
For people who are currently using Pinecone, we extended this client to support the same API as Pinecone. Hopefully, this makes the process of migrating from Pinecone to Lantern easier.
For example:
```python
import os
import lantern_pinecone

DB_URL = os.environ.get("DB_URL")
lantern_pinecone.init(DB_URL)

# Initialize the index
index_name = "test_index"
lantern_pinecone.create_index(name=index_name, dimension=3, metric="cosine")
index = lantern_pinecone.Index(index_name=index_name)

# Insert data into the index
data = [
    ("1", [1, 0, 0]),
    ("2", [0, 1, 0]),
    ("3", [0, 0, 1]),
]
index.upsert(vectors=data)

# Check index stats
stats = index.describe_index_stats()
assert stats['total_count'] == 3

# Query the index
results = index.query(vector=[0, 1, 0], top_k=2, include_values=True)
assert len(results['matches']) == 2
assert results['matches'][0]['id'] == "2"

# Delete the index
lantern_pinecone.delete_index(index_name)
assert index_name not in lantern_pinecone.list_indexes()
```
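Since the data behind these indexes lives in ordinary Postgres tables, you can also reach it with plain SQL alongside the rest of your application data. The table name below is hypothetical; inspect your database for the tables the migration actually created:

```python
import os
import psycopg2

conn = psycopg2.connect(os.environ["DB_URL"])
with conn, conn.cursor() as cur:
    # Hypothetical table created by migrating the 'jobs' namespace of
    # the 'workatastartup' index.
    cur.execute("SELECT COUNT(*) FROM workatastartup_jobs")
    print(cur.fetchone()[0])
conn.close()
```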
Conclusion
Let us know what you think of our Pinecone to Postgres migration tool! If you're migrating from Pinecone to Postgres, or just trying to evaluate Postgres, we hope it makes your life easier. Let us know if you have any questions or feedback at support@lantern.dev.