TLDR
When benchmarking Pinecone against Postgres at Lantern Cloud, we noticed that Pinecone sometimes returns the results of a nearest neighbor vector query in the wrong order: points farther from the query vector come back before closer ones. This makes the scores returned by Pinecone unreliable, so we think applications should always recalculate distances and re-sort the returned results. We reported the issue here. Others have reported a similar issue with point queries here.
What is Pinecone?
Pinecone offers vector indexing and vector search via an easy-to-use API in its client libraries. The API allows inserting and updating vectors and running nearest neighbor queries. In addition, Pinecone allows attaching JSON metadata fields to a vector and filtering the nearest vectors by those fields at search time.
Expectations on nearest neighbor query API
A nearest neighbor query in Pinecone returns the nearest vector IDs and a ‘score’, which, according to their docs, is a generalization of distance metrics. We thought it would be safe to assume that among the returned vectors, closer vectors would get better scores (we say ‘better’ because, depending on the metric, a closer vector may correspond to either a higher or a lower score). Let’s visualize the setup.
Let’s assume points A, B, C, D have the layout above, and that when querying for A, the IDs of vectors B and C are returned. B is closer to the query vector A, so it should get a better score. (Note that with approximate nearest neighbor algorithms, it is legal for D to be missing from the result set, since neighborhoods are approximate.)
Problem with Pinecone’s implementation of nearest neighbor API
When benchmarking Pinecone against Postgres at Lantern Cloud, we noticed that Pinecone sometimes returns the results of a nearest neighbor vector query in the wrong order: points farther from the query vector come back before closer ones. So, with the example above, Pinecone sometimes returns C before B and assigns C a better score than B. As a result, you cannot rely on the returned score values to know which returned point is actually closer to A.
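One way to check a result set for this problem is to recompute the true distance from the query to each returned vector and compare it with the order in which the results arrived. The sketch below is only an illustration: it assumes Euclidean distance and assumes the returned vectors are available locally. `find_misordered_pairs` and its arguments are hypothetical names, not part of any Pinecone client.

```python
import math


def find_misordered_pairs(query, returned_ids, returned_vectors):
    """Return adjacent (earlier_id, later_id) pairs where the earlier result
    is actually farther from the query than the later one.

    Assumes Euclidean distance, where a smaller distance means a closer vector.
    """
    distances = [math.dist(query, vec) for vec in returned_vectors]
    misordered = []
    for i in range(len(returned_ids) - 1):
        if distances[i] > distances[i + 1]:
            misordered.append((returned_ids[i], returned_ids[i + 1]))
    return misordered


# Toy version of the example: A is the query, B is closer to A than C.
A, B, C = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]
print(find_misordered_pairs(A, ["C", "B"], [C, B]))  # C came back before B
print(find_misordered_pairs(A, ["B", "C"], [B, C]))  # correct order
```

An empty list means the returned order agrees with the recomputed distances, at least for adjacent results.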
To reproduce the problem on Pinecone, you can use the following Jupyter notebook, which has example output saved (in case the issue is fixed by the time you read it and before we have had a chance to update this post to reflect the fix).
We found this very strange, and we did not see this kind of behavior in any other vector store. We reported the issue to Pinecone here and will do our best to escalate it further and get it addressed.
In the meantime, it is probably best not to rely on these scores and instead to re-sort the returned elements on the application side.
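A minimal sketch of application-side re-sorting might look like the following. It assumes the index uses Euclidean distance and that each returned ID can be paired with its vector (for example, from your own database). The name `rerank_by_distance` is made up for this illustration. For other metrics, such as cosine similarity, the distance function and sort direction would need to change.

```python
import math


def rerank_by_distance(query, results):
    """Re-sort (id, vector) pairs by true Euclidean distance to `query`.

    Returns a list of (id, distance) tuples, closest first.
    """
    scored = [(vec_id, math.dist(query, vec)) for vec_id, vec in results]
    return sorted(scored, key=lambda pair: pair[1])


# Results arrive in the wrong order (C before B) and are re-sorted locally.
query = [0.0, 0.0]
results = [("C", [3.0, 0.0]), ("B", [1.0, 0.0])]
print(rerank_by_distance(query, results))
```

The recomputed distance then replaces the returned score wherever the application needs a reliable ordering.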
Final thoughts and recommendation
If you are using Postgres as your main database, the relevant query will change from
SELECT description
FROM entitites
WHERE id IN ${pinecone_result_ids}
to
SELECT description,
vector <-> ${query_vector_sent_to_pinecone} as score
FROM entitites
WHERE id IN ${pinecone_result_ids}
ORDER BY 2 ASC
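If you build this query from application code, the `${...}` placeholders are best passed as query parameters rather than interpolated into the SQL string. The sketch below rests on several assumptions: a psycopg-style driver that uses `%s` placeholders, an embedding column of pgvector's `vector` type, and an `id` column whose type matches the IDs stored in Pinecone. `build_rerank_query` is a hypothetical helper, and the table name is spelled as in the queries above.

```python
def build_rerank_query(query_vector, result_ids):
    """Build a parameterized SQL string and its parameters.

    The vector is passed as a text literal such as '[0.1,0.2]' and cast to
    pgvector's `vector` type (assumed); IDs are passed as a list for `= ANY(...)`.
    """
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    sql = (
        "SELECT description, vector <-> %s::vector AS score "
        "FROM entitites "
        "WHERE id = ANY(%s) "
        "ORDER BY score ASC"
    )
    return sql, (vector_literal, list(result_ids))


sql, params = build_rerank_query([0.5, 1], ["id-1", "id-2"])
print(sql)
print(params)
```

The returned pair would then be handed to the driver, for example as `cursor.execute(sql, params)`.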
Pinecone's query API is very similar to what Postgres offers through extensions such as lantern and pgvector. So you can likely get equivalent results using Postgres alone, with a query like the one below. It is the same as the query above, but with the WHERE constraint replaced by a LIMIT, where ${num_neighbors} stands for the number of results you would otherwise request from Pinecone:
SELECT description,
vector <-> ${query_vector_sent_to_pinecone} as score
FROM entitites
ORDER BY 2 ASC
LIMIT ${num_neighbors}