
Picking the right embedding model for your vector database

January 15, 2024 · 12 min read

When working with vector databases, it’s tempting to pick the first or most popular model you can find. You don’t want to waste time experimenting when you could be making progress, right?

Today, we’re going to show you

  1. Why this isn’t the best idea, and
  2. Why you should test different models for your use case.

To do that, we’ll walk through a real-world example using two different embedding models: one trained on code-specific data, the other on one-size-fits-all generic training data. We’ll use these two models to build a code snippet search engine, and see how they perform. If you’re interested in the details, or want to run it yourself, here’s our Colab notebook with the step-by-step. It also has more examples.

How to get started

1. Set up our vector database, Lantern

We’ll be using Lantern, a faster, cheaper alternative to Pinecone, as our vector database.

2. Load our dataset

The dataset has 80+ code snippets across 9 programming languages:

  • C
  • C++
  • C#
  • Go
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby

The code snippets fall into three categories (a loading sketch follows the list):

  • File handling code (tagged FILE)
  • Networking code that makes HTTP requests (tagged NETWORK)
  • Code related to building user interfaces (tagged GUI)
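
As a quick sketch, loading a dataset like this might look as follows. The file name snippets.json and the field names are assumptions for illustration; the exact dataset and loading code are in the Colab notebook.

    import json

    # Hypothetical file and schema -- each record carries the snippet itself
    # plus its language, task, and subtask tags.
    with open('snippets.json') as f:
        snippets = json.load(f)

    print(len(snippets))   # 80+ snippets
    print(snippets[0])     # e.g. {'language': 'PY', 'task': 'NETWORK', 'subtask': 'GET', 'code': '...'}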

3. Connect to Postgres and enable Lantern, then set up our data table

Lantern runs inside Postgres, which is helpful when we’re dealing with many types of data.
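
Concretely, this step might look like the sketch below, assuming a local Postgres instance. The connection string, table name, and column layout are illustrative; the extension itself is enabled with a standard CREATE EXTENSION call, and Lantern stores vectors in plain REAL[] columns.

    import psycopg2

    # Hypothetical connection string -- point this at your own Postgres/Lantern instance.
    conn = psycopg2.connect('postgres://postgres:postgres@localhost:5432/postgres')
    cur = conn.cursor()

    # Enable Lantern in this database.
    cur.execute('CREATE EXTENSION IF NOT EXISTS lantern;')

    # One row per snippet, with one embedding column per model.
    cur.execute('''
        CREATE TABLE IF NOT EXISTS code_snippets (
            id SERIAL PRIMARY KEY,
            language TEXT,
            task TEXT,
            subtask TEXT,
            code TEXT,
            embedding_a REAL[],
            embedding_b REAL[]
        );
    ''')
    conn.commit()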

4. Load both our models

We’ll use two models. One was trained on code snippets, and supports all 9 languages we’re searching. We’ll call this “Model A”.

The second model is generic, trained on hundreds of millions of sentence pairs, with very little exposure to code. We’ll call this “Model B”.

Both of these models have ~110 million parameters.
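
With the sentence-transformers library, loading the two models might look like this. The model names below are placeholders rather than the exact models from the notebook; any code-trained encoder and any general-purpose encoder of similar size would play the same roles.

    from sentence_transformers import SentenceTransformer

    # Placeholder model names -- substitute the code-specific and generic models
    # you want to compare. Keeping them at a similar parameter count (~110M here)
    # means the comparison isolates training data rather than model capacity.
    model_a = SentenceTransformer('code-specific-model')    # trained on code
    model_b = SentenceTransformer('general-purpose-model')  # trained on sentence pairs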

5. Index and insert our data into the table
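
Continuing the sketch above, inserting and indexing might look like this. The hnsw index type and dist_cos_ops operator class are Lantern’s cosine-distance HNSW index; double-check the exact syntax against the Lantern docs for your version.

    # Embed each snippet with both models, then insert the row.
    for s in snippets:
        vec_a = model_a.encode(s['code']).tolist()
        vec_b = model_b.encode(s['code']).tolist()
        cur.execute(
            'INSERT INTO code_snippets (language, task, subtask, code, embedding_a, embedding_b) '
            'VALUES (%s, %s, %s, %s, %s, %s)',
            (s['language'], s['task'], s['subtask'], s['code'], vec_a, vec_b),
        )

    # One HNSW index per embedding column, using cosine distance.
    cur.execute('CREATE INDEX ON code_snippets USING hnsw (embedding_a dist_cos_ops);')
    cur.execute('CREATE INDEX ON code_snippets USING hnsw (embedding_b dist_cos_ops);')
    conn.commit()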

6. Write the functions that will search the dataset

The only difference between the two functions is which model’s embeddings they search: one searches Model A’s, the other Model B’s. We’ll also add a helper function to print the results in a readable format.
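
A minimal sketch of the search functions and the helper, under the same assumptions as the earlier steps. cos_dist is Lantern’s cosine-distance function, so ordering by it ascending returns the closest snippets first.

    def search(model, column, query, limit=8):
        # Embed the query with the given model, then rank rows by cosine distance.
        vec = model.encode(query).tolist()
        cur.execute(
            f'SELECT language, task, subtask, code FROM code_snippets '
            f'ORDER BY cos_dist({column}, %s::real[]) LIMIT %s',
            (vec, limit),
        )
        return cur.fetchall()

    def search_model_a(query):
        return search(model_a, 'embedding_a', query)

    def search_model_b(query):
        return search(model_b, 'embedding_b', query)

    def print_results(rows):
        # Print rank, language, task, and subtask in a readable table.
        for rank, (language, task, subtask, _code) in enumerate(rows, start=1):
            print(f'{rank:<5} {language:<10} {task:<9} {subtask}')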

Now, we’re ready to run our queries!

Now that we’re set up, let’s test our embedding models

Let’s look at an example where both models do well

We’ll start simple - ask both models to find a code snippet that performs any HTTP request. Anything that falls under the NETWORK task category. Note we’re using natural language in the request.

Spoiler: both do a great job.

    query = "send a http request"

Model A

  Rank   Language   Task      Subtask
  1      RUBY       NETWORK   GET
  2      RUBY       NETWORK   PUT
  3      RUBY       NETWORK   POST
  4      PY         NETWORK   GET
  5      PY         NETWORK   PUT
  6      PY         NETWORK   POST
  7      PHP        NETWORK   GET
  8      PHP        NETWORK   POST

Top 3 snippets

  1. Ruby code for GET request

    require 'net/http'

    uri = URI('http://get.prod9.api.ourservice.com/data')
    response = Net::HTTP.get(uri)
    puts response

  2. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

  3. Ruby code for POST request

    require 'net/http'

    uri = URI('http://post.prod9.api.ourservice.com/data')
    response = Net::HTTP.post_form(uri, 'name' => 'example', 'value' => '123')
    puts response.body

Model B

  Rank   Language   Task      Subtask
  1      GO         NETWORK   PUT
  2      RUBY       NETWORK   PUT
  3      GO         NETWORK   POST
  4      RUBY       NETWORK   GET
  5      PHP        NETWORK   PUT
  6      RUBY       NETWORK   POST
  7      JS         NETWORK   PUT
  8      PHP        NETWORK   POST

Top 3 snippets

  1. Go code for PUT request

    package main

    import (
        "bytes"
        "net/http"
    )

    func main() {
        client := &http.Client{}
        data := []byte("name=example&value=123")
        req, _ := http.NewRequest("PUT", "http://put.prod4.api.ourservice.com/data", bytes.NewBuffer(data))
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        _, _ = client.Do(req)
    }

  2. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

  3. Go code for POST request

    package main

    import (
        "bytes"
        "net/http"
    )

    func main() {
        data := []byte("name=example&value=123")
        _, _ = http.Post("http://post.prod4.api.ourservice.com/data", "application/x-www-form-urlencoded", bytes.NewBuffer(data))
    }

Model A can outperform Model B with natural language requests

Let’s get a bit more specific. Instead of any network request, let’s limit it to Python. So, any NETWORK request written in Python.

    query = "Invoke our API with Python"

Model A

  Rank   Language   Task      Subtask
  1      PY         NETWORK   POST
  2      PY         NETWORK   GET
  3      PY         NETWORK   PUT
  4      RUBY       NETWORK   GET
  5      GO         NETWORK   POST
  6      GO         NETWORK   GET
  7      JS         NETWORK   GET
  8      GO         NETWORK   PUT

Top 3 snippets

  1. Python code for POST request

    import requests

    payload = {'name': 'example', 'value': '123'}
    response = requests.post('http://post.prod8.api.ourservice.com/data', data=payload)
    print(response.text)

  2. Python code for GET request

    import requests

    response = requests.get('http://get.prod8.api.ourservice.com/data')
    print(response.text)

  3. Python code for PUT request

    import requests

    payload = {'name': 'example', 'value': '123'}
    response = requests.put('http://put.prod8.api.ourservice.com/data', data=payload)
    print(response.text)

Model B

  Rank   Language   Task      Subtask
  1      RUBY       NETWORK   GET
  2      PY         NETWORK   GET
  3      RUBY       NETWORK   PUT
  4      PY         NETWORK   PUT
  5      CSHARP     NETWORK   GET
  6      JS         NETWORK   GET
  7      PY         NETWORK   POST
  8      PHP        NETWORK   GET

Top 3 snippets

  1. Ruby code for GET request

    require 'net/http'

    uri = URI('http://get.prod9.api.ourservice.com/data')
    response = Net::HTTP.get(uri)
    puts response

  2. Python code for GET request

    import requests

    response = requests.get('http://get.prod8.api.ourservice.com/data')
    print(response.text)

  3. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

As you can see, things change. Model B now struggles to understand what “Python” actually means, while Model A goes 3/3: all of its top three results are Python code snippets.

Why? Model B, being more general, doesn’t fully understand what Python means in the context of code. It gets close - note that it favors Ruby, which is syntactically similar to Python. And it does return NETWORK requests, which is what we asked for. But Model A does much better.

...and Model A’s code snippet understanding can help avoid distractions

Let’s move beyond natural language. This time we’ll query with just code, and we’ll try to trick the models.

We’ll do this with a Rust snippet that makes a NETWORK request whose body is the string "File: UNSPECIFIED" - a red herring meant to pull the models toward the FILE category. Model A wasn’t trained on Rust, so this should be interesting. (The full results for this query are in the Colab notebook.)

    query = """async fn a() -> Result<(), reqwest::Error> {
       let c = reqwest::Client::new();
       let u = BACKEND_API_URL;
       let b = "File: UNSPECIFIED";
       let r = c.put(u).body(b).send().await?;
       println!("Status: {}", r.status());
       Ok(())
    }"""

...but Model B gives more relevant results in key situations

Let's run a query to get snippets that open a file.

    query = "open a file"

Model A

  Rank   Language   Task      Subtask
  1      RUBY       FILE      APPEND
  2      PHP        GUI       FILE
  3      PHP        GUI       URL
  4      C          FILE      APPEND
  5      PY         GUI       FILE
  6      CPP        FILE      APPEND
  7      RUBY       FILE      READ
  8      PY         FILE      READ

Top 3 snippets

  1. Ruby code for appending to a file

    File.open('data.txt', 'a') { |file| file.puts('More data') }

  2. PHP code for opening a file

    $fileButton = new GtkButton('Select File');
    $fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });

  3. PHP code for opening a URL

    $urlButton = new GtkButton('Open URL');
    $urlButton->connect_simple('clicked', function() { echo "URL button clicked\n"; });

Model B

  Rank   Language   Task      Subtask
  1      RUBY       FILE      APPEND
  2      PY         FILE      READ
  3      RUBY       FILE      READ
  4      PY         FILE      WRITE
  5      PY         FILE      APPEND
  6      RUBY       FILE      WRITE
  7      C          FILE      READ
  8      C          FILE      WRITE

Top 3 snippets

  1. Ruby code for appending to a file

    File.open('data.txt', 'a') { |file| file.puts('More data') }

  2. Python code for reading a file

    with open('data.txt', 'r') as file:
        data = file.read()
        print(data)

  3. Ruby code for reading a file

    File.open('data.txt', 'r') do |file|
      while line = file.gets
        puts line
      end
    end

Model B keeps its results within the FILE task, which is what we were looking for.

Model A’s results, on the other hand, are more scattered. A closer look at its second result shows why:

    $fileButton = new GtkButton('Select File');
    $fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });

This snippet builds a GUI button, and the button’s purpose is selecting a file to upload.

So Model A’s highly contextual understanding of code led it down the wrong path: it recognized that the snippet has to do with opening a file, but the snippet itself is GUI code, not file handling, so we got the wrong result.

Conclusion - which embedding model is right for us?

By now it should be clear that each model had its own advantages, depending on the task we gave it.

The domain-specific model isn’t always the best performing. Sometimes, the general model outperformed the code-specific model because it wasn’t distracted by context.

The best way to find out which model works for you is to experiment!

And... the easiest way for you to generate those embeddings is with Lantern Cloud. Just select a column, pick your model, and Lantern Cloud handles the rest. If you’d like help getting started, or advice on working with embeddings, we’re also here to help. Please get in touch at support@lantern.dev.

Authors

Di Qi, Cofounder

Danyil Blyschak, Software Engineer
