Picking the right embedding model for your vector database
January 15, 2024 · 12 min read
When working with vector databases, it’s tempting to pick the first or most popular model you can find. You don’t want to waste time experimenting when you could be making progress, right?
Today, we’re going to show you:
- Why this isn’t the best idea, and
- Why you should test different models for your use case.
To do that, we’ll walk through a real-world example using two different embedding models: one trained on code-specific data, the other on one-size-fits-all generic training data. We’ll use these two models to build a code snippet search engine and see how they perform. If you’re interested in the details, or want to run it yourself, our Colab notebook has the step-by-step code, along with more examples.
How to get started
1. Set up our vector database, Lantern
We’ll be using Lantern, a faster, cheaper alternative to Pinecone, as our vector database.
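If you’re running this yourself, you’ll need a Postgres instance with the Lantern extension available (Lantern Cloud provisions one for you), plus a couple of Python packages. A hypothetical Colab-style setup cell might look like this - the package names are our assumption; the notebook pins the exact dependencies:

```python
# Hypothetical setup -- package names are our assumption; the Colab
# notebook pins the exact dependencies it uses.
%pip install psycopg2-binary sentence-transformers
```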
2. Load our dataset
The dataset has 80+ code snippets across 9 programming languages (we’ll sketch the loading code after the lists below):
- C
- C++
- C#
- Go
- Java
- JavaScript
- PHP
- Python
- Ruby
The code snippets fall into three categories:
- File handling code (tagged FILE)
- Networking code that makes HTTP requests (tagged NETWORK)
- Code related to building user interfaces (tagged GUI)
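The exact file format is in the notebook; here’s a minimal sketch of loading it, assuming one JSON record per snippet with its metadata tags (the file and field names are illustrative):

```python
import json

# Each record carries the snippet plus the tags we'll rank by, e.g.
# {"language": "PY", "task": "NETWORK", "subtask": "GET", "code": "..."}
with open("code_snippets.json") as f:
    snippets = json.load(f)

print(len(snippets), "snippets")                  # 80+
print(sorted({s["language"] for s in snippets}))  # the 9 languages above
```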
3. Connect to Postgres and enable Lantern, then set up our data table
Lantern runs as a Postgres extension, which is helpful if we’re dealing with many types of data alongside our vectors.
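In Python, that might look roughly like the following. Treat it as a sketch: the connection string is a placeholder, and we store one embedding column per model so we can compare them side by side.

```python
import psycopg2

# Placeholder connection string -- point this at your Lantern-enabled Postgres.
conn = psycopg2.connect("postgresql://user:password@localhost:5432/postgres")
cur = conn.cursor()

# Enable the Lantern extension.
cur.execute("CREATE EXTENSION IF NOT EXISTS lantern;")

# One table, with a separate embedding column for each model.
cur.execute("""
    CREATE TABLE IF NOT EXISTS code_snippets (
        id SERIAL PRIMARY KEY,
        language TEXT,
        task TEXT,
        subtask TEXT,
        code TEXT,
        embedding_a REAL[],  -- Model A (code-trained)
        embedding_b REAL[]   -- Model B (generic)
    );
""")
conn.commit()
```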
4. Load both our models
We’ll use two models. One was trained on code snippets, and supports all 9 languages we’re searching. We’ll call this “Model A”.
The second model is generic, trained on hundreds of millions of sentence pairs, with very little exposure to code. We’ll call this “Model B”.
Both of these models have ~110 million parameters.
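With the sentence-transformers library, loading a model is one line each. The checkpoints below are stand-ins we picked - a code-trained encoder and a general-purpose sentence encoder - and the notebook names the actual models used in this post:

```python
from sentence_transformers import SentenceTransformer

# Stand-in checkpoints -- swap in the models from the notebook.
# Model A: trained on code search data, so it has seen all 9 languages.
model_a = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")
# Model B: general-purpose, trained on sentence pairs with little code exposure.
model_b = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
```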
5. Index and insert our data into the table
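Concretely, we embed every snippet with both models, insert the rows, then build one HNSW index per embedding column. The index syntax below follows Lantern’s docs at the time of writing (older versions used `USING hnsw`), so check your version:

```python
# Embed each snippet with both models and insert the row.
for s in snippets:
    cur.execute(
        """INSERT INTO code_snippets (language, task, subtask, code, embedding_a, embedding_b)
           VALUES (%s, %s, %s, %s, %s, %s)""",
        (s["language"], s["task"], s["subtask"], s["code"],
         model_a.encode(s["code"]).tolist(),
         model_b.encode(s["code"]).tolist()),
    )

# One cosine-distance HNSW index per embedding column.
cur.execute("CREATE INDEX ON code_snippets USING lantern_hnsw (embedding_a dist_cos_ops);")
cur.execute("CREATE INDEX ON code_snippets USING lantern_hnsw (embedding_b dist_cos_ops);")
conn.commit()
```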
6. Write the functions that will search the dataset
The only difference between the two functions is the model they use: one embeds the query with Model A and searches Model A’s column, the other does the same with Model B. We’ll also add a helper function to print the results in a readable format. A sketch follows below.
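Here’s roughly what those functions could look like. The `<=>` cosine-distance operator follows Lantern’s docs at the time of writing; this is a sketch rather than the notebook’s exact code:

```python
def search(query, model, column, limit=8):
    """Embed the query with the given model and return the nearest snippets."""
    vec = model.encode(query).tolist()
    cur.execute(
        f"""SELECT language, task, subtask, code
            FROM code_snippets
            ORDER BY {column} <=> %s::real[]  -- Lantern's cosine distance
            LIMIT %s""",
        (vec, limit),
    )
    return cur.fetchall()

def search_model_a(query):
    return search(query, model_a, "embedding_a")

def search_model_b(query):
    return search(query, model_b, "embedding_b")

def print_results(rows):
    """Print ranked results in the Rank | Language | Task | Subtask format below."""
    for rank, (language, task, subtask, _code) in enumerate(rows, start=1):
        print(f"{rank} | {language} | {task} | {subtask}")
```

Calling something like `print_results(search_model_a("send a http request"))` produces the ranked tables you’ll see below.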
Now, we’re ready to run our queries!
Now that we’re set up, let’s test our embedding models
Let’s look at an example where both models do well
We’ll start simple - ask both models to find a code snippet that performs any HTTP request, i.e. anything that falls under the NETWORK task category. Note that we’re using natural language in the query.
Spoiler: both do a great job.
query = "send a http request"
Model A
Rank | Language | Task | Subtask |
---|---|---|---|
1 | RUBY | NETWORK | GET |
2 | RUBY | NETWORK | PUT |
3 | RUBY | NETWORK | POST |
4 | PY | NETWORK | GET |
5 | PY | NETWORK | PUT |
6 | PY | NETWORK | POST |
7 | PHP | NETWORK | GET |
8 | PHP | NETWORK | POST |
Top 3 snippets:

Ruby code for GET request:

```ruby
require 'net/http'
uri = URI('http://get.prod9.api.ourservice.com/data')
response = Net::HTTP.get(uri)
puts response
```

Ruby code for PUT request:

```ruby
require 'net/http'
uri = URI('http://put.prod9.api.ourservice.com/data')
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Put.new(uri)
request.set_form_data('name' => 'example', 'value' => '123')
response = http.request(request)
puts response.body
```

Ruby code for POST request:

```ruby
require 'net/http'
uri = URI('http://post.prod9.api.ourservice.com/data')
response = Net::HTTP.post_form(uri, 'name' => 'example', 'value' => '123')
puts response.body
```
Model B
Rank | Language | Task | Subtask |
---|---|---|---|
1 | GO | NETWORK | PUT |
2 | RUBY | NETWORK | PUT |
3 | GO | NETWORK | POST |
4 | RUBY | NETWORK | GET |
5 | PHP | NETWORK | PUT |
6 | RUBY | NETWORK | POST |
7 | JS | NETWORK | PUT |
8 | PHP | NETWORK | POST |
Top 3 snippets:

Go code for PUT request:

```go
package main

import (
	"bytes"
	"net/http"
)

func main() {
	client := &http.Client{}
	data := []byte("name=example&value=123")
	req, _ := http.NewRequest("PUT", "http://put.prod4.api.ourservice.com/data", bytes.NewBuffer(data))
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	_, _ = client.Do(req)
}
```

Ruby code for PUT request:

```ruby
require 'net/http'
uri = URI('http://put.prod9.api.ourservice.com/data')
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Put.new(uri)
request.set_form_data('name' => 'example', 'value' => '123')
response = http.request(request)
puts response.body
```

Go code for POST request:

```go
package main

import (
	"bytes"
	"net/http"
)

func main() {
	data := []byte("name=example&value=123")
	_, _ = http.Post("http://post.prod4.api.ourservice.com/data", "application/x-www-form-urlencoded", bytes.NewBuffer(data))
}
```
Model A can outperform Model B with natural language requests
Let’s get a bit more specific. Instead of any network request, let’s limit it to Python - that is, any NETWORK snippet written in Python.
query = "Invoke our API with Python"
Model A
Rank | Language | Task | Subtask |
---|---|---|---|
1 | PY | NETWORK | POST |
2 | PY | NETWORK | GET |
3 | PY | NETWORK | PUT |
4 | RUBY | NETWORK | GET |
5 | GO | NETWORK | POST |
6 | GO | NETWORK | GET |
7 | JS | NETWORK | GET |
8 | GO | NETWORK | PUT |
Top 3 snippets:

Python code for POST request:

```python
import requests
payload = {'name': 'example', 'value': '123'}
response = requests.post('http://post.prod8.api.ourservice.com/data', data=payload)
print(response.text)
```

Python code for GET request:

```python
import requests
response = requests.get('http://get.prod8.api.ourservice.com/data')
print(response.text)
```

Python code for PUT request:

```python
import requests
payload = {'name': 'example', 'value': '123'}
response = requests.put('http://put.prod8.api.ourservice.com/data', data=payload)
print(response.text)
```
Model B
Rank | Language | Task | Subtask |
---|---|---|---|
1 | RUBY | NETWORK | GET |
2 | PY | NETWORK | GET |
3 | RUBY | NETWORK | PUT |
4 | PY | NETWORK | PUT |
5 | CSHARP | NETWORK | GET |
6 | JS | NETWORK | GET |
7 | PY | NETWORK | POST |
8 | PHP | NETWORK | GET |
Top 3 snippets:

Ruby code for GET request:

```ruby
require 'net/http'
uri = URI('http://get.prod9.api.ourservice.com/data')
response = Net::HTTP.get(uri)
puts response
```

Python code for GET request:

```python
import requests
response = requests.get('http://get.prod8.api.ourservice.com/data')
print(response.text)
```

Ruby code for PUT request:

```ruby
require 'net/http'
uri = URI('http://put.prod9.api.ourservice.com/data')
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Put.new(uri)
request.set_form_data('name' => 'example', 'value' => '123')
response = http.request(request)
puts response.body
```
As you can see, things change. Model B now struggles to understand what “Python” actually means, while Model A goes 3-for-3: its top 3 results are all Python code snippets.
Why? Model B, being more general, doesn’t fully understand what Python is in the context of this data. It gets close - note that it favors Ruby, which is syntactically similar to Python - and it does return NETWORK requests, which is what we asked for. But Model A does much better.
...and Model A’s code snippet understanding can help avoid distractions
Let’s move beyond natural language. This time we’ll query with just code, and we’ll try to trick the models.
We’ll do this with a Rust snippet that makes a NETWORK request whose payload is the string "File: UNSPECIFIED" - a red herring that could pull a keyword-matching model toward FILE snippets. Model A wasn’t trained on Rust, so this should be interesting. (The full results for this query are in the Colab notebook.)
query = """async fn a() -> Result<(), reqwest::Error> {
let c = reqwest::Client::new();
let u = BACKEND_API_URL;
let b = "File: UNSPECIFIED";
let r = c.put(u).body(b).send().await?;
println!("Status: {}", r.status());
Ok(())
}"""
...but Model B gives more relevant results in key situations
Let's run a query to get snippets that open a file.
query = "open a file"
Model A
Rank | Language | Task | Subtask |
---|---|---|---|
1 | RUBY | FILE | APPEND |
2 | PHP | GUI | FILE |
3 | PHP | GUI | URL |
4 | C | FILE | APPEND |
5 | PY | GUI | FILE |
6 | CPP | FILE | APPEND |
7 | RUBY | FILE | READ |
8 | PY | FILE | READ |
Top 3 snippets:

Ruby code for appending to a file:

```ruby
File.open('data.txt', 'a') { |file| file.puts('More data') }
```

PHP code for opening a file:

```php
$fileButton = new GtkButton('Select File');
$fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });
```

PHP code for opening a URL:

```php
$urlButton = new GtkButton('Open URL');
$urlButton->connect_simple('clicked', function() { echo "URL button clicked\n"; });
```
Model B
Rank | Language | Task | Subtask |
---|---|---|---|
1 | RUBY | FILE | APPEND |
2 | PY | FILE | READ |
3 | RUBY | FILE | READ |
4 | PY | FILE | WRITE |
5 | PY | FILE | APPEND |
6 | RUBY | FILE | WRITE |
7 | C | FILE | READ |
8 | C | FILE | WRITE |
Top 3 snippets:

Ruby code for appending to a file:

```ruby
File.open('data.txt', 'a') { |file| file.puts('More data') }
```

Python code for reading a file:

```python
with open('data.txt', 'r') as file:
    data = file.read()
print(data)
```

Ruby code for reading a file:

```ruby
File.open('data.txt', 'r') do |file|
  while line = file.gets
    puts line
  end
end
```
Model B keeps its results within the FILE task, which is what we were looking for.
Model A, on the other hand, is more scattered. A closer look at its second result shows why:
```php
$fileButton = new GtkButton('Select File');
$fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });
```
This snippet builds a GUI button, but the button’s purpose is selecting a file to upload.
So Model A’s highly contextual understanding of code led it down the wrong path: it recognized that the code relates to opening a file, and that deeper match gave us the wrong kind of result.
Conclusion - which embedding model is right for us?
By now it should be clear that each model had its own advantages, depending on the task we gave it.
The domain-specific model isn’t always the best performer. Sometimes the general model outperformed the code-specific model precisely because it wasn’t distracted by context.
The best way to find out which model works for you is to experiment!
And... the easiest way for you to generate those embeddings is with Lantern Cloud. Just select a column, pick your model, and Lantern Cloud handles the rest. If you’d like help getting started, or advice on working with embeddings, we’re also here to help. Please get in touch at support@lantern.dev.