
Picking the right embedding model for your vector database

January 15, 2024 · 12 min read

When working with vector databases, it’s tempting to pick the first or most popular model you can find. You don’t want to waste time experimenting when you could be making progress, right?

Today, we’re going to show you

  1. Why this isn’t the best idea, and
  2. Why you should test different models for your use case.

To do that, we’ll walk through a real-world example using two different embedding models: one trained on code-specific data, the other on one-size-fits-all generic training data. We’ll use these two models to build a code snippet search engine, and see how they perform. If you’re interested in the details, or want to run it yourself, here’s our Colab notebook with the step-by-step. It also has more examples.

How to get started

1. Set up our vector database, Lantern

We’ll be using Lantern, a faster, cheaper alternative to Pinecone, as our vector database.

2. Load our dataset

The dataset has 80+ code snippets across 9 programming languages:

  • C
  • C++
  • C#
  • Go
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby

The code snippets fall into three categories (a loading sketch follows the list):

  • File handling code (tagged FILE)
  • Networking code that makes HTTP requests (tagged NETWORK)
  • Code related to building user interfaces (tagged GUI)
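
As a quick sketch, loading a dataset like this might look as follows. The file name snippets.json and the field names are assumptions for illustration; the exact dataset and loading code are in the Colab notebook.

    import json

    # Hypothetical file and schema -- each record carries the snippet itself
    # plus its language, task, and subtask tags.
    with open('snippets.json') as f:
        snippets = json.load(f)

    print(len(snippets))   # 80+ snippets
    print(snippets[0])     # e.g. {'language': 'PY', 'task': 'NETWORK', 'subtask': 'GET', 'code': '...'}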

3. Connect to Postgres and enable Lantern, then set up our data table

Lantern runs inside Postgres, which is helpful when we’re dealing with many types of data.
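
Concretely, this step might look like the sketch below, assuming a local Postgres instance. The connection string, table name, and column layout are illustrative; the extension itself is enabled with a standard CREATE EXTENSION call, and Lantern stores vectors in plain REAL[] columns.

    import psycopg2

    # Hypothetical connection string -- point this at your own Postgres/Lantern instance.
    conn = psycopg2.connect('postgres://postgres:postgres@localhost:5432/postgres')
    cur = conn.cursor()

    # Enable Lantern in this database.
    cur.execute('CREATE EXTENSION IF NOT EXISTS lantern;')

    # One row per snippet, with one embedding column per model.
    cur.execute('''
        CREATE TABLE IF NOT EXISTS code_snippets (
            id SERIAL PRIMARY KEY,
            language TEXT,
            task TEXT,
            subtask TEXT,
            code TEXT,
            embedding_a REAL[],
            embedding_b REAL[]
        );
    ''')
    conn.commit()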

4. Load both our models

We’ll use two models. One was trained on code snippets, and supports all 9 languages we’re searching. We’ll call this “Model A”.

The second model is generic, trained on hundreds of millions of sentence pairs, with very little exposure to code. We’ll call this “Model B”.

Both of these models have ~110 million parameters.
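
With the sentence-transformers library, loading the two models might look like this. The model names below are placeholders rather than the exact models from the notebook; any code-trained encoder and any general-purpose encoder of similar size would play the same roles.

    from sentence_transformers import SentenceTransformer

    # Placeholder model names -- substitute the code-specific and generic models
    # you want to compare. Keeping them at a similar parameter count (~110M here)
    # means the comparison isolates training data rather than model capacity.
    model_a = SentenceTransformer('code-specific-model')    # trained on code
    model_b = SentenceTransformer('general-purpose-model')  # trained on sentence pairs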

5. Index and insert our data into the table
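
Continuing the sketch above, inserting and indexing might look like this. The hnsw index type and dist_cos_ops operator class are Lantern’s cosine-distance HNSW index; double-check the exact syntax against the Lantern docs for your version.

    # Embed each snippet with both models, then insert the row.
    for s in snippets:
        vec_a = model_a.encode(s['code']).tolist()
        vec_b = model_b.encode(s['code']).tolist()
        cur.execute(
            'INSERT INTO code_snippets (language, task, subtask, code, embedding_a, embedding_b) '
            'VALUES (%s, %s, %s, %s, %s, %s)',
            (s['language'], s['task'], s['subtask'], s['code'], vec_a, vec_b),
        )

    # One HNSW index per embedding column, using cosine distance.
    cur.execute('CREATE INDEX ON code_snippets USING hnsw (embedding_a dist_cos_ops);')
    cur.execute('CREATE INDEX ON code_snippets USING hnsw (embedding_b dist_cos_ops);')
    conn.commit()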

6. Write the functions that will search the dataset

The only difference between the two functions is which model’s embeddings they search: one searches Model A’s, the other Model B’s. We’ll also add a helper function to print the results in a readable format.
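
A minimal sketch of the search functions and the helper, under the same assumptions as the earlier steps. cos_dist is Lantern’s cosine-distance function, so ordering by it ascending returns the closest snippets first.

    def search(model, column, query, limit=8):
        # Embed the query with the given model, then rank rows by cosine distance.
        vec = model.encode(query).tolist()
        cur.execute(
            f'SELECT language, task, subtask, code FROM code_snippets '
            f'ORDER BY cos_dist({column}, %s::real[]) LIMIT %s',
            (vec, limit),
        )
        return cur.fetchall()

    def search_model_a(query):
        return search(model_a, 'embedding_a', query)

    def search_model_b(query):
        return search(model_b, 'embedding_b', query)

    def print_results(rows):
        # Print rank, language, task, and subtask in a readable table.
        for rank, (language, task, subtask, _code) in enumerate(rows, start=1):
            print(f'{rank:<5} {language:<10} {task:<9} {subtask}')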

Now, we’re ready to run our queries!

Now that we’re set up, let’s test our embedding models

Let’s look at an example where both models do well

We’ll start simple - ask both models to find a code snippet that performs any HTTP request. Anything that falls under the NETWORK task category. Note we’re using natural language in the request.

Spoiler: both do a great job.

    query = "send a http request"

Model A

  Rank   Language   Task      Subtask
  1      RUBY       NETWORK   GET
  2      RUBY       NETWORK   PUT
  3      RUBY       NETWORK   POST
  4      PY         NETWORK   GET
  5      PY         NETWORK   PUT
  6      PY         NETWORK   POST
  7      PHP        NETWORK   GET
  8      PHP        NETWORK   POST

Top 3 snippets

  1. Ruby code for GET request

    require 'net/http'

    uri = URI('http://get.prod9.api.ourservice.com/data')
    response = Net::HTTP.get(uri)
    puts response

  2. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

  3. Ruby code for POST request

    require 'net/http'

    uri = URI('http://post.prod9.api.ourservice.com/data')
    response = Net::HTTP.post_form(uri, 'name' => 'example', 'value' => '123')
    puts response.body

Model B

  Rank   Language   Task      Subtask
  1      GO         NETWORK   PUT
  2      RUBY       NETWORK   PUT
  3      GO         NETWORK   POST
  4      RUBY       NETWORK   GET
  5      PHP        NETWORK   PUT
  6      RUBY       NETWORK   POST
  7      JS         NETWORK   PUT
  8      PHP        NETWORK   POST

Top 3 snippets

  1. Go code for PUT request

    package main

    import (
        "bytes"
        "net/http"
    )

    func main() {
        client := &http.Client{}
        data := []byte("name=example&value=123")
        req, _ := http.NewRequest("PUT", "http://put.prod4.api.ourservice.com/data", bytes.NewBuffer(data))
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        _, _ = client.Do(req)
    }

  2. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

  3. Go code for POST request

    package main

    import (
        "bytes"
        "net/http"
    )

    func main() {
        data := []byte("name=example&value=123")
        _, _ = http.Post("http://post.prod4.api.ourservice.com/data", "application/x-www-form-urlencoded", bytes.NewBuffer(data))
    }

Model A can outperform Model B with natural language requests

Let’s get a bit more specific. Instead of any network request, let’s limit it to Python. So, any NETWORK request written in Python.

    query = "Invoke our API with Python"

Model A

  Rank   Language   Task      Subtask
  1      PY         NETWORK   POST
  2      PY         NETWORK   GET
  3      PY         NETWORK   PUT
  4      RUBY       NETWORK   GET
  5      GO         NETWORK   POST
  6      GO         NETWORK   GET
  7      JS         NETWORK   GET
  8      GO         NETWORK   PUT

Top 3 snippets

  1. Python code for POST request

    import requests

    payload = {'name': 'example', 'value': '123'}
    response = requests.post('http://post.prod8.api.ourservice.com/data', data=payload)
    print(response.text)

  2. Python code for GET request

    import requests

    response = requests.get('http://get.prod8.api.ourservice.com/data')
    print(response.text)

  3. Python code for PUT request

    import requests

    payload = {'name': 'example', 'value': '123'}
    response = requests.put('http://put.prod8.api.ourservice.com/data', data=payload)
    print(response.text)

Model B

  Rank   Language   Task      Subtask
  1      RUBY       NETWORK   GET
  2      PY         NETWORK   GET
  3      RUBY       NETWORK   PUT
  4      PY         NETWORK   PUT
  5      CSHARP     NETWORK   GET
  6      JS         NETWORK   GET
  7      PY         NETWORK   POST
  8      PHP        NETWORK   GET

Top 3 snippets

  1. Ruby code for GET request

    require 'net/http'

    uri = URI('http://get.prod9.api.ourservice.com/data')
    response = Net::HTTP.get(uri)
    puts response

  2. Python code for GET request

    import requests

    response = requests.get('http://get.prod8.api.ourservice.com/data')
    print(response.text)

  3. Ruby code for PUT request

    require 'net/http'

    uri = URI('http://put.prod9.api.ourservice.com/data')
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Put.new(uri)
    request.set_form_data('name' => 'example', 'value' => '123')
    response = http.request(request)
    puts response.body

As you can see, things change. Model B now struggles to understand what “Python” actually means, while Model A goes 3/3: all of its top three results are Python code snippets.

Why? Model B, being more general, doesn’t fully understand what Python means in the context of code. It gets close - note that it favors Ruby, which is syntactically similar to Python. And it does return NETWORK requests, which is what we asked for. But Model A does much better.

...and Model A’s code snippet understanding can help avoid distractions

Let’s move beyond natural language. This time we’ll query with just code, and we’ll try to trick the models.

We’ll do this with a Rust snippet that makes a NETWORK request whose body is the string "File: UNSPECIFIED" - a red herring meant to pull the models toward the FILE category. Model A wasn’t trained on Rust, so this should be interesting. (The full results for this query are in the Colab notebook.)

    query = """async fn a() -> Result<(), reqwest::Error> {
       let c = reqwest::Client::new();
       let u = BACKEND_API_URL;
       let b = "File: UNSPECIFIED";
       let r = c.put(u).body(b).send().await?;
       println!("Status: {}", r.status());
       Ok(())
    }"""

...but Model B gives more relevant results in key situations

Let's run a query to get snippets that open a file.

    query = "open a file"

Model A

  Rank   Language   Task      Subtask
  1      RUBY       FILE      APPEND
  2      PHP        GUI       FILE
  3      PHP        GUI       URL
  4      C          FILE      APPEND
  5      PY         GUI       FILE
  6      CPP        FILE      APPEND
  7      RUBY       FILE      READ
  8      PY         FILE      READ

Top 3 snippets

  1. Ruby code for appending to a file

    File.open('data.txt', 'a') { |file| file.puts('More data') }

  2. PHP code for opening a file

    $fileButton = new GtkButton('Select File');
    $fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });

  3. PHP code for opening a URL

    $urlButton = new GtkButton('Open URL');
    $urlButton->connect_simple('clicked', function() { echo "URL button clicked\n"; });

Model B

  Rank   Language   Task      Subtask
  1      RUBY       FILE      APPEND
  2      PY         FILE      READ
  3      RUBY       FILE      READ
  4      PY         FILE      WRITE
  5      PY         FILE      APPEND
  6      RUBY       FILE      WRITE
  7      C          FILE      READ
  8      C          FILE      WRITE

Top 3 snippets

  1. Ruby code for appending to a file

    File.open('data.txt', 'a') { |file| file.puts('More data') }

  2. Python code for reading a file

    with open('data.txt', 'r') as file:
        data = file.read()
        print(data)

  3. Ruby code for reading a file

    File.open('data.txt', 'r') do |file|
      while line = file.gets
        puts line
      end
    end

Model B keeps its results within the FILE task, which is what we were looking for.

Model A’s results, on the other hand, are more scattered. A closer look at its second result shows why:

    $fileButton = new GtkButton('Select File');
    $fileButton->connect_simple('clicked', function() { echo "File button clicked\n"; });

This snippet builds a GUI button, and the button’s purpose is selecting a file to upload.

So Model A’s highly contextual understanding of code led it down the wrong path: it recognized that the snippet has to do with opening a file, but the snippet itself is GUI code, not file handling, so we got the wrong result.

Conclusion - which embedding model is right for us?

By now it should be clear that each model had its own advantages, depending on the task we gave it.

The domain-specific model isn’t always the best performing. Sometimes, the general model outperformed the code-specific model because it wasn’t distracted by context.

The best way to find out which model works for you is to experiment!

And... the easiest way for you to generate those embeddings is with Lantern Cloud. Just select a column, pick your model, and Lantern Cloud handles the rest. If you’d like help getting started, or advice on working with embeddings, we’re also here to help. Please get in touch at support@lantern.dev.

Authors

Di Qi, Cofounder

Danyil Blyschak, Software Engineer
