Embeddings and choosing the right model
November 13, 2023 · 6 min read
An example to start
Identifying clothing from recent photos is a common task for celebrity fashion Instagram accounts. It is usually done manually by someone with a good memory for fashion; for items they don't recognize, they search through lists of brands.
You could automate the search process by storing clothing items in a database with various features, such as color, brand, and item type. But this approach can become complicated quickly: you may have to do data cleaning, feature extraction, and more.
With embedding models, the process can be simplified.
What is an embedding?
An embedding model transforms complex, high-dimensional data (such as text, images, or sounds) into a simpler, lower-dimensional numeric space. This process involves encoding the data into embeddings, where each embedding is an array of numbers that collectively capture the key attributes of the original data. The distance between embeddings represents how similar two items are to each other.
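To make the distance idea concrete, here is a minimal sketch using hand-made 3-dimensional vectors and cosine similarity. The vectors and item names are illustrative; real embedding models output hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real models produce far higher-dimensional vectors.
blue_dress = [0.9, 0.1, 0.2]
aqua_dress = [0.85, 0.15, 0.25]
red_sneaker = [0.1, 0.9, 0.8]

print(cosine_similarity(blue_dress, aqua_dress))   # high: similar items
print(cosine_similarity(blue_dress, red_sneaker))  # low: dissimilar items
```

Similar items (the two dresses) end up close together in the embedding space, while dissimilar items end up far apart.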
Embeddings enable search, classification, visualization, anomaly detection, data reconstruction, and a variety of other machine learning applications. In this article, we will focus on embeddings for search.
How embeddings help with search
In the earlier example, let's say we have a photo of Taylor Swift in a flowery blue dress.
We want to search over a dataset of clothes to find the dress that looks the most similar, and is hopefully the same one. Without embeddings, we might have to normalize our dataset to account for brands using synonyms like "blue" and "aqua" for colors, or "mid-length" and "knee-length" for lengths. We might have to extract features from pictures or descriptions, such as "white flowers". This can become a complex feature engineering task. Then, once we have the set of features, we would need some scoring function based on the number of matching features.
With embeddings, you could simply use an image embedding model over all of the product images. Then, whenever you want to search for an article of clothing like the flowery blue dress, you can just generate an embedding for that image, and find the most similar product embedding in your dataset.
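The workflow can be sketched in a few lines. `embed_image` below is a hypothetical stand-in for any image embedding model; here it returns canned vectors so the sketch is self-contained, and the file names are made up.

```python
def embed_image(image):
    # Stand-in for a real image embedding model. In practice you would run
    # the image through the model; here we return fixed toy vectors.
    canned = {
        "query_photo.jpg": [0.9, 0.1],
        "product_dress.jpg": [0.88, 0.12],
        "product_shirt.jpg": [0.2, 0.9],
    }
    return canned[image]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Embed every product image once, up front.
catalog = ["product_dress.jpg", "product_shirt.jpg"]
catalog_embeddings = {item: embed_image(item) for item in catalog}

# At query time: embed the photo, then find the nearest product embedding.
query = embed_image("query_photo.jpg")
best_match = min(catalog, key=lambda item: squared_distance(query, catalog_embeddings[item]))
print(best_match)  # product_dress.jpg
```

No feature engineering is needed: the model's embeddings carry the visual features, and search reduces to a distance comparison.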
To find the most similar product embedding in a dataset, people typically use approximate nearest neighbor (ANN) algorithms. The alternative, exact search, computes the distance to every embedding in the dataset to find the nearest one, which can be very computationally intensive for large datasets. For example, YouTube would need to search through 1 billion videos just to find the most similar one. With an ANN algorithm, it could reduce this number to several thousand videos, or even fewer.
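To illustrate how an ANN approach can cut down the number of distance computations, here is a toy locality-sensitive-hashing sketch: vectors are bucketed by the signs of a few random projections, and only the query's bucket is searched. Production systems like FAISS use far more sophisticated indexes; this is only meant to show the exact-vs-approximate trade-off.

```python
import random

random.seed(0)
DIM = 16

def random_vector():
    return [random.gauss(0, 1) for _ in range(DIM)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Exact search: one distance computation per vector in the dataset -- O(n).
def exact_search(query, vectors):
    return min(vectors, key=lambda v: squared_distance(query, v))

# Toy ANN index: hash each vector by the signs of 6 random projections,
# so the dataset splits into at most 2^6 = 64 buckets.
PLANES = [random_vector() for _ in range(6)]

def bucket(v):
    return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) > 0 for p in PLANES)

dataset = [random_vector() for _ in range(5000)]
index = {}
for v in dataset:
    index.setdefault(bucket(v), []).append(v)

query = random_vector()
candidates = index.get(bucket(query), dataset)  # search only one bucket
approx_nearest = min(candidates, key=lambda v: squared_distance(query, v))
print(len(candidates), "of", len(dataset), "vectors examined")
```

The approximate search examines only a small fraction of the dataset, at the cost of occasionally missing the true nearest neighbor, which is the essential trade-off every ANN algorithm makes.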
You can directly use a library like FAISS to do ANN search in-memory. However, vector databases like Lantern, Pinecone, or Chroma can be helpful in providing scalability, ease of integration with existing systems, and advanced features. These databases are designed to efficiently search through large volumes of embeddings, making them a more practical choice for startups and enterprises. One advantage of Lantern over the alternatives is that it is built on top of Postgres. This means that if you are already using Postgres for your database, you can manage and query embeddings without the need for additional infrastructure, making it easy to integrate embedding search into your application.
What to consider when choosing an embedding model?
To make embeddings as helpful as possible with a search application, it's important to choose a good embedding model. Here are some factors to consider.
The most obvious factor to consider is the performance of an embedding model, which has several aspects:
- Robustness Against Noise: The model should effectively filter out irrelevant information. For example, an image embedding model may focus on representing foreground objects over background clutter.
- Generalization Across Datasets: The model's ability to apply its learning to new, unseen data demonstrates its versatility beyond the training set.
- Precision in Data Representation: The model accurately captures the fundamental features and relationships of the original data. Embeddings for similar items should have a small distance between them.
- Task-Specific Effectiveness: The true measure of an embedding model's success is how well it performs in its intended application. An embedding model intended for images is unlikely to perform well for text.
The choice between general and specialized models depends on the specific requirements of the application.
- General Models like Ada2: Ada2 is a highly effective, widely-used model for various types of text data, including documents, emails, and instant messages. Its versatility and strong performance make it a popular default choice for a broad range of applications.
- Specialized Models: In certain cases, a model tailored for a specific type of data or task can offer enhanced accuracy. For instance, CodeBERT is optimized for programming languages and is particularly effective in tasks such as code search, documentation generation, and understanding code. When working with specialized data or for specific applications, these models can provide an edge in performance.
Higher-dimensional embeddings can capture more information but may also lead to increased costs.
Increased costs can come in several ways with higher-dimensional embeddings. First, the complexity of models generating these embeddings increases, which requires more computational power. Second, the operations involving these embeddings can be resource-intensive. For example, large embeddings may cause the index to exceed memory capacity, necessitating frequent data transfers between memory and disk. Additionally, the storage costs increase; storing 1 million float array embeddings occupies approximately 3.73GB for 1000 dimensions, doubling to 7.45GB for 2000 dimensions.
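The storage figures above follow from simple arithmetic: each dimension of a float32 embedding takes 4 bytes, so total storage is the number of vectors times the dimensionality times 4 bytes.

```python
def embedding_storage_gib(n_vectors, dimensions, bytes_per_float=4):
    """Storage for float32 embeddings, in GiB (1 GiB = 1024**3 bytes)."""
    return n_vectors * dimensions * bytes_per_float / (1024 ** 3)

print(round(embedding_storage_gib(1_000_000, 1000), 2))  # 3.73
print(round(embedding_storage_gib(1_000_000, 2000), 2))  # 7.45
```

Doubling the dimensionality doubles the raw storage, and index structures built on top of the embeddings add further overhead.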
How Lantern can help
If you want to represent a variety of data, or you want to experiment with different embedding models on the same type of data, setting up the necessary infrastructure for each model can be challenging.
Lantern simplifies this process. With Lantern, you can effortlessly choose the desired embedding model, specify the input data, and define the output column. Examples of embedding models that Lantern supports include BGE. Lantern then seamlessly manages the generation of embeddings.
In summary, embeddings are a great tool for working with unstructured data. By distilling complex, high-dimensional data into lower-dimensional, numerically interpretable formats, embeddings combined with ANN algorithms enable efficient search.
Check out Lantern Cloud for an easy platform to work with embeddings. If you’d like help with getting started, or advice on working with embeddings, we are happy to offer 1:1 technical support. Please get in touch at [firstname.lastname@example.org](mailto:email@example.com).