The Hierarchial Navigable Small Worlds (HNSW) Algorithm

The Hierarchial Navigable Small Worlds (HNSW) Algorithm is used to perform approximate nearest neighbor search.

Overview of the HNSW Algorithm

The HNSW algorithm is used for efficiently finding similar vectors in large datasets. It constructs a multi-layered graph, where each layer represents a subset of the dataset. The top layer contains the fewest data points, while the bottom layer contains all the datapoints.

The search process begins at the highest layer, where the algorithm selects the closest vector to the query. This vector is connected to a set of similar vectors in the next layer. The algorithm then moves down layer by layer, each time choosing the nearest vector in the connected subset of that layer. This process continues until it reaches the bottom layer, where it returns the set of similar vectors.

The HNSW algorithm is known for its high performance, particularly during the querying phase. Once the index is constructed, the process of finding approximate nearest neighbors for query points is extremely efficient.

Key Parameters: `EF Construction`, `EF Search`, and `M`

The Hierarchical Navigable Small World (HNSW) algorithm uses three parameters: EF Construction, EF Search, and M.

EF Construction: This parameter is used when the graph is being built. Think of it as the algorithm's thoroughness when it's adding a new point to the graph. A higher EF Construction means the algorithm will search more extensively for neighbors, which can make the graph more accurate. However, this also means it will take more time and resources to build the graph.
EF Search: This parameter comes into play when you're searching for the nearest neighbors of a specific point in the graph. A higher EF Search means the algorithm will look more extensively for nearest neighbors, which can improve the accuracy of the search. However, this might slow down the search process.
M: This parameter determines the maximum number of connections (or edges) a node (or data point) can have in the graph. It influences the density and navigability of the graph, affecting both the build and search phases.

In summary, EF Construction and M affect the building of the graph, with trade-offs between accuracy and resource usage. EF Search influences the search process, with a trade-off between accuracy and speed. Adjusting these parameters can help balance accuracy, speed, and resource usage according to your specific needs.

Comparison to IVF Algorithm and Re-Indexing

The Inverted File (IVF) algorithm is another method used for efficient search in high-dimensional spaces. The IVF algorithm works by partitioning the dataset into distinct clusters, and searching through a subset of the clusters to find the most similar vectors.

However, a significant limitation of the IVF algorithm is the need for re-indexing. As new data is added, the IVF algorithm often requires the index to be recreated for the algorithm to be effective. This can be a time-consuming and computationally expensive process.

In contrast, the HNSW algorihtm is considered more adaptable, and does not need to be re-indexed as often. There are some situations where it might be beneficial:

Major Changes in Data Distribution: If the data changes significantly, re-indexing can help HNSW adjust better to these changes, improving search results.
Large Amount of New Data: Adding a lot of new data at once might disturb HNSW's structure. Re-indexing can help integrate this new data smoothly.
Performance Optimization: Occasionally re-indexing as part of maintenance can keep HNSW efficient, especially in systems that are constantly updating

It's important to note that these scenarios are less common, and the need for re-indexing in HNSW arises much less frequently than in algorithms like IVF.

Implementations

The HNSW algorithm was described in a 2016 paper by Malkov and Yashunin. There are many publicly available implementations of the algorithm, including hnswlib, FAISS by Facebook Research, and USearch.

The USearch library is notable for its compactness and efficiency. USearch is written in C++ and incorporates quantization and hardware acceleration, making it significantly faster (up to 10x faster) than alternatives like FAISS. Its concise codebase, about 1000 lines, is designed for both effectiveness and performance.

Conclusion

In summary, the HNSW algorithm's ability to balance accuracy, speed, and resource usage by adjusting its key parameters makes it a versatile and robust solution for similarity search in high-dimensional spaces. Lantern is a Postgres extension that builds and uses usearch to enable powerful vector search inside a database developers already love.

Share this post