Cosine Similarity and Improved Semantic Search Algorithms

Published: 16 May 2024
on channel: Stephen Blum

Cosine similarity is simple and useful. It all comes down to how different two vectors are. For instance, you could have two arrays of three elements each.

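How different are they? That's the distance. Cosine similarity then maps this onto a range from negative one to positive one: positive one means the vectors point the same way, zero means they are unrelated, and negative one means they point in opposite directions.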
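To make the math concrete, here is a minimal sketch in plain Python using made-up three-element arrays. The function name and the sample vectors are only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction (b is just a scaled by 2): result is close to 1.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: result is 0.
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
# Opposite direction: result is close to -1.
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Note that only the direction of the vectors matters here, not their length, which is why scaling a vector leaves the score unchanged.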

If you're using Postgres with pgvector, you can run this math right in the database, against your own data. You simply pass in the query embedding and the column of embeddings you want to compare it with. So, say you've catalogued all your data and stored an embedding for each item in Postgres using pgvector.
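As a rough sketch of what that query could look like, the snippet below builds the SQL and its parameters in Python. The table `items` and its columns `id`, `name`, and `embedding` are hypothetical, and the `%(name)s` placeholders assume a driver that uses that parameter style, such as psycopg. The `<=>` operator is pgvector's cosine distance, so similarity is one minus that value.

```python
# Hypothetical schema: items(id, name, embedding) where embedding is a pgvector column.
# "<=>" is pgvector's cosine distance operator, so similarity = 1 - distance.
SIMILAR_ITEMS_SQL = """
SELECT id, name, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM items
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""

def to_vector_literal(embedding):
    """Format a list of numbers as pgvector text input, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"

# Parameters to pass alongside the SQL when executing it with your database driver.
params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "limit": 5}
```

Ordering by distance ascending puts the most similar rows first, which is the same as ordering by similarity descending.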

You also have a customer, and let's say they're looking for something specific. Here's where cosine similarity comes in. You compare the stored vectors with the embedding of the customer's search and voila, you've got a list of similar items.
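The same search can be sketched in memory, without a database. The catalog, its tiny three-dimensional vectors, and the `search` helper are all invented for illustration; real embeddings have hundreds of dimensions or more.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, catalog, top_k=3):
    """Return (name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, vector))
        for name, vector in catalog.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up embeddings standing in for what an embedding model would produce.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "hiking boots": [0.7, 0.3, 0.1],
    "coffee maker": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]  # pretend embedding of the customer's search text
results = search(query, catalog, top_k=2)
```

With these sample vectors, the two footwear items rank ahead of the coffee maker because their vectors point in nearly the same direction as the query.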

Let's talk about a fancier approach. As a customer, when you click on what you're looking for from the search results, that can be used as a vote. The more an item is clicked, the higher it's ranked.

The embedding does a good job in the first place, but by adding in user choices, the search becomes even more tailored to what people really want. As for the split between cosine similarity and user votes, there's no hard and fast rule. It could be anything: 70-30, 50-50, or 100-0.
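One way such a split could work is a weighted sum, sketched below. This is an assumption rather than the method used in the video: the `blended_score` function, the click normalization against the most-clicked item, and the sample candidates are all hypothetical.

```python
def blended_score(similarity, clicks, max_clicks, similarity_weight=0.7):
    """Mix cosine similarity with click votes; the two weights sum to 1."""
    if not 0.0 <= similarity_weight <= 1.0:
        raise ValueError("similarity_weight must be between 0 and 1")
    # Scale clicks to 0..1 so they are comparable with the similarity score.
    vote_score = clicks / max_clicks if max_clicks > 0 else 0.0
    return similarity_weight * similarity + (1.0 - similarity_weight) * vote_score

# Hypothetical candidates: (name, cosine similarity, click count).
candidates = [("item A", 0.92, 3), ("item B", 0.88, 40), ("item C", 0.60, 5)]
max_clicks = max(clicks for _, _, clicks in candidates)

# A 70-30 split between similarity and votes.
ranked = sorted(
    candidates,
    key=lambda c: blended_score(c[1], c[2], max_clicks, similarity_weight=0.7),
    reverse=True,
)
```

Here item B overtakes item A despite a slightly lower similarity, because it has far more clicks. Setting `similarity_weight=1.0` gives the 100-0 case and restores the pure similarity order.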

This mix may also change over time; adjusting the balance between the algorithm and user choices might result in a more effective search. For example, initially, you might rely only on the algorithm. As you collect user clicks, you can slowly blend user data into the equation.
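A simple way to phase in user data is a linear ramp, sketched here. The function name, the `ramp_clicks` threshold of 1000, and the `min_weight` floor of 0.5 are illustrative assumptions, not values from the video.

```python
def similarity_weight(total_clicks, ramp_clicks=1000, min_weight=0.5):
    """Weight given to cosine similarity; the remainder goes to user votes.

    Starts at 1.0 with no click data and falls linearly to min_weight
    once total_clicks reaches ramp_clicks.
    """
    if ramp_clicks <= 0:
        raise ValueError("ramp_clicks must be positive")
    progress = min(max(total_clicks / ramp_clicks, 0.0), 1.0)
    return 1.0 - progress * (1.0 - min_weight)

launch = similarity_weight(0)      # 1.0: algorithm only
midway = similarity_weight(500)    # 0.75: halfway through the ramp
mature = similarity_weight(5000)   # 0.5: capped at min_weight
```

The floor keeps the embedding score in play even after many clicks, so new items with no votes yet can still surface.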

In this example, the embedding model was text-embedding-3-small from OpenAI. But guess what, there are plenty of other text embedding models you can use. It might even be better to use one you can run locally because you'll save on vendor fees. To wrap up, don't forget that a better user experience depends directly on how fast the search and the embedding step are.

So it's crucial to choose data centers close to your end users. If you have access to a vendor that provides global distribution, make use of that.