Introduction

Lately I’ve been seeing these new vector databases all over the place. That triggered my curiosity, and I wanted to see what’s behind all the hype. So, naturally, I turned to Google and did some reading to get a grasp on the topic.

In this article, I'll walk you through some of the basic concepts of vector databases, and we will build an end-to-end artwork similarity search project using the Qdrant vector database and Streamlit.

Without further ado, let’s dive right in.

Vectors, what are they?

Before delving into vector databases, it's important to grasp the concept of a vector.

A vector is a data structure typically composed of two components: magnitude and direction. For simplicity, though, we can think of it as a list of scalar values, such as [2.7, -1.13, 0.45, 4.87].

But why are they referred to as vectors when they appear to be nothing more than lists of numbers?

Well, the term "vector" emphasizes the mathematical and computational aspects of these organized lists of numbers. As we will see, we can perform various calculations with vectors, such as computing dot products, measuring the distance between vectors to quantify similarity, and more.
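To make this concrete, here is a minimal sketch with NumPy showing the operations mentioned above on two made-up vectors (the values are purely illustrative):

```python
import numpy as np

# Two example vectors (made-up values for illustration)
a = np.array([2.7, -1.13, 0.45, 4.87])
b = np.array([2.5, -0.9, 0.6, 4.5])

# Dot product: a building block for many similarity measures
dot = np.dot(a, b)

# Cosine similarity: 1.0 means same direction, -1.0 means opposite
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: smaller means the vectors are closer
dist = np.linalg.norm(a - b)

print(f"dot={dot:.2f} cosine={cos_sim:.3f} euclidean={dist:.3f}")
```

Here the two vectors point in almost the same direction, so their cosine similarity comes out very close to 1 and their Euclidean distance is small.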

What are they useful for you might ask?

In practice, data comes in many forms, ranging from structured formats like tabular data to unstructured formats like images, text, and sound. To make this data usable by machine learning models, we usually need to extract features through what’s called feature engineering. However, some data lives in high-dimensional spaces, making it incredibly challenging to extract meaningful features by hand. That’s why we need a way to extract important features automatically.

To solve this challenge, we can use pretrained models. These models transform our data into vectors, or vector embeddings, while preserving the valuable information: we don’t just want to convert our data into a list of random numbers, we also want the vector to conserve the key features and characteristics of the data.

For example, Word2Vec, BERT, and GPT are famous embedding models trained on large text corpora. They are often used to embed text while preserving the meaning of sentences, capturing whether two sentences convey the same meaning or not.

And for images, models like VGG, ResNet, and Inception are often used; since they are trained on very large datasets, they can easily extract key features from images.
