Deep Dive into Vector Similarity Search within Postgres and pgvector

Deep Dive into Vector Similarity Search within Postgres and pgvector

When building software products, we often deal with data in formats like text, images, and audio. Usually we store these in data warehouses or storage system like AWS S3, Cloudflare R2, etc. However, the raw data is usually not enough for us to do anything other than storage and retrieval.

When we want to go further than just CRUD (Create, Read, Update, Delete), we need to find a different way to represent the data and one of the most popular way is to represent the data as vector embeddings, which are numerical representations that capture the essence of the data. These embeddings are very useful for AI-powered applications, helping with tasks like finding similar content for RAG (Retrieval-Augmented Generation), recommending items, spotting anomalies, and recognizing images. The key to making these applications work well is to store, index, and search these high-dimensional vectors efficiently. That’s where pgvector comes in.

pgvector is an open-source extension for PostgreSQL that adds the ability to manage vector embeddings directly within the database. It’s not just a storage solution; it’s a full vector search engine that’s designed for performance and ease of use. By integrating vector search into PostgreSQL, pgvector makes it easier for everyone to use this powerful technology.

So, why choose pgvector? For starters, it keeps your vector data close to your other data, which cuts down on the time it takes to move data around. This is especially important for applications where speed matters. It also uses PostgreSQL’s strong transactional features, ensuring your data stays consistent and reliable. You can use all of PostgreSQL’s powerful tools, like joining tables, recovering data to any point in time, and setting up secure and scalable systems. Plus, it simplifies your setup by not needing extra databases or complex systems, which saves time and money.

This blog post is for developers building applications that need to search for similar vectors, data scientists working with embeddings, and anyone managing data systems who wants a straightforward way to handle vector data. We’ll cover everything from the basics of vector embeddings and distance metrics, to installing pgvector, working with vector data, and optimizing performance for your applications.

Core Concepts

Before we dive into pgvector, let’s first understand the core concepts of vector embeddings, distance metrics, and a quick look into different data types that pgvector supports.

What are Vector Embeddings?

Vector embeddings are numerical representations of data. Each number in the vector stands for a feature or characteristic of the data. These vectors are created by machine learning models that transform complex data into a simpler form that captures the data’s meaning. The idea is that things that are similar in meaning will have vectors that are close together, while things that are different will have vectors that are far apart. Here’s a simple example vector that represents a piece of text:

[0.1, 0.3, 0.5, 0.7, 0.2]

There are a variety of ways to create these embeddings, including:

  • Word embeddings like Word2Vec, GloVe, and fastText turn words into vectors based on how they appear in text.
  • Sentence embeddings, like those from Sentence-BERT, do the same for whole sentences.
  • Large Language Models (LLMs) like GPT and BERT can generate rich embeddings that capture complex semantic relationships in text.
  • Image embeddings use convolutional neural networks to turn images into vectors, capturing their visual features.
  • Audio embeddings can be created from spectrograms, which show the frequency components of sound over time.

These embeddings are used for tasks like semantic search, where you find content that’s related to a query, not just matching keywords. They’re also used in recommendation systems, content filtering, anomaly detection, finding similar images or videos, and powering LLM-based applications that need to efficiently search through large knowledge bases.

Vector Distance Metrics Guides

Once we have decided on a way to represent our data as vectors, the next step is to measure how similar two vectors are. This is done using distance metrics. Here are some common ones:

  • L2 Distance (Euclidean): This measures the straight-line distance between two points. It’s sensitive to how big the vectors are and is good for general similarity checks.

    sqrt(sum((xi - yi)^2))
    
  • Inner Product: This measures how much one vector projects onto another. It’s faster to compute than L2 distance but also sensitive to vector size.

    sum(xi * yi)
    
  • Cosine Distance: This measures the angle between two vectors, ignoring their size. It’s good for comparing the direction of vectors, often used with text data.

    1 - (sum(xi * yi) / (sqrt(sum(xi^2)) * sqrt(sum(yi^2))))
    
  • L1 Distance (Manhattan/Taxicab): This is the sum of the absolute differences between vector components. It’s less sensitive to outliers than L2 distance.

    sum(|xi - yi|)
    
  • Hamming Distance: This counts the number of different bits between two binary vectors. It’s simple and fast to compute.

    sum(xi != yi)
    
  • Jaccard Distance: This measures how different two sets are, represented as binary vectors. It’s useful for comparing sets of features.

    1 - (sum(xi AND yi) / sum(xi OR yi))
    

Data Types

pgvector supports different data types for storing vectors:

  • vector: This stores single-precision floating-point numbers (4 bytes per element). It’s good for most embeddings where high precision isn’t critical. It supports up to 16,000 dimensions.

  • halfvec: This stores half-precision floating-point numbers (2 bytes per element). It uses less memory but at the cost of some precision. It’s good for large datasets where memory is a concern.

  • bit: This stores binary vectors (bits). It’s suitable for binary data like image hashes or boolean similarity searches.

  • sparsevec: This stores sparse vectors, where most elements are zero. It’s efficient for high-dimensional vectors with many zero values, like text embeddings with large vocabularies.

Installation and Setup

Now that we have a basic understanding of vector embeddings, distance metrics, and data types, let’s dive into how to install and setup pgvector.

Installation Methods

There are several ways to install pgvector, depending on your operating system.

Linux and Mac

  • Compiling from Source: To compile pgvector from source, follow these steps:

    1. Clone the repository with git clone --branch v<version> https://github.com/pgvector/pgvector.git, replacing <version> with the desired version tag.
    2. Navigate to the directory with cd pgvector.
    3. Compile the extension with make.
    4. Install it with make install. You might need to use sudo for this step.
    5. If you have multiple PostgreSQL installations, specify the path to pg_config with export PG_CONFIG=/path/to/your/pg_config.
    6. If you run into compilation issues, clean the build directory with make clean before trying again.
    7. When using sudo for make install, use sudo --preserve-env=PG_CONFIG make install to pass the PG_CONFIG variable correctly.

    Common pg_config paths on Mac include /Library/PostgreSQL/<version>/bin/pg_config for EDB installer, /opt/homebrew/opt/postgresql@<version>/bin/pg_config for Homebrew (arm64), and /usr/local/opt/postgresql@<version>/bin/pg_config for Homebrew (x86-64).

    If you encounter issues like missing headers, install the Postgres development files with sudo apt install postgresql-server-dev-<version> on Debian/Ubuntu, or the equivalent for other distributions. If you get a warning about a missing sysroot directory on Mac, reinstall Xcode Command Line Tools. For portability, compile with make OPTFLAGS="" to avoid Illegal instruction errors on different machines.

  • Docker: To use pgvector with Docker, pull the image with docker pull pgvector/pgvector:pg<version>, replacing <version> with your PostgreSQL version (e.g., pg16). Run the Docker image as you would a regular PostgreSQL container. You can also build the Docker image manually from source by cloning the pgvector repository, navigating to the directory, and running docker build --pull --build-arg PG_MAJOR=<version> -t <your_username>/pgvector ., replacing <version> with the PostgreSQL major version and <your_username> with your Docker Hub username.

  • Homebrew: Install pgvector with brew install pgvector. Note that this is only available for specific PostgreSQL formulas like postgresql@17 and postgresql@14.

  • PGXN: Install pgvector with pgxn install vector. This requires the pgxn command-line tool to be installed.

  • APT (Debian/Ubuntu): Set up the PostgreSQL APT repository and install pgvector with sudo apt install postgresql-<version>-pgvector.

  • Yum (RedHat/CentOS): Set up the PostgreSQL Yum repository and install pgvector with sudo yum install pgvector_<version> or sudo dnf install pgvector_<version>.

  • pkg (FreeBSD): Install pgvector with pkg install postgresql<version>-pgvector or cd /usr/ports/databases/pgvector && make install.

  • conda-forge: Install pgvector with conda install -c conda-forge pgvector. This is community-maintained by @mmcauliffe.

  • Postgres.app: pgvector comes preinstalled with Postgres.app versions 15 and above.

  • Hosted Providers: pgvector is available on many hosted PostgreSQL providers. Check with your provider for specific instructions.

Windows

  • Using nmake: To install pgvector on Windows using nmake, follow these steps:

    1. Ensure that C++ support is installed in Visual Studio.
    2. Open the Visual Studio command prompt (e.g., “Developer Command Prompt for VS 2022”).
    3. Set the PGROOT environment variable to the PostgreSQL installation directory with set "PGROOT=C:\Program Files\PostgreSQL\<version>", adjusting the path accordingly.
    4. Clone the pgvector repository with git clone --branch v<version> https://github.com/pgvector/pgvector.git.
    5. Navigate to the pgvector directory with cd pgvector.
    6. Compile the extension with nmake /F Makefile.win.
    7. Install it with nmake /F Makefile.win install.

    Note that PostgreSQL 17 is not supported with MSVC yet due to an upstream issue.

  • Docker: Use the official pgvector Docker image for Windows.

  • conda-forge: Install pgvector with conda install -c conda-forge pgvector.

Enabling the Extension

To enable pgvector in your PostgreSQL database, connect to the database using psql or your preferred client and run the command CREATE EXTENSION vector;. This only needs to be done once per database.

Creating Vector Columns

To create columns for storing vectors, you can use fixed or variable dimensions:

  • Fixed Dimensions: Use the vector(n) data type, where n is the number of dimensions. For example, CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(128));. This is used when all vectors in the column have the same number of dimensions.

  • Variable Dimensions: Use the vector data type without specifying the number of dimensions. For example, CREATE TABLE embeddings (model_id bigint, item_id bigint, embedding vector, PRIMARY KEY (model_id, item_id));. This is used when vectors in the column may have different numbers of dimensions. Note that you can only create indexes on rows with the same number of dimensions using expression and partial indexing.

For example, say you want to use OpenAI’s text-embedding-3-small model to embed text data. The model returns 1536-dimensional vectors, so you would create a table with a vector(1536) column:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(1536));

Or if you were to use their text-embedding-3-large model, which returns 3072-dimensional vectors, you would create a table with a vector(3072) column:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3072));

Working with Vectors

Inserting Vectors

To insert vectors into your table, you can use the INSERT command or bulk load with COPY.

  • Using INSERT: Insert vector values as string representations of arrays, enclosed in square brackets []. For example, INSERT INTO items (embedding) VALUES ('[1, 2, 3]'), ('[4, 5, 6]');. The string representation is automatically converted to the vector type. This method may be less efficient for large numbers of inserts.

  • Bulk Loading with COPY: Use the COPY command for efficient bulk loading of vector data from a file or standard input. Use the FORMAT BINARY option for optimal performance. For example, COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);.

Here’s a Python code example using the psycopg2 library to load vector data using COPY with binary format:

import psycopg2
from psycopg2 import sql
import numpy as np

# Connect to the database
conn = psycopg2.connect(
    dbname="your_database",
    user="your_username",
    password="your_password",
    host="your_host",
    port="your_port"
)

# Prepare the data
vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a cursor
cur = conn.cursor()

# Create a temporary file-like object
from io import BytesIO
f = BytesIO()

# Write the data to the file-like object
for vector in vectors:
    f.write(vector.tobytes())

# Seek to the beginning of the file-like object
f.seek(0)

# Execute the COPY command
cur.copy_from(f, 'items', columns=('embedding',), format='binary')

# Commit the transaction
conn.commit()

# Close the cursor and connection
cur.close()
conn.close()

Updating and Deleting Vectors

To modify existing vectors, use the UPDATE command. For example, UPDATE items SET embedding = '[1, 2, 3]' WHERE id = 1;. To remove vectors, use the DELETE command. For example, DELETE FROM items WHERE id = 1;. Note that updating large numbers of vectors may impact performance.

Querying Vectors

To find the nearest neighbors to a given vector, use the ORDER BY clause with a distance operator. For example, SELECT * FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 5;.

Here are the available distance operators:

OperatorDistance MetricNotes
<->L2 (Euclidean)Most common for general similarity search
<#>Inner ProductReturns negative inner product
<=>CosineGood for comparing vector directions
<+>L1 (Manhattan)Added in pgvector 0.7.0
<~>HammingFor binary vectors, added in 0.7.0
<%>JaccardFor binary vectors, added in 0.7.0

Note: For more details on distance operators and their mathematical foundations, refer to the Vector Distance Metrics Guides section.

Use the LIMIT clause to specify the number of nearest neighbors to return. Ensure that you have created appropriate indexes for efficient nearest neighbor search.

To find neighbors based on another row’s embedding, use a subquery to select the reference vector. For example, SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;. This is useful for finding similar items based on the embedding of a specific item.

To filter results based on their distance to a reference vector, use a WHERE clause. For example, SELECT * FROM items WHERE embedding <-> '[3, 1, 2]' < 5;. Note that filtering by distance can significantly impact performance if not used with appropriate indexes.

Calculating Distances

To calculate the distance between each vector in the table and a reference vector, use distance operators in the SELECT clause. For example, SELECT embedding <-> '[3, 1, 2]' AS distance FROM items;. Use the AS keyword to give an alias to the calculated distance.

To calculate cosine similarity, use the formula 1 - cosine_distance. For example, SELECT 1 - (embedding <=> '[3, 1, 2]') AS cosine_similarity FROM items;. Cosine similarity ranges from -1 to 1, where 1 indicates perfect similarity and -1 indicates perfect dissimilarity.

To calculate the inner product, note that the <#> operator returns the negative inner product, so multiply by -1 to get the actual inner product. For example, SELECT (embedding <#> '[3, 1, 2]') * -1 AS inner_product FROM items;.

Aggregating Vectors

To calculate the average of vectors in a column, use the AVG() aggregate function. For example, SELECT AVG(embedding) FROM items;. This is useful for computing the centroid of a cluster of vectors.

To calculate the average of vectors within each group, use the GROUP BY clause. For example, SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;. This is useful for calculating the average embedding for each category or group.

Indexing for Performance

There are two types of search: exact and approximate.

  • Exact Search: This returns the true nearest neighbors for a given query. It typically involves scanning the entire dataset or using a non-approximate index like a B-tree on a distance calculation, which can be very slow for large datasets. It provides 100% recall (all relevant results are returned) but performance degrades significantly as the dataset size increases. It’s suitable for small datasets or when perfect recall is critical.

  • Approximate Search: This returns a set of neighbors that are likely to be close to the query vector, but not necessarily the absolute nearest neighbors. It uses specialized data structures like HNSW or IVFFlat indexes that allow for fast but approximate similarity search. It trades some recall for speed (some relevant results may be missed) but provides much better performance for large datasets. It’s ideal for large datasets where high performance is more important than perfect recall.

HNSW Index

The HNSW (Hierarchical Navigable Small World) index is a type of approximate index that creates a multi-layer graph structure. Each layer represents a progressively coarser approximation of the vector space. Nodes in the graph are connected based on their proximity to each other, and search starts at the top layer, navigating to the nearest neighbors in lower layers to quickly narrow down the search space. The graph is designed to be a “small world,” meaning that any two nodes can be connected through a relatively small number of hops, and it’s constructed dynamically as new vectors are added to the index.

To create an HNSW index, use the syntax CREATE INDEX ON items USING hnsw (embedding <operator_class>);, where <operator_class> is one of vector_l2_ops for L2 distance, vector_ip_ops for inner product, vector_cosine_ops for cosine distance, vector_l1_ops for L1 distance, bit_hamming_ops for Hamming distance, bit_jaccard_ops for Jaccard distance, halfvec_l2_ops for halfvec and corresponding operators for other distance metrics, or sparsevec_l2_ops for sparsevec and corresponding operators for other distance metrics. For example, CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);. Note that you need to create a separate HNSW index for each distance function you want to use.

HNSW index options include m (max connections), which specifies the maximum number of connections per layer in the graph (default is 16), and ef_construction (construction list size), which specifies the size of the dynamic candidate list used during graph construction (default is 64). Higher values of m can lead to better recall but also increase the index size and build time, while higher values of ef_construction can lead to better recall but also increase the index build time and insert speed. The syntax for specifying these options is CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);.

Query options for HNSW indexes include hnsw.ef_search (search list size), which specifies the size of the dynamic candidate list used during search (default is 40). Higher values of ef_search can lead to better recall but also increase query time. The syntax for setting this option is SET hnsw.ef_search = 100;. You can also use SET LOCAL hnsw.ef_search = 100; inside a transaction to set the value for a single query.

To speed up index builds, increase maintenance_work_mem so that the graph fits into memory. However, be careful not to set it so high that it exhausts the memory on the server. Starting with pgvector 0.6.0, you can also speed up index creation by increasing the number of parallel workers using max_parallel_maintenance_workers (default is 2) and max_parallel_workers (default is 8). It’s generally faster to create an index after loading your initial data.

To monitor the progress of index creation, use the pg_stat_progress_create_index view. The phases of index creation include initializing and loading tuples.

IVFFlat Index

The IVFFlat (Inverted File with Flat List) index divides vectors into lists based on clustering. Vectors are clustered into a number of lists using k-means clustering, and an inverted index is created that maps each list to the set of vectors that belong to that list. During search, the query vector is compared to the centroids of the lists, and the lists that are closest to the query vector are searched.

To create an IVFFlat index, use the syntax CREATE INDEX ON items USING ivfflat (embedding <operator_class>) WITH (lists = <number>);, where <operator_class> is one of vector_l2_ops for L2 distance, vector_ip_ops for inner product, vector_cosine_ops for cosine distance, bit_hamming_ops for Hamming distance, or halfvec_l2_ops for halfvec and corresponding operators for other distance metrics. The lists option specifies the number of lists to create (a good starting point is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows). For example, CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);.

Query options for IVFFlat indexes include ivfflat.probes (probes), which specifies the number of lists to search (default is 1). Higher values of probes can lead to better recall but also increase query time. Setting probes to the number of lists will result in exact nearest neighbor search (the planner will not use the index). The syntax for setting this option is SET ivfflat.probes = 10;. You can also use SET LOCAL ivfflat.probes = 10; inside a transaction to set the value for a single query.

To speed up index creation, use parallel workers (max_parallel_maintenance_workers). To monitor the progress of index creation, use the pg_stat_progress_create_index view. The phases of index creation include initializing, performing k-means, assigning tuples, and loading tuples.

Filtering with Indexes

To filter results efficiently, you can create a standard index on the column you want to filter by. For example, CREATE INDEX ON items (category_id);. You can use B-tree (default), hash, GiST, SP-GiST, GIN, or BRIN indexes. For filtering by multiple columns, use multicolumn indexes. This is suitable for conditions that match a low percentage of rows.

When using approximate indexes with filtering, the filter is applied after the index scan. To get more results before the filter is applied, increase hnsw.ef_search or ivfflat.probes.

For filtering by a few distinct values, use partial indexing. For example, CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WHERE (category_id = 123);.

For filtering by a specific column, consider partitioning the table based on that column. For example, CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);. This is suitable for conditions that match a high percentage of rows.

By understanding and using these concepts and techniques, you can effectively use pgvector to manage and search vector embeddings within PostgreSQL, building powerful and efficient AI-powered applications.

Partitioning

Partitioning is another technique that can improve query performance when filtering by specific columns. By dividing your table into smaller, more manageable pieces based on the values of a column, you can speed up queries that target a subset of the data.

To set up partitioning, you can use the PARTITION BY clause when creating your table. For example, if you want to partition the items table by category_id, you would use:

CREATE TABLE items (
    id bigserial PRIMARY KEY,
    embedding vector(128),
    category_id int
) PARTITION BY LIST(category_id);

After creating the partitioned table, you need to create the actual partitions. For example, to create partitions for category_id values 1, 2, and 3, you would use:

CREATE TABLE items_category_1 PARTITION OF items FOR VALUES IN (1);
CREATE TABLE items_category_2 PARTITION OF items FOR VALUES IN (2);
CREATE TABLE items_category_3 PARTITION OF items FOR VALUES IN (3);

When you insert data into the items table, PostgreSQL will automatically route the data to the appropriate partition based on the category_id value. This means that queries filtering by category_id will only need to scan the relevant partition, which can significantly improve performance.

For example, to find the nearest neighbors to a given vector but only within category_id 1, you would use:

SELECT * FROM items WHERE category_id = 1 ORDER BY embedding <-> '[3, 1, 2]' LIMIT 5;

This query will only scan the items_category_1 partition, making it much faster than scanning the entire table.

Performance Optimization Strategies

To build scalable and efficient vector search applications with pgvector, you need to consider several performance optimization strategies.

Choosing the Right Index Type

The choice between HNSW and IVFFlat indexes depends on your specific use case. HNSW indexes are generally faster for search but slower to build and update, while IVFFlat indexes are faster to build and update but slower for search. If you need to frequently insert or update vectors, IVFFlat might be a better choice. If search performance is more critical, HNSW might be better.

Tuning Index Parameters

Both HNSW and IVFFlat indexes have parameters that can be tuned to balance recall and performance. For HNSW indexes, you can adjust m and ef_construction when creating the index, and ef_search when querying. For IVFFlat indexes, you can adjust lists when creating the index and probes when querying.

To find the optimal settings, you may need to experiment with different values and measure their impact on your specific workload. For example, you might start with the default values and then increase ef_search or probes to see if it improves recall at the cost of query time.

Using Appropriate Data Types

Choosing the right data type for your vectors can also impact performance. If you’re working with high-dimensional vectors and memory usage is a concern, consider using halfvec instead of vector. If you’re working with binary vectors, use the bit data type. For sparse vectors, use sparsevec to save space and improve query performance.

Optimizing Queries

To optimize your queries, consider the following strategies:

  • Use Indexes: Always use indexes for vector similarity search. Without an index, PostgreSQL will have to scan the entire table, which can be very slow for large datasets.

  • Limit Results: Use the LIMIT clause to return only the number of results you need. This can significantly reduce the amount of data that needs to be processed and returned.

  • Filter Early: If you’re filtering by a column other than the vector column, create an index on that column and use it to filter early in the query. This can reduce the number of rows that need to be processed for the vector similarity search.

  • Use Partial Indexes: If you’re frequently filtering by a specific value or range of values, consider using partial indexes to index only the relevant rows.

  • Use Partitioning: If you’re frequently filtering by a specific column, consider partitioning your table based on that column to speed up queries that target a subset of the data.

Monitoring and Tuning PostgreSQL

To ensure optimal performance, you should also monitor and tune PostgreSQL itself. Key areas to focus on include:

  • Memory Settings: Adjust shared_buffers, work_mem, and maintenance_work_mem to ensure PostgreSQL has enough memory to operate efficiently.

  • Parallelism: Adjust max_parallel_workers, max_parallel_workers_per_gather, and max_parallel_maintenance_workers to take advantage of multiple CPU cores.

  • Disk I/O: Ensure that your disk I/O is fast enough to handle your workload. Consider using SSDs and optimizing your storage configuration.

  • Connection Pooling: Use connection pooling to reduce the overhead of creating and closing database connections.

  • Query Planning: Use EXPLAIN and EXPLAIN ANALYZE to understand how PostgreSQL is executing your queries and identify potential bottlenecks.

By carefully considering these performance optimization strategies, you can build scalable and efficient vector search applications with pgvector.

Conclusion

As you can see, pgvector is a powerful tool that brings vector similarity search directly into PostgreSQL, making it easier to build applications that can take advantage of vector embeddings, similarity search, and AI for a wide range of use cases such as Retrieval-Augmented Generation (RAG), recommendation systems, anomaly detection, and image recognition.

Of course, there are many dedicated vector storage and search engines out there, but pgvector is a great option if you prefer to keep your embeddings in your primary database and take advantage of colocation with your other data.

I highly recommend giving pgvector a try and see how whether you can use it to improve your applications.