Have you ever wondered how to build a search engine that understands images the way humans do? With modern AI tools, you can create a powerful image-to-image search tool using LangChain and LLaVA that finds images similar to a query image by comparing AI-generated natural language descriptions. This system combines modern vision models with vector search technology to enable semantic image retrieval, going far beyond simple keyword matching.
In this comprehensive guide, I’ll walk you through how to build this semantic image search engine in a Jupyter notebook, from setting up the vector database to implementing natural language image captioning. This project shares many similarities with our previous text-to-image tool, with a few minor differences.
System Architecture
The solution leverages three main components:
- LangChain: Provides the framework for connecting various AI components
- Milvus: A powerful vector database for storing and searching image embeddings
- Ollama: Handles the image captioning and embedding generation
Prerequisite Knowledge
- Langchain -> https://python.langchain.com/docs/tutorials/
- Milvus -> https://milvus.io/docs/quickstart.md
- Ollama -> https://ollama.com/
- Python3/Jupyter Notebooks -> https://docs.python.org/3/tutorial/index.html
- Vector Embeddings -> https://milvus.io/intro
- Multimodal Models -> https://ollama.com/library/llava
TLDR
This Jupyter notebook provides a hands-on environment for building and experimenting with our image search tool.
Setup Phase
- Install required libraries (LangChain, Ollama, Milvus)
- Initialize Ollama embeddings model (nomic-embed-text)
- Set up local Milvus Lite vector database
Image Processing Phase
- Create functions to:
- Calculate unique file hashes for images
- Generate image captions using the LLaVA model
- Store image metadata and file paths
Database Population
- Scan a directory for image files
- Process images in batches of 50
- Generate captions for each image using LLaVA
- Create document objects with captions and metadata
- Store embeddings in a Milvus vector database
Search Implementation
- Find a query image
- Generate a caption for the query image using LLaVA
- Convert the caption to a vector embedding
- Perform similarity searches in the vector database
- Display results with similarity scores and images
Ollama setup
Before starting this image-to-image search project, you need to set up Ollama on your local system. First, download and install Ollama from the official website, which is available for macOS, Windows, and Linux. After installation, you’ll need to pull two essential models: the LLaVA model for image captioning and the nomic-embed-text model for generating embeddings.
ollama pull llava
ollama pull nomic-embed-text
The LLaVA model is specifically designed for vision-related tasks and comes in different sizes. Additionally, you’ll need to ensure Ollama is running as a server, which typically operates on port 11434 by default.
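Before moving on, it’s worth confirming that the server is up and that both models were pulled. The snippet below is a minimal sanity check that assumes the default host and port; it queries Ollama’s /api/tags endpoint, which lists the models available locally.
import json
import urllib.request

# Query the local Ollama server for its installed models (default port 11434).
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    tags = json.load(resp)

print([m["name"] for m in tags["models"]])
# Expect to see entries such as 'llava:latest' and 'nomic-embed-text:latest'.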
Downloading an Image Dataset
We need an image dataset to serve as the foundation for our image-to-image search engine. Start by downloading the Flickr30k Image Dataset from Kaggle. After downloading and unzipping it, create a smaller subset for easier experimentation: make a new directory (folder) named flickr30k_images_small in the same directory as flickr30k_images.
Once the new folder is created, randomly select 2000 images from the dataset and add them to flickr30k_images_small. On macOS/Linux, you can use the command line to randomly select and copy a subset of images. Make sure to run this command from inside the flickr30k_images directory:
find . -type f | shuf -n 2000 | xargs -I {} cp {} ../flickr30k_images_small/
- find . -type f: searches for all files (-type f) in the current directory (.) and its subdirectories.
- |: a pipe, which takes the output of the find command and passes it as input to the next command.
- shuf -n 2000: the shuf command shuffles its input randomly; the -n 2000 option specifies that you want to select 2000 random lines from the input.
- |: another pipe, which passes the shuffled list of 2000 random files to the next command.
- xargs -I {} cp {} ../flickr30k_images_small/: the xargs command builds and executes command lines from its input. The -I {} option tells xargs to replace {} with each input line (i.e., each file name), so cp copies each selected file to the specified directory (../flickr30k_images_small/).
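If you would rather stay inside the notebook (or you’re on Windows, where shuf isn’t available by default), the following sketch does the same thing in Python. It assumes flickr30k_images and flickr30k_images_small sit next to your notebook; adjust the paths if your layout differs.
import random
import shutil
from pathlib import Path

src = Path("flickr30k_images")        # full dataset (assumed location)
dst = Path("flickr30k_images_small")  # subset directory (assumed location)
dst.mkdir(exist_ok=True)

# Gather every file under the dataset directory and copy a random sample of 2000.
all_files = [p for p in src.rglob("*") if p.is_file()]
for p in random.sample(all_files, k=2000):
    shutil.copy(p, dst / p.name)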
Setting up the environment with required libraries
First, we need to install the required packages.
%pip install -qU langchain langchain-ollama langchain-community langchain_milvus ollama
This command installs the required packages (the -U flag upgrades any that are already present, and -q keeps the output quiet), ensuring that the rest of the notebook’s code executes correctly. It’s standard practice to place installation commands like this at the beginning of a notebook so that all necessary dependencies are available.
Initialize Ollama Embeddings Model
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
In this snippet, we’re initializing the embedding model with LangChain’s Ollama integration. By employing the “nomic-embed-text” model, we can transform textual data into high-dimensional vectors. These vectors effectively encapsulate the semantic content of the text, enabling precise similarity assessments in our subsequent search operations.
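As a quick check, you can embed a short piece of text and inspect what comes back; nomic-embed-text should return a 768-dimensional vector. The caption below is just an arbitrary example string.
# Embed a sample caption and inspect the resulting vector.
vector = embeddings.embed_query("A surfer riding a large wave at sunset.")
print(len(vector))  # expected: 768 for nomic-embed-text
print(vector[:5])   # first few components of the embedding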
Initialize Local Vector Database
from langchain_milvus import Milvus
# Local Milvus Lite Instance
URI="./image-to-image.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
In this segment, we set up our vector database using Milvus Lite, a streamlined version of the Milvus vector database that operates locally. The database is set to store vectors at the specified URI path (“./image-to-image.db”) on your local machine. The embedding_function parameter links to our previously initialized Ollama embeddings model, while auto_id=True ensures that each vector entry is assigned a unique identifier automatically. This local database configuration provides efficient storage and retrieval of our vector embeddings without relying on external services or cloud infrastructure.
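Before captioning thousands of images, you can optionally confirm the store works end to end. This minimal sketch inserts one hand-written caption as a Document and immediately searches for it (the caption and metadata values are made up for illustration). If you run it, consider deleting the image-to-image.db file afterwards so the test entry doesn’t appear in your real search results.
from langchain.schema import Document

# Insert a single test document and query it back.
test_doc = Document(
    page_content="A golden retriever running across a grassy field.",
    metadata={"image_path": "test.jpg", "file_hash": "test"},
)
vector_store.add_documents([test_doc])

hits = vector_store.similarity_search("a dog playing outside on the grass", k=1)
print(hits[0].page_content)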
Import Necessary Libraries
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm
File Hash Function
def get_file_hash(file_path):
"""Calculate SHA-256 hash of file"""
sha256_hash = hashlib.sha256()
with open(file_path, "rb") as f:
for byte_block in iter(lambda: f.read(4096), b""):
sha256_hash.update(byte_block)
return sha256_hash.hexdigest()
This function, get_file_hash, is designed to calculate the SHA-256 hash of a file. Here’s a breakdown of what it does:
- The function takes a file_path as an argument, which is the path to the file you want to hash.
- It initializes a SHA-256 hash object using hashlib.sha256().
- The file is opened in binary read mode (“rb”) using a context manager (with statement).
- The function then reads the file in chunks of 4096 bytes (4KB) at a time. This is done to efficiently handle large files without loading the entire file into memory at once.
- For each chunk (byte_block) read from the file, the update method of the hash object is called to update the hash with the new data.
- After all chunks have been processed, the hexdigest() method is called on the hash object to get the final hash value as a hexadecimal string.
- This hexadecimal string (the file’s SHA-256 hash) is then returned.
The purpose of this function in the context of the larger script is to generate a unique identifier for each image file. This hash can be used to detect duplicate files or to verify file integrity. In the caption_and_store_images function, this hash is stored in the metadata of each document, which could be useful for tracking or referencing the original files in the vector store.
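As a quick illustration, calling the function on any file on disk prints its digest (the path below is a placeholder; substitute a real image from your dataset).
# Hash one image file; prints a 64-character hexadecimal string.
print(get_file_hash("flickr30k_images_small/example.jpg"))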
Image Captioning and Storage
def caption_and_store_images(image_folder):
image_paths = list(Path(image_folder).glob("*"))
documents = []
for img_path in tqdm(image_paths):
if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
continue
# Correct way to access the response
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [str(img_path)]
}]
)
caption = response['message']['content']
file_hash = get_file_hash(img_path)
doc = Document(
page_content=caption,
metadata={
'image_path': str(img_path),
'file_hash': file_hash
}
)
documents.append(doc)
if len(documents) >= 50:
vector_store.add_documents(documents)
documents = []
if documents:
vector_store.add_documents(documents)
- The function takes an image_folder as input, which is expected to contain image files.
- It uses Path(image_folder).glob(“*”) to get a list of all files in the specified folder.
- The function then iterates through each file path in the folder using a for loop with tqdm for progress visualization.
- For each file, it checks if the file extension is ‘.jpg’, ‘.jpeg’, or ‘.png’. If not, it skips to the next file.
- For valid image files, it uses the ollama.chat function with the ‘llava’ model to generate a caption for the image. The prompt asks for a detailed single-sentence description of the image.
- The generated caption is extracted from the response.
- A file hash is generated for the image using the get_file_hash function (defined earlier in the code).
- A Document object is created with the caption as the page_content and metadata including the image path and file hash.
- This Document object is appended to a list called documents.
- When the documents list reaches 50 items, they are added to the vector_store (a Milvus instance) using vector_store.add_documents(documents), and the documents list is cleared.
- After processing all images, any remaining documents in the list are added to the vector store.
This function essentially processes a folder of images, generates captions for each image using an AI model, and stores these captions along with metadata in a vector database for later retrieval and similarity search.
Execute the caption_and_store_images() function
caption_and_store_images('flickr30k_images_small')
The caption_and_store_images function processes our collection of images stored in the ‘flickr30k_images_small’ directory. When executed, this function will systematically work through each image, generating captions using the LLaVA model and storing them in our Milvus vector database.
For our dataset of 2000 images, this process takes approximately 150 minutes on a standard laptop, as each image requires both caption generation and embedding computation. The progress bar helps you track the processing status as it moves through the dataset. If you’re working with a larger dataset, consider running this process during a break or overnight, as the processing time scales linearly with the number of images. For improved performance, the code could be modified to run the captioning and database insertion processes in parallel.
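As one illustration of that idea, here is a minimal sketch that captions images concurrently with a thread pool and then inserts the results in batches from the main thread. It reuses the get_file_hash function and vector_store defined above; the worker count is an arbitrary assumption, and real throughput depends on how many parallel requests your Ollama server is configured to handle (see Ollama’s OLLAMA_NUM_PARALLEL setting).
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import ollama
from langchain.schema import Document
from tqdm import tqdm

def caption_one(img_path):
    """Caption a single image and return a Document (runs in a worker thread)."""
    response = ollama.chat(
        model='llava',
        messages=[{
            'role': 'user',
            'content': 'Describe this image in a detailed single sentence.',
            'images': [str(img_path)]
        }]
    )
    return Document(
        page_content=response['message']['content'],
        metadata={'image_path': str(img_path), 'file_hash': get_file_hash(img_path)}
    )

def caption_and_store_images_parallel(image_folder, max_workers=4, batch_size=50):
    image_paths = [p for p in Path(image_folder).glob("*")
                   if p.suffix.lower() in ['.jpg', '.jpeg', '.png']]
    documents = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(caption_one, p) for p in image_paths]
        for future in tqdm(as_completed(futures), total=len(futures)):
            try:
                documents.append(future.result())
            except Exception as e:
                print(f"Error captioning an image: {e}")
                continue
            if len(documents) >= batch_size:
                vector_store.add_documents(documents)
                documents = []
    if documents:
        vector_store.add_documents(documents)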
Image to Image Search
To test our image search system, we’ll need a query image. Download an ocean photo from any royalty-free image source and save it as ocean.png in your notebook’s directory.
We’ll use this image to generate a caption using LLaVA, following the same process we used for our flickr image dataset.
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': ['ocean.png']
}]
)
caption = str(response['message']['content'])
The final section of the code demonstrates how to perform a similarity search and display the results using the caption created from our ocean image.
First, the code imports the necessary functions from IPython to display images in the notebook:
from IPython.display import Image, display
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
caption, k=3,
)
Next, we can perform a similarity search in the vector store for images that match the caption created from our ocean.png image. The k=3 parameter limits the results to the top 3 matches. When it comes to similarity scores, the interpretation depends on the specific similarity metric being used, but generally:
For cosine similarity and dot product:
- A score closer to 1.0 indicates higher similarity
- A score closer to 0 indicates lower similarity
For distance-based metrics (like Euclidean distance):
- A score closer to 0 indicates higher similarity
- A larger score indicates lower similarity
When creating a collection in Milvus, the default metric type is "COSINE" if not explicitly specified.
Once we have the results, we can display them in our Jupyter notebook:
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    image_path = res.metadata['image_path']
    display(Image(filename=image_path))
    print(f"Image path: {image_path}")
    print("---")
For each search result, this code:
- Shows the similarity score (how closely the image matches the query)
- Displays the AI-generated caption for the image
- Renders the actual image in the notebook
- Shows the image’s file location
- Adds a visual separator between multiple results
This creates an interactive display where you can see both how well each image matches (via the similarity score) and visually verify the results.
This project showcases how modern AI can revolutionize image search by combining LLaVA’s vision capabilities with Milvus’s vector storage. The result is a powerful, privacy-focused system that runs entirely on your local machine. While the initial image processing requires time, the system delivers fast and intuitive image search through natural language.
The architecture is flexible and scalable – you can easily adapt it for larger datasets, reverse image search, or specialized image collections. The complete implementation is available in the provided Jupyter notebook, giving you a solid foundation to build your own semantic image search solution.
Full Code
# Import and initialize Ollama embeddings model
from langchain_ollama import OllamaEmbeddings
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm
from langchain_milvus import Milvus
from IPython.display import Image, display
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./image-to-image.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
def get_file_hash(file_path):
    """Calculate SHA-256 hash of file"""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()
def caption_and_store_images(image_folder):
image_paths = list(Path(image_folder).glob("*"))
documents = []
for img_path in tqdm(image_paths):
if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
continue
# Correct way to access the response
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [str(img_path)]
}]
)
caption = response['message']['content']
file_hash = get_file_hash(img_path)
doc = Document(
page_content=caption,
metadata={
'image_path': str(img_path),
'file_hash': file_hash
}
)
documents.append(doc)
if len(documents) >= 50:
vector_store.add_documents(documents)
documents = []
if documents:
vector_store.add_documents(documents)
# Process the dataset
caption_and_store_images('flickr30k_images_small')
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': ['ocean.png']
}]
)
caption = str(response['message']['content'])
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
caption, k=3,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    # Display the image
    image_path = res.metadata['image_path']
    display(Image(filename=image_path))
    print(f"Image path: {image_path}")
    print("---")
Full Code Without LangChain
from pymilvus import MilvusClient
import ollama
import hashlib
from pathlib import Path
from tqdm import tqdm
from IPython.display import Image, display
from typing import List, Dict, Optional
def get_file_hash(file_path: str) -> str:
"""Calculate SHA-256 hash of file"""
sha256_hash = hashlib.sha256()
with open(file_path, "rb") as f:
for byte_block in iter(lambda: f.read(4096), b""):
sha256_hash.update(byte_block)
return sha256_hash.hexdigest()
def initialize_vector_store() -> Optional[MilvusClient]:
try:
milvus_client = MilvusClient(uri="./no-lang-image-image.db")
collection_name = "image_to_image"
if milvus_client.has_collection(collection_name):
milvus_client.drop_collection(collection_name)
milvus_client.create_collection(
collection_name=collection_name,
dimension=768,
metric_type="COSINE",
consistency_level="Strong",
auto_id=True
)
return milvus_client
except Exception as e:
print(f"Error initializing vector store: {e}")
return None
def caption_and_store_images(image_folder: str, milvus_client: MilvusClient) -> None:
image_paths = list(Path(image_folder).glob("*"))
data = []
for img_path in tqdm(image_paths):
if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
continue
try:
# Get image caption
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [str(img_path)]
}]
)
caption = response['message']['content']
file_hash = get_file_hash(str(img_path))
# Get embedding for caption
embedding_response = ollama.embeddings(
model='nomic-embed-text',
prompt=caption
)
embedding = embedding_response['embedding']
data.append({
"vector": embedding,
"content": caption,
"image_path": str(img_path),
"file_hash": file_hash
})
# Batch insert when reaching 50 images
if len(data) >= 50:
milvus_client.insert(
collection_name="image_to_image",
data=data
)
data = []
except Exception as e:
print(f"Error processing image {img_path}: {e}")
continue
# Insert any remaining images
if data:
try:
milvus_client.insert(
collection_name="image_to_image",
data=data
)
except Exception as e:
print(f"Error in final batch insert: {e}")
def search_similar_images(query_image_path: str, milvus_client: MilvusClient, k: int = 3) -> None:
try:
# Get caption for query image
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [query_image_path]
}]
)
caption = response['message']['content']
print(f"Query Image Caption: {caption}\n")
# Get embedding for caption
embedding_response = ollama.embeddings(
model='nomic-embed-text',
prompt=caption
)
query_embedding = embedding_response['embedding']
# Search for similar images
results = milvus_client.search(
collection_name="image_to_image",
data=[query_embedding],
limit=k,
search_params={
"metric_type": "COSINE",
"params": {}
},
output_fields=["content", "image_path", "file_hash"]
)
# Display results
for hit in results[0]:
print(f"Similarity Score: {hit['distance']:.3f}")
print(f"Caption: {hit['entity']['content']}")
print(f"Image path: {hit['entity']['image_path']}")
display(Image(filename=hit['entity']['image_path']))
print("---")
except Exception as e:
print(f"Error during image search: {e}")
# Initialize vector store
milvus_client = initialize_vector_store()
if not milvus_client:
    raise RuntimeError("Failed to initialize vector store")
# Process and store images
print("Processing and storing images...")
caption_and_store_images('flickr30k_images_small', milvus_client)
# Search similar images
print("\nSearching for similar images...")
search_similar_images('ocean.png', milvus_client, k=5)