AI Research: Text-to-Image Search Systems

Text-to-image search represents a significant advancement in how we interact with visual data, moving beyond traditional keyword-based approaches to leverage natural language understanding and vector embeddings. By combining LangChain, Milvus, and Ollama, we can create an efficient system that transforms natural language descriptions into meaningful image searches. This implementation demonstrates how modern AI technologies can bridge the gap between textual descriptions and visual content, making image retrieval more intuitive and accessible.

System Architecture

The solution leverages three main components:

  • LangChain: Provides the framework for connecting various AI components
  • Milvus: A powerful vector database for storing and searching image embeddings
  • Ollama: Handles the image captioning and embedding generation

Overview

Our implementation utilizes a Jupyter notebook environment, providing an interactive and user-friendly interface for running and experimenting with the search tool. Through this notebook, we’ll walk through the entire process, from setting up the necessary components to performing actual searches.

Here’s a brief overview of what we’ll cover:
  1. Setting up the environment with required libraries
  2. Loading and preprocessing image data
  3. Using the LLaVA multimodal model through Ollama to caption images and store references to original image locations
  4. Generating vector embeddings for image captions using the nomic-embed-text model via Ollama
  5. Storing these embeddings in a local Milvus Lite vector database instance
  6. Implementing the search functionality
  7. Demonstrating text-to-image search capabilities

By the end of this tutorial, you’ll have a functional text-to-image search engine that can quickly retrieve visually relevant images based on textual descriptions. Whether you’re a developer, data scientist, or simply curious about AI-powered image search, this notebook will provide valuable insights into the workings of modern multimodal search systems.

Prerequisite Knowledge

Ollama setup

Before starting this text-to-image search project, you need to set up Ollama on your local system. First, download and install Ollama from the official website; installers are available for macOS, Windows, and Linux. After installation, pull two essential models: the LLaVA model for image captioning and the nomic-embed-text model for generating embeddings.

ollama pull llava
ollama pull nomic-embed-text

The LLaVA model is designed for vision-language tasks and comes in several sizes (7B, 13B, or 34B). You’ll also need to make sure Ollama is running as a server, which listens on port 11434 by default.
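
Before moving on, you can optionally confirm that the server is reachable and that both models have been pulled. A minimal check against Ollama’s local REST endpoint (assuming the default address http://localhost:11434), using only the Python standard library, might look like this:

import json
import urllib.request

# List the locally available models via Ollama's /api/tags endpoint
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp).get("models", [])

print([m["name"] for m in models])  # expect entries like 'llava:latest' and 'nomic-embed-text:latest'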

Downloading an Image Dataset

We need an image dataset to serve as the foundation for our text-to-image search engine. Start by downloading the Flickr30k Image Dataset from Kaggle. After downloading and unzipping it, create a smaller subset for easier experimentation: make a new directory (folder) named flickr30k_images_small in the same directory as flickr30k_images.

Once the new folder is created, you can randomly select 2000 images from the dataset and add them to the flickr30k_images_small folder. On macOS/Linux, you can use the command line to randomly select and copy a subset of images (note that shuf comes from GNU coreutils and may need to be installed separately on macOS); a Python alternative is sketched after the command breakdown below. Make sure to run this command from inside the flickr30k_images directory:

find . -type f | shuf -n 2000 | xargs -I {} cp {} ../flickr30k_images_small/

  • find . -type f: This part of the command searches for all files (-type f) in the current directory (.) and its subdirectories.
  • |: This is a pipe, which takes the output of the find command and passes it as input to the next command.
  • shuf -n 2000: The shuf command shuffles its input randomly. The -n 2000 option specifies that you want to select 2000 random lines from the input.
  • |: Another pipe, which passes the shuffled list of 2000 random files to the next command.
  • xargs -I {} cp {} ../flickr30k_images_small/: The xargs command builds and executes command lines from its input. The -I {} option tells xargs to replace {} with each input line (i.e., each file name). The cp {} command copies each selected file to the specified directory (../flickr30k_images_small/).
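
If you are on Windows, or simply prefer to stay in Python, a rough equivalent using only the standard library is sketched below (it assumes the flickr30k_images and flickr30k_images_small folders sit next to your notebook; adjust the paths otherwise):

import random
import shutil
from pathlib import Path

src = Path("flickr30k_images")
dst = Path("flickr30k_images_small")
dst.mkdir(exist_ok=True)

# Collect every file under the source folder and copy a random sample of 2000
files = [p for p in src.rglob("*") if p.is_file()]
for p in random.sample(files, 2000):
    shutil.copy(p, dst / p.name)
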
Setting up the environment with required libraries

First, we need to install the required packages.

!pip install -qU langchain langchain_milvus langchain-ollama ollama

This command installs (or upgrades to) the latest versions of these packages in your Python environment; the rest of the notebook depends on them. It’s common practice to include such installation commands at the beginning of a notebook so that all required dependencies are available.

Initialize Ollama Embeddings Model
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

This code block initializes the embedding model using LangChain’s Ollama integration. We use the “nomic-embed-text” model, which is specifically designed to convert text into high-dimensional vector representations. These vectors capture the semantic meaning of text, allowing for efficient similarity comparisons later in our search system.
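
As a quick sanity check, you can embed a sample caption and inspect the result. The nomic-embed-text model should produce a 768-dimensional vector:

# Embed a sample caption and check the vector dimensionality
sample_vector = embeddings.embed_query("A dog running on the beach")
print(len(sample_vector))  # expected: 768 for nomic-embed-text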

Initialize Local Vector Database
from langchain_milvus import Milvus

# Local Milvus Lite Instance
URI="./textoimage-search.db"

# Init vector store
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri":URI},
    auto_id=True,
)

We then set up our vector database using Milvus Lite, a lightweight version of the Milvus vector database that runs locally. The database is configured to store vectors at the specified URI path (“./textoimage-search.db”) on your local system. The embedding_function parameter connects our previously initialized Ollama embeddings model, while auto_id=True ensures each vector entry receives a unique identifier automatically. This local database setup provides efficient storage and retrieval of our vector embeddings without requiring external services or cloud infrastructure.
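
If you plan to re-run the notebook or keep several experiments in the same database file, the integration also exposes a few optional parameters. A variant is sketched below; the parameter names (collection_name, drop_old) reflect the langchain_milvus API at the time of writing, and the collection name shown is just an example, not required for this tutorial:

# Optional variant: name the collection explicitly and rebuild it on each run
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
    collection_name="image_captions",  # example collection name
    drop_old=True,                     # drop any existing collection with this name
    auto_id=True,
)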

Import Necessary Libraries
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm

File Hash Function
def get_file_hash(file_path):
    """Calculate SHA-256 hash of file"""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()

This function, get_file_hash, is designed to calculate the SHA-256 hash of a file. Here’s a breakdown of what it does:

  1. The function takes a file_path as an argument, which is the path to the file you want to hash.
  2. It initializes a SHA-256 hash object using hashlib.sha256().
  3. The file is opened in binary read mode (“rb”) using a context manager (with statement).
  4. The function then reads the file in chunks of 4096 bytes (4KB) at a time. This is done to efficiently handle large files without loading the entire file into memory at once.
  5. For each chunk (byte_block) read from the file, the update method of the hash object is called to update the hash with the new data.
  6. After all chunks have been processed, the hexdigest() method is called on the hash object to get the final hash value as a hexadecimal string.
  7. This hexadecimal string (the file’s SHA-256 hash) is then returned.

The purpose of this function in the context of the larger script is to generate a unique identifier for each image file. The hash can be used to detect duplicate files or to verify file integrity; a small sketch of that idea follows below. In the caption_and_store_images function, the hash is stored in the metadata of each document, which is useful for tracking or referencing the original files in the vector store.
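
As an illustration of that duplicate-detection idea (not part of the original pipeline, and the helper name is_duplicate is hypothetical), the hash can be tracked in a set so that byte-identical files are only captioned once:

seen_hashes = set()

def is_duplicate(file_path):
    """Return True if a byte-identical file was already processed."""
    file_hash = get_file_hash(file_path)
    if file_hash in seen_hashes:
        return True
    seen_hashes.add(file_hash)
    return False
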
Image Captioning and Storage
def caption_and_store_images(image_folder):
    image_paths = list(Path(image_folder).glob("*"))
    documents = []
    
    for img_path in tqdm(image_paths):
        if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
            continue
            
        # Caption the image with the LLaVA model via Ollama
        response = ollama.chat(
            model='llava',
            messages=[{
                'role': 'user',
                'content': 'Describe this image in a detailed single sentence.',
                'images': [str(img_path)]
            }]
        )
        
        caption = response['message']['content']
        file_hash = get_file_hash(img_path)
        
        doc = Document(
            page_content=caption,
            metadata={
                'image_path': str(img_path),
                'file_hash': file_hash
            }
        )
        documents.append(doc)
        
        if len(documents) >= 50:
            vector_store.add_documents(documents)
            documents = []
    
    if documents:
        vector_store.add_documents(documents)

  1. The function takes an image_folder as input, which is expected to contain image files.
  2. It uses Path(image_folder).glob(“*”) to get a list of all files in the specified folder.
  3. The function then iterates through each file path in the folder using a for loop with tqdm for progress visualization.
  4. For each file, it checks if the file extension is ‘.jpg’, ‘.jpeg’, or ‘.png’. If not, it skips to the next file.
  5. For valid image files, it uses the ollama.chat function with the ‘llava’ model to generate a caption for the image. The prompt asks for a detailed single-sentence description of the image.
  6. The generated caption is extracted from the response.
  7. A file hash is generated for the image using the get_file_hash function (defined earlier in the code).
  8. A Document object is created with the caption as the page_content and metadata including the image path and file hash.
  9. This Document object is appended to a list called documents.
  10. When the documents list reaches 50 items, they are added to the vector_store (a Milvus instance) using vector_store.add_documents(documents), and the documents list is cleared.
  11. After processing all images, any remaining documents in the list are added to the vector store.

This function essentially processes a folder of images, generates captions for each image using an AI model, and stores these captions along with metadata in a vector database for later retrieval and similarity search.

Execute the caption_and_store_images() function
caption_and_store_images('flickr30k_images_small')

The caption_and_store_images function processes our collection of images stored in the ‘flickr30k_images_small’ directory. When executed, this function will systematically work through each image, generating captions using the LLaVA model and storing them in our Milvus vector database.

For our dataset of 2000 images, this process takes approximately 120 minutes on a standard laptop, as each image requires both caption generation and embedding computation. The progress bar helps you track the processing status as it moves through the dataset. If you’re working with a larger dataset, consider running this process during a break or overnight, as the processing time scales linearly with the number of images. For improved performance, the code could be modified to run the captioning and database insertion processes in parallel.
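
As a rough sketch of that parallel variant (the actual speedup depends on how many concurrent requests your Ollama server is configured to accept, for example via the OLLAMA_NUM_PARALLEL setting), the captioning step could be fanned out over a thread pool while batching inserts as before. The helper names caption_one and caption_and_store_images_parallel are illustrative, not part of the original code, and they assume the objects defined earlier in the notebook:

from concurrent.futures import ThreadPoolExecutor

def caption_one(img_path):
    """Caption a single image and return a Document (same logic as above)."""
    response = ollama.chat(
        model='llava',
        messages=[{
            'role': 'user',
            'content': 'Describe this image in a detailed single sentence.',
            'images': [str(img_path)]
        }]
    )
    return Document(
        page_content=response['message']['content'],
        metadata={'image_path': str(img_path), 'file_hash': get_file_hash(img_path)}
    )

def caption_and_store_images_parallel(image_folder, workers=4, batch_size=50):
    image_paths = [p for p in Path(image_folder).glob("*")
                   if p.suffix.lower() in ['.jpg', '.jpeg', '.png']]
    batch = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields captioned Documents as worker threads finish them
        for doc in tqdm(pool.map(caption_one, image_paths), total=len(image_paths)):
            batch.append(doc)
            if len(batch) >= batch_size:
                vector_store.add_documents(batch)
                batch = []
    if batch:
        vector_store.add_documents(batch)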

Text-to-Image Search
from IPython.display import Image, display

# The following code performs similarity search
results = vector_store.similarity_search_with_score(
    "A mountain in the background", k=3,
)

for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content}")
    
    # Display the image
    image_path = res.metadata['image_path']
    display(Image(filename=image_path))
    print(f"Image path: {image_path}")
    print("---")

This final section of the code demonstrates how to perform a similarity search and display the results in a Jupyter notebook. Here’s a breakdown of what’s happening:

First, the code imports the Image and display helpers from IPython so that images can be rendered in the notebook. It then performs the similarity search:

results = vector_store.similarity_search_with_score(
    "A mountain in the background", k=3,
)

This line performs a similarity search in the vector store for images that match the query “A mountain in the background”. The k=3 parameter limits the results to the top 3 matches. How to interpret the similarity scores depends on the metric being used, but generally:

For cosine similarity and dot product:

  • A score closer to 1.0 indicates higher similarity
  • A score closer to 0 indicates lower similarity

For distance-based metrics (like Euclidean distance):

  • Larger scores indicate less similarity
  • A score closer to 0 indicates higher similarity

When creating a collection in Milvus, the default metric type is “COSINE” if not explicitly specified. The code then iterates through the search results, displaying images for each match:

for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    image_path = res.metadata['image_path']
    display(Image(filename=image_path))
    print(f"Image path: {image_path}")
    print("---")

This section effectively demonstrates the end-to-end functionality of the text-to-image search system, from querying to result visualization. Users can easily modify the search query to explore different results based on textual descriptions.

Looking ahead, this text-to-image search system demonstrates the powerful capabilities of combining modern AI models with vector databases. By leveraging Ollama’s LLaVA model for image captioning and Milvus for efficient vector storage, we’ve created a fully local, privacy-respecting solution for semantic image search.

The system can be easily extended to handle larger datasets or modified to support different use cases, such as reverse image search or custom image collections. While the initial processing time may be significant, the resulting search capabilities make it a valuable tool for managing and exploring image collections through natural language queries.

Feel free to experiment with different search queries, modify the code to suit your needs, or scale the system to handle your specific image collection. The complete code for this project is available in the accompanying Jupyter notebook, making it straightforward to get started with your own implementation.

Full Code
from langchain_ollama import OllamaEmbeddings
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm
from langchain_milvus import Milvus
from IPython.display import Image, display

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Local Milvus Lite Instance
URI="./textoimage-search.db"

# Init vector store
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri":URI},
    auto_id=True,
)

def get_file_hash(file_path):
    """Calculate SHA-256 hash of file"""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()

def caption_and_store_images(image_folder):
    image_paths = list(Path(image_folder).glob("*"))
    documents = []

    for img_path in tqdm(image_paths):
        if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
            continue

        # Caption the image with the LLaVA model via Ollama
        response = ollama.chat(
            model='llava',
            messages=[{
                'role': 'user',
                'content': 'Describe this image in a detailed single sentence.',
                'images': [str(img_path)]
            }]
        )

        caption = response['message']['content']
        file_hash = get_file_hash(img_path)

        doc = Document(
            page_content=caption,
            metadata={
                'image_path': str(img_path),
                'file_hash': file_hash
            }
        )
        documents.append(doc)

        if len(documents) >= 50:
            vector_store.add_documents(documents)
            documents = []

    if documents:
        vector_store.add_documents(documents)

# Process the dataset
caption_and_store_images('flickr30k_images_small')

# The following code performs similarity search
results = vector_store.similarity_search_with_score(
    "A mountain in the background", k=3,
)

for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")

    # Display the matched image alongside its caption and path
    image_path = res.metadata['image_path']
    display(Image(filename=image_path))
    print(f"Image path: {image_path}")
    print("---")
