Text-to-image search represents a significant advancement in how we interact with visual data, moving beyond traditional keyword-based approaches to leverage natural language understanding and vector embeddings. By combining LangChain, Milvus, and Ollama, we can create an efficient system that transforms natural language descriptions into meaningful image searches. This implementation demonstrates how modern AI technologies can bridge the gap between textual descriptions and visual content, making image retrieval more intuitive and accessible.
System Architecture
The solution leverages three main components:
- LangChain: Provides the framework for connecting various AI components
- Milvus: A powerful vector database for storing and searching image embeddings
- Ollama: Handles the image captioning and embedding generation
Prerequisite Knowledge
- Langchain -> https://python.langchain.com/docs/tutorials/
- Milvus -> https://milvus.io/docs/quickstart.md
- Ollama -> https://ollama.com/
- Python3/Jupyter Notebooks -> https://docs.python.org/3/tutorial/index.html
- Vector Embeddings -> https://milvus.io/intro
- Multimodal Models -> https://ollama.com/library/llava
Overview
Our implementation utilizes a Jupyter notebook environment, providing an interactive and user-friendly interface for running and experimenting with the search tool. Through this notebook, we’ll walk through the entire process, from setting up the necessary components to performing actual searches.
Here’s a brief overview of what we’ll cover:
- Setting up the environment with required libraries
- Loading and preprocessing image data
- Using the LLaVA multimodal model through Ollama to caption images and store references to original image locations
- Generating vector embeddings for image captions using the nomic-embed-text model via Ollama
- Storing these embeddings in a local Milvus Lite vector database instance
- Implementing the search functionality
- Demonstrating text-to-image search capabilities
By the end of this tutorial, you’ll have a functional text-to-image search engine that can quickly retrieve visually relevant images based on textual descriptions. Whether you’re a developer, data scientist, or simply curious about AI-powered image search, this notebook will provide valuable insights into the workings of modern multimodal search systems.
Ollama setup
Before starting this text-to-image search project, you need to set up Ollama on your local system. First, download and install Ollama from the official website, which is available for macOS, Windows, and Linux. After installation, you'll need to pull two essential models: the LLaVA model for image captioning and the nomic-embed-text model for generating embeddings.
ollama pull llava
ollama pull nomic-embed-text
The LLaVA model is specifically designed for vision-related tasks and comes in different sizes (7B, 13B, or 34B). You'll also need to ensure Ollama is running as a server, which listens on port 11434 by default.
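Before moving on, it's worth confirming that the server is reachable and that both models are available locally. Here is one quick check from Python; it assumes the ollama Python package (installed in a later step) is available in your environment.
import ollama

# Should return without error and list llava and nomic-embed-text among the
# locally available models. Assumes the Ollama server is running on the
# default port (11434).
print(ollama.list())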
Downloading an Image Dataset
We need to obtain an image dataset to serve as the foundation for our text-to-image search engine. Start by downloading the Flickr30k Image Dataset from Kaggle. After downloading and unzipping the dataset, you'll want to create a smaller subset for easier experimentation: create a new directory (folder) named flickr30k_images_small in the same directory as flickr30k_images.
Once the new folder is created, randomly select 2000 images from the dataset and add them to flickr30k_images_small. On macOS/Linux, you can use the command line to select and copy a random subset (a Python alternative for Windows users is sketched after the command breakdown below). Make sure to run this command from inside the flickr30k_images directory:
find . -type f | shuf -n 2000 | xargs -I {} cp {} ../flickr30k_images_small/
Here's a breakdown of the command:
- find . -type f: searches for all files (-type f) in the current directory (.) and its subdirectories.
- |: a pipe, which takes the output of the find command and passes it as input to the next command.
- shuf -n 2000: the shuf command shuffles its input randomly; the -n 2000 option selects 2000 random lines from the input.
- |: another pipe, which passes the shuffled list of 2000 random files to the next command.
- xargs -I {} cp {} ../flickr30k_images_small/: the xargs command builds and executes command lines from its input. The -I {} option tells xargs to replace {} with each input line (i.e., each file name), so cp {} copies each selected file to the specified directory (../flickr30k_images_small/).
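If you're on Windows, or simply prefer to stay inside the notebook, the same random subset can be built with a few lines of Python using only the standard library. This is a minimal sketch; the paths assume you unzipped the dataset into your working directory, so adjust them as needed.
import random
import shutil
from pathlib import Path

# Paths are assumptions: point src at the unzipped Flickr30k folder and
# dst at the subset folder you want to create.
src = Path("flickr30k_images")
dst = Path("flickr30k_images_small")
dst.mkdir(exist_ok=True)

# Collect every file under the source folder and copy a random sample of 2000.
files = [p for p in src.rglob("*") if p.is_file()]
for p in random.sample(files, 2000):
    shutil.copy(p, dst / p.name)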
Setting up the environment with required libraries
First, we need to install the required packages.
!pip install -qU langchain langchain_milvus langchain-ollama ollama tqdm
This command ensures that the latest versions of these packages are installed in your Python environment, which are necessary for the rest of the code in the notebook to run properly. It’s a common practice to include such installation commands at the beginning of a notebook to ensure all required dependencies are available.
Initialize Ollama Embeddings Model
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
This code block initializes the embedding model using LangChain’s Ollama integration. We use the “nomic-embed-text” model, which is specifically designed to convert text into high-dimensional vector representations. These vectors capture the semantic meaning of text, allowing for efficient similarity comparisons later in our search system.
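As a quick, optional sanity check, you can embed a sample sentence and inspect the result. The sentence below is arbitrary; nomic-embed-text should produce a 768-dimensional vector.
# Embed a sample sentence and inspect the vector it produces.
vec = embeddings.embed_query("A dog running on the beach")
print(len(vec))   # expected to be 768 for nomic-embed-text
print(vec[:5])    # first few components of the embedding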
Initialize Local Vector Database
from langchain_milvus import Milvus
# Local Milvus Lite Instance
URI="./textoimage-search.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
We then set up our vector database using Milvus Lite, a lightweight version of the Milvus vector database that runs locally. The database is configured to store vectors at the specified URI path ("./textoimage-search.db") on your local system. The embedding_function parameter connects our previously initialized Ollama embeddings model, while auto_id=True ensures each vector entry receives a unique identifier automatically. This local database setup provides efficient storage and retrieval of our vector embeddings without requiring external services or cloud infrastructure.
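One practical note: if you re-run the notebook from the top and want to rebuild the collection rather than append to the existing one, the langchain_milvus integration accepts a drop_old flag. A minimal sketch (only do this if you are happy to re-caption the images):
# Re-initialize the store, dropping any existing collection first so that
# repeated runs don't accumulate duplicate entries.
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
    auto_id=True,
    drop_old=True,
)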
Import Necessary Libraries
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm
File Hash Function
def get_file_hash(file_path):
"""Calculate SHA-256 hash of file"""
sha256_hash = hashlib.sha256()
with open(file_path, "rb") as f:
for byte_block in iter(lambda: f.read(4096), b""):
sha256_hash.update(byte_block)
return sha256_hash.hexdigest()
This function, get_file_hash, is designed to calculate the SHA-256 hash of a file. Here’s a breakdown of what it does:
- The function takes a file_path as an argument, which is the path to the file you want to hash.
- It initializes a SHA-256 hash object using hashlib.sha256().
- The file is opened in binary read mode (“rb”) using a context manager (with statement).
- The function then reads the file in chunks of 4096 bytes (4KB) at a time. This is done to efficiently handle large files without loading the entire file into memory at once.
- For each chunk (byte_block) read from the file, the update method of the hash object is called to update the hash with the new data.
- After all chunks have been processed, the hexdigest() method is called on the hash object to get the final hash value as a hexadecimal string.
- This hexadecimal string (the file's SHA-256 hash) is then returned.
The purpose of this function in the context of the larger script is to generate a unique identifier for each image file. This hash can be used to detect duplicate files or to verify file integrity. In the caption_and_store_images function, this hash is stored in the metadata of each document, which could be useful for tracking or referencing the original files in the vector store.
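For example, calling the function on any image from the subset (the filename below is just a placeholder) prints a 64-character hexadecimal digest:
# Example usage; substitute any image from your dataset for the placeholder path.
print(get_file_hash("flickr30k_images_small/example.jpg"))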
Image Captioning and Storage
def caption_and_store_images(image_folder):
image_paths = list(Path(image_folder).glob("*"))
documents = []
for img_path in tqdm(image_paths):
if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
continue
        # Generate a caption for the image with the LLaVA model
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [str(img_path)]
}]
)
caption = response['message']['content']
file_hash = get_file_hash(img_path)
doc = Document(
page_content=caption,
metadata={
'image_path': str(img_path),
'file_hash': file_hash
}
)
documents.append(doc)
if len(documents) >= 50:
vector_store.add_documents(documents)
documents = []
if documents:
vector_store.add_documents(documents)
- The function takes an image_folder as input, which is expected to contain image files.
- It uses Path(image_folder).glob(“*”) to get a list of all files in the specified folder.
- The function then iterates through each file path in the folder using a for loop with tqdm for progress visualization.
- For each file, it checks if the file extension is ‘.jpg’, ‘.jpeg’, or ‘.png’. If not, it skips to the next file.
- For valid image files, it uses the ollama.chat function with the ‘llava’ model to generate a caption for the image. The prompt asks for a detailed single-sentence description of the image.
- The generated caption is extracted from the response.
- A file hash is generated for the image using the get_file_hash function (defined earlier in the code).
- A Document object is created with the caption as the page_content and metadata including the image path and file hash.
- This Document object is appended to a list called documents.
- When the documents list reaches 50 items, they are added to the vector_store (a Milvus instance) using vector_store.add_documents(documents), and the documents list is cleared.
- After processing all images, any remaining documents in the list are added to the vector store.
This function essentially processes a folder of images, generates captions for each image using an AI model, and stores these captions along with metadata in a vector database for later retrieval and similarity search.
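Before committing to a full run, it can be helpful to caption a single image and eyeball the output. The snippet below mirrors the call inside the function; the path is a placeholder, so point it at any image in your subset.
# Preview the caption LLaVA produces for one image before processing the folder.
sample = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'Describe this image in a detailed single sentence.',
        'images': ['flickr30k_images_small/example.jpg']   # placeholder path
    }]
)
print(sample['message']['content'])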
Execute the caption_and_store_images() function
caption_and_store_images('flickr30k_images_small')
The caption_and_store_images function processes our collection of images stored in the 'flickr30k_images_small' directory. When executed, this function will systematically work through each image, generating captions using the LLaVA model and storing them in our Milvus vector database.
For our dataset of 2000 images, this process takes approximately 120 minutes on a standard laptop, as each image requires both caption generation and embedding computation. The progress bar helps you track the processing status as it moves through the dataset. If you're working with a larger dataset, consider running this process during a break or overnight, as the processing time scales linearly with the number of images. For improved performance, the code could be modified to run the captioning and database insertion in parallel; one possible approach is sketched below.
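The following sketch parallelizes the captioning step with a thread pool. It reuses ollama, Document, Path, tqdm, get_file_hash, and vector_store from the cells above; how much it helps depends on your hardware and on how many concurrent requests your Ollama server is configured to accept (the OLLAMA_NUM_PARALLEL setting), so treat it as a starting point rather than a drop-in replacement.
from concurrent.futures import ThreadPoolExecutor, as_completed

def caption_one(img_path):
    """Caption a single image with LLaVA and wrap the result in a Document."""
    response = ollama.chat(
        model='llava',
        messages=[{
            'role': 'user',
            'content': 'Describe this image in a detailed single sentence.',
            'images': [str(img_path)]
        }]
    )
    return Document(
        page_content=response['message']['content'],
        metadata={'image_path': str(img_path), 'file_hash': get_file_hash(img_path)}
    )

def caption_and_store_images_parallel(image_folder, max_workers=4, batch_size=50):
    image_paths = [p for p in Path(image_folder).glob("*")
                   if p.suffix.lower() in ['.jpg', '.jpeg', '.png']]
    documents = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(caption_one, p) for p in image_paths]
        # Batch inserts into Milvus so we aren't writing one document at a time.
        for future in tqdm(as_completed(futures), total=len(futures)):
            documents.append(future.result())
            if len(documents) >= batch_size:
                vector_store.add_documents(documents)
                documents = []
    if documents:
        vector_store.add_documents(documents)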
Text-to-Image Search
from IPython.display import Image, display
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"A mountain in the background", k=3,
)
for res, score in results:
print(f"* [SIM={score:3f}] {res.page_content}")
# Display the image
image_path = res.metadata['image_path']
display(Image(filename=image_path))
print(f"Image path: {image_path}")
print("---")
This final section of the code demonstrates how to perform a similarity search and display the results in a Jupyter notebook. Here’s a breakdown of what’s happening:
First, the code imports the necessary functions from IPython to display images in the notebook. It then performs the similarity search:
results = vector_store.similarity_search_with_score(
"A mountain in the background", k=3,
)
This line performs a similarity search in the vector store for images that match the query "A mountain in the background". The k=3 parameter limits the results to the top 3 matches. How to interpret the similarity scores depends on the specific metric being used, but generally:
For cosine similarity and dot product:
- A score closer to 1.0 indicates higher similarity
- A score closer to 0 indicates lower similarity
For distance-based metrics (like Euclidean distance):
- A lower score indicates higher similarity (the vectors are closer together)
- A higher score indicates lower similarity
When creating a collection in Milvus, the default metric type is "COSINE" if not explicitly specified. The code then iterates through the search results, displaying images for each match:
for res, score in results:
print(f"* [SIM={score:3f}] {res.page_content}")
image_path = res.metadata['image_path']
display(Image(filename=image_path))
print(f"Image path: {image_path}")
print("---")
This section effectively demonstrates the end-to-end functionality of the text-to-image search system, from querying to result visualization. Users can easily modify the search query to explore different results based on textual descriptions.
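For example, trying a different description and asking for more matches is a small change to the same call; the query string here is just an illustration.
# Search with a different description and return the top 5 matches.
results = vector_store.similarity_search_with_score(
    "Children playing in a park", k=5,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} -> {res.metadata['image_path']}")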
Looking ahead, this text-to-image search system demonstrates the powerful capabilities of combining modern AI models with vector databases. By leveraging Ollama’s LLaVA model for image captioning and Milvus for efficient vector storage, we’ve created a fully local, privacy-respecting solution for semantic image search.
The system can be easily extended to handle larger datasets or modified to support different use cases, such as reverse image search or custom image collections. While the initial processing time may be significant, the resulting search capabilities make it a valuable tool for managing and exploring image collections through natural language queries.
Feel free to experiment with different search queries, modify the code to suit your needs, or scale the system to handle your specific image collection. The complete code for this project is available in the accompanying Jupyter notebook, making it straightforward to get started with your own implementation.
Full Code
from langchain_ollama import OllamaEmbeddings
import ollama
import hashlib
from pathlib import Path
from langchain.schema import Document
from tqdm import tqdm
from langchain_milvus import Milvus
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./textoimage-search.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
def get_file_hash(file_path):
"""Calculate SHA-256 hash of file"""
sha256_hash = hashlib.sha256()
with open(file_path, "rb") as f:
for byte_block in iter(lambda: f.read(4096), b""):
sha256_hash.update(byte_block)
return sha256_hash.hexdigest()
def caption_and_store_images(image_folder):
image_paths = list(Path(image_folder).glob("*"))
documents = []
for img_path in tqdm(image_paths):
if img_path.suffix.lower() not in ['.jpg', '.jpeg', '.png']:
continue
        # Generate a caption for the image with the LLaVA model
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'Describe this image in a detailed single sentence.',
'images': [str(img_path)]
}]
)
caption = response['message']['content']
file_hash = get_file_hash(img_path)
doc = Document(
page_content=caption,
metadata={
'image_path': str(img_path),
'file_hash': file_hash
}
)
documents.append(doc)
if len(documents) >= 50:
vector_store.add_documents(documents)
documents = []
if documents:
vector_store.add_documents(documents)
# Process the dataset
caption_and_store_images('flickr30k_images_small')
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"A mountain in the background", k=3,
)
for res, score in results:
print(f"* [SIM={score:3f}] {res.page_content}")
image_path = res.metadata['image_path']
print(f"Image path: {image_path}")
print("---")