The quest for more accurate and meaningful information retrieval has led to significant advancements in search technologies. Traditional lexical search, which relies on exact keyword matches, has long been the cornerstone of information retrieval. However, as the volume of data continues to grow exponentially, the limitations of lexical search become increasingly apparent. Enter semantic search—a groundbreaking approach that leverages natural language processing (NLP) to understand the meaning and context behind queries.
Prerequisite Knowledge
- Langchain -> https://python.langchain.com/docs/tutorials/
- Milvus -> https://milvus.io/docs/quickstart.md
- Ollama -> https://ollama.com/
- Python3/Jupyter Notebooks -> https://docs.python.org/3/tutorial/index.html
- Vector Embeddings -> https://milvus.io/intro
Traditional Search Example
Imagine you have a database of articles and you search for the term “dog training tips.”
- Query: “dog training tips”
- Results:
- An article titled “Tips for Training a Dog” which directly matches the search query.
- A blog post that mentions “dog” in one sentence and “training tips” in another, but they are unrelated.
- A list of tips that includes the word “dog” but is not specifically about training dogs.
Lexical search matches the exact keywords “dog,” “training,” and “tips,” regardless of their context within the documents.
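To make the contrast concrete, here is a minimal sketch of what lexical matching boils down to: checking whether the query keywords literally appear in a document, with no notion of meaning. The documents and the `lexical_match` helper are hypothetical, invented purely for illustration.

```python
# A minimal sketch of lexical (keyword) matching -- documents are hypothetical
docs = [
    "Tips for Training a Dog",
    "My dog loves long walks. Here are some packing tips for travel.",
    "Positive reinforcement methods for Border Collies",
]

def lexical_match(query: str, doc: str) -> bool:
    # A document "matches" only if every query keyword appears in it literally
    return all(word in doc.lower() for word in query.lower().split())

for doc in docs:
    print(lexical_match("dog tips", doc), "->", doc)

# The Border Collie article never matches, even though it is the most
# relevant to dog training -- exactly the gap semantic search closes.
```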
Semantic Search Example
- Query: “dog training tips”
- Results:
- A comprehensive guide on how to train your Goldendoodle with various techniques and tips.
- An article about positive reinforcement methods for Border Collies.
- A blog post discussing effective housebreaking strategies for puppies of different breeds.
In this semantic search example, the search engine understands that “dog training tips” involves practical advice and techniques tailored to different dog breeds. It provides results that are contextually relevant and helpful, even though the exact phrase “dog training tips” may not be present in every result.
This demonstrates how semantic search can offer more meaningful and contextually appropriate results by understanding the intent and nuances behind the query, rather than just matching keywords.
Advantages of Semantic Search:
- Understanding Context:
- Semantic search understands the context and meaning behind the words, rather than just matching keywords. For example, if you search for “heart attack symptoms,” a semantic search engine can return results about “myocardial infarction” symptoms as well.
- Handling Synonyms and Related Terms:
- It recognizes synonyms and related terms, which means it can return relevant results even if the exact keywords are not present in the documents. For instance, searching for “automobile” will also find documents containing “car” (see the sketch after this list).
- Improved Relevance:
- Because it understands the meaning of the query, semantic search can provide more relevant and accurate results. This can improve the user experience significantly.
- Natural Language Processing:
- Semantic search can handle natural language queries better. Users can ask questions in a conversational manner, and the search engine can understand and provide appropriate responses.
- Ranking by Meaning:
- Results are ranked based on their relevance to the meaning of the query, rather than just the presence of keywords. This helps in finding the most meaningful and useful information quickly.
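As a rough illustration of the context and synonym handling described above, we can compare embedding vectors directly. This sketch assumes a local Ollama install with the `nomic-embed-text` model already pulled (both are set up later in this article); the `cosine` helper is written by hand here rather than taken from any library.

```python
import math

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

car = embeddings.embed_query("car")
automobile = embeddings.embed_query("automobile")
banana = embeddings.embed_query("banana")

# "car" vs "automobile" should score noticeably higher than "car" vs
# "banana", even though the words share no characters.
print(f"car vs automobile: {cosine(car, automobile):.3f}")
print(f"car vs banana:     {cosine(car, banana):.3f}")
```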
Advantages of Traditional Lexical Search:
- Speed:
- Lexical searches are often faster because they rely on simple keyword matching and indexing.
- Simplicity:
- It’s straightforward and easy to implement, especially for small datasets or simple search tasks.
- Exact Match:
- If the exact term or phrase is important, lexical search can be more precise in returning exact matches.
Both methods have their own strengths, but semantic search is particularly powerful for understanding and retrieving information based on meaning and context. This can be extremely useful for complex queries and large datasets.
Creating a Semantic Search Tool
Let’s create a new Jupyter Notebook to demonstrate how to do text-to-text semantic search using Python3, Langchain, and Milvus. This setup is a good way to explore the nuances of semantic search and how it can significantly improve the way we retrieve information.
Ollama setup
Before starting this semantic search project, you need to set up Ollama on your local system. First, download and install Ollama from the official website, which is available for macOS, Windows, and Linux. After installation, you’ll need to download the `nomic-embed-text` embedding model:
ollama pull nomic-embed-text
Downloading A Book For Semantic Search
To test our semantic search, we’ll use “The Fellowship of the Ring” as our sample text. You can download a text version of this book from the Internet Archive. Once downloaded, extract the contents of the file if it is compressed.
macOS
Using Archive Utility:
- Right-click the `.gz` file.
- Select Open With > Archive Utility.
Using Terminal:
- Open Terminal from Applications > Utilities.
- Navigate to the directory containing the `.gz` file using the `cd` command.
- Run `gunzip j-r-r-tolkien-lord-of-the-rings-01-the-fellowship-of-the-ring-retail-pdf_hocr_searchtext.txt.gz` to extract the file.
Linux
Using Terminal:
- Open Terminal.
- Navigate to the directory containing the `.gz` file using the `cd` command.
- Run `gunzip j-r-r-tolkien-lord-of-the-rings-01-the-fellowship-of-the-ring-retail-pdf_hocr_searchtext.txt.gz` to extract the file.
Using GUI:
- Right-click the `.gz` file.
- Select Extract Here or Extract to.
Windows
Using 7-Zip:
- Download and install 7-Zip.
- Right-click the `.gz` file.
- Select 7-Zip > Extract Here.
Once the file has been extracted, move it to the same directory as your Jupyter Notebook and rename it to `fellowship-of-the-ring.txt`.
Setting up the environment with required libraries
Next, we need to install the required packages.
!pip install -qU langchain langchain_milvus langchain-ollama ollama
This command ensures that the latest versions of these packages are installed in your Python environment, which are necessary for the rest of the code in the notebook to run properly.
Import Necessary Libraries
# Import the libraries used throughout the notebook
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
Loading The Document
loader = TextLoader("./fellowship-of-the-ring.txt")
text_documents = loader.load()
This code snippet loads the content of a text file named “fellowship-of-the-ring.txt” into memory for further processing. It uses the `TextLoader` class from the Langchain library to create a loader object specifically designed to handle text files. The `load()` method is then called on this loader object, which reads the entire contents of the specified file and returns it as a list of document objects. Each document object contains the text content along with any metadata associated with the file. In this case, the loaded text, presumably the content of J.R.R. Tolkien’s “The Fellowship of the Ring,” is stored in the `text_documents` variable, making it available for subsequent operations such as text splitting, embedding generation, or semantic analysis.
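To sanity-check the load, you can inspect the returned list. The exact metadata keys depend on the loader, but a `TextLoader` document carries its source path; the snippet below is just a quick inspection, and its output will vary with your copy of the file.

```python
# Quick sanity check on the loaded document
print(len(text_documents))                    # TextLoader returns a single Document
print(text_documents[0].metadata)             # e.g. {'source': './fellowship-of-the-ring.txt'}
print(text_documents[0].page_content[:200])   # first 200 characters of the book
```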
Splitting The Text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(text_documents)
A text splitter is like a document chopper – it breaks down long texts into smaller, manageable pieces (chunks). The `RecursiveCharacterTextSplitter` creates chunks of text with these properties:
- Each chunk is approximately 1000 characters long (`chunk_size=1000`)
- Each chunk overlaps with the next chunk by 200 characters (`chunk_overlap=200`)
Why Use Overlapping Chunks?
The overlap ensures that sentences or concepts that might be split between chunks aren’t lost. For example:
Original text: "Gandalf the Grey was a powerful wizard who helped the hobbits."
Chunk 1: "Gandalf the Grey was a powerful"
Chunk 2: "was a powerful wizard who helped"
Chunk 3: "wizard who helped the hobbits."
The overlap (shown in the repeated words) helps maintain context and prevents important information from being cut off at chunk boundaries. This is particularly important for semantic search, as it ensures that related concepts stay together even when the text is split.
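You can see this behavior directly by running the splitter on the example sentence with deliberately tiny settings. The sizes below are illustrative only, not what we use for the book, and the exact chunk boundaries may differ slightly from the hand-drawn example above.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sentence = "Gandalf the Grey was a powerful wizard who helped the hobbits."
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=12)

# Each printed chunk should repeat the tail end of the previous one
for chunk in demo_splitter.split_text(sentence):
    print(repr(chunk))
```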
The final line, `documents = text_splitter.split_documents(text_documents)`, applies this splitting process to your loaded text, creating a list of smaller document chunks that are easier to process and analyze.
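After splitting the full book, it is worth confirming the result; the exact chunk count will vary with the copy of the text you downloaded.

```python
# Inspect the split -- the exact count depends on your text file
print(f"Number of chunks: {len(documents)}")
print(f"First chunk starts with:\n{documents[0].page_content[:300]}")
```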
Vector Database Setup
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./semantic-searches.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
This code sets up the core components for turning text into searchable vectors. Let’s break it down:
Embedding Creation
embeddings = OllamaEmbeddings(model="nomic-embed-text")
This line creates an embedding function using the “nomic-embed-text” model from Ollama. Think of this as a translator that converts words and sentences into numbers (vectors) that a computer can understand and compare.
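A quick way to demystify this “translator” is to embed a short string and look at the raw output: it is just a fixed-length list of floating-point numbers, with the length determined by the model. Print it rather than taking any particular dimensionality on faith.

```python
# Embed a single string and inspect the raw vector
vector = embeddings.embed_query("Gandalf the Grey")
print(len(vector))    # vector dimensionality, fixed by the embedding model
print(vector[:5])     # first few components -- plain floats
```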
Database Setup
URI="./semantic-searches.db"
This specifies where the vector database will be stored on your computer. Pointing the URI at a local file like this runs Milvus Lite, an embedded, serverless version of Milvus, so no separate database server is required.
Vector Store Initialization
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
This creates a new vector store using Milvus, which is like a special filing cabinet for vectors. It’s set up with:
- The embedding function we created earlier to convert text to vectors
- The location where it should store its data (the URI we defined)
- `auto_id=True`, which means it will automatically assign unique IDs to each vector it stores
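If you rerun the notebook, you may want a clean slate rather than appending duplicate chunks. The sketch below assumes the `collection_name` and `drop_old` parameters of `langchain_milvus.Milvus`; the collection name is a hypothetical choice, and you should check your installed version if these arguments raise an error.

```python
# Optional variant: name the collection and rebuild it from scratch on rerun
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
    collection_name="fellowship_demo",  # hypothetical name for this demo
    drop_old=True,                      # drop any previous copy of the collection
    auto_id=True,
)
```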
Upload The Documents To The Vector Store
vector_store.add_documents(documents)
This line takes all the previously split text chunks (documents) and adds them to the Milvus vector store by converting each chunk into vector embeddings and storing them in the database for later searching.
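Embedding a whole book can take a little while, since every chunk passes through the embedding model. As a small variant of the call above, you can capture the return value; `add_documents` returns one generated ID per stored chunk, which doubles as a quick sanity check.

```python
# add_documents returns one auto-generated ID per stored chunk
ids = vector_store.add_documents(documents)
print(f"Stored {len(ids)} chunks in Milvus")
```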
Semantic Search
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"Wizard", k=3,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    print('-----------')
This code performs a semantic search for the word “Wizard” in the stored text, retrieving the 3 most similar passages (k=3), and then prints each result with its similarity score. The loop displays each matching text passage along with a numerical score showing how closely it matches the search term:
The interpretation of similarity scores depends on the specific similarity metric being used, but generally:
For cosine similarity and dot product:
- A score closer to 1.0 indicates higher similarity
- A score closer to 0 indicates lower similarity
For distance-based metrics (like Euclidean distance):
- A lower score indicates higher similarity (the vectors are closer together)
When creating a collection in Milvus, the default metric type is `"COSINE"` if not explicitly specified.
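Because the match is semantic, the query does not have to be a single keyword; a full natural-language question works too, and you can filter results by score. The question and the 0.5 cutoff below are arbitrary illustrations, not recommended values.

```python
# Ask a natural-language question instead of a single keyword
results = vector_store.similarity_search_with_score(
    "Who helped the hobbits on their journey?", k=5,
)

# With the COSINE metric, higher scores mean closer matches, so keep only
# results above an (arbitrary, illustrative) threshold.
for doc, score in results:
    if score > 0.5:
        print(f"[{score:.3f}] {doc.page_content[:120]}")
```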
Bringing It All Together
Semantic search takes text analysis to a whole new level, moving beyond simple keyword matching. With just a few lines of code using Langchain, Ollama, and Milvus, we’ve built a powerful search engine that understands the meaning behind words. Our implementation can process entire books, break them into manageable chunks, and find relevant passages based on their semantic meaning rather than exact word matches. Semantic search has numerous practical applications across various industries, such as:
Financial Applications
- Real-time analysis of market reports and financial documents
- Cross-referencing financial metrics across multiple sources
- Identifying market trends and patterns in news articles
Enterprise Knowledge Management
- Centralized access to corporate documentation
- Intelligent document retrieval across departments
- Automated knowledge base organization and search
AI and Machine Learning
- Enhanced RAG implementations for chatbots
- Contextual recommendation engines
- Improved document summarization systems
Full Code
# Import the required libraries
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("./fellowship-of-the-ring.txt")
text_documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(text_documents)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./semantic-searches.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
vector_store.add_documents(documents)
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"Wizard", k=3,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    print('-----------')