The quest for more accurate and meaningful information retrieval has led to significant advancements in search technologies. Traditional lexical search, which relies on exact keyword matches, has long been the cornerstone of information retrieval. However, as the volume of data continues to grow exponentially, the limitations of lexical search become increasingly apparent. Enter semantic search—a groundbreaking approach that leverages natural language processing (NLP) to understand the meaning and context behind queries.
Prerequisite Knowledge
- Langchain -> https://python.langchain.com/docs/tutorials/
- Milvus -> https://milvus.io/docs/quickstart.md
- Ollama -> https://ollama.com/
- Python3/Jupyter Notebooks -> https://docs.python.org/3/tutorial/index.html
- Vector Embeddings -> https://milvus.io/intro
Traditional Search Example
Imagine you have a database of articles and you search for the term “dog training tips.”
- Query: “dog training tips”
- Results:
- An article titled “Tips for Training a Dog” which directly matches the search query.
- A blog post that mentions “dog” in one sentence and “training tips” in another, but they are unrelated.
- A list of tips that includes the word “dog” but is not specifically about training dogs.
Lexical search matches the exact keywords “dog,” “training,” and “tips,” regardless of their context within the documents.
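To make the contrast concrete, here is a minimal sketch of what lexical matching boils down to: checking whether the query keywords literally appear in a document, with no notion of meaning. The documents and the `lexical_match` helper are hypothetical, invented purely for illustration.

```python
# A minimal sketch of lexical (keyword) matching -- documents are hypothetical
docs = [
    "Tips for Training a Dog",
    "My dog loves long walks. Here are some packing tips for travel.",
    "Positive reinforcement methods for Border Collies",
]

def lexical_match(query: str, doc: str) -> bool:
    # A document "matches" only if every query keyword appears in it literally
    return all(word in doc.lower() for word in query.lower().split())

for doc in docs:
    print(lexical_match("dog tips", doc), "->", doc)

# The Border Collie article never matches, even though it is the most
# relevant to dog training -- exactly the gap semantic search closes.
```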
Semantic Search Example
- Query: “dog training tips”
- Results:
- A comprehensive guide on how to train your Goldendoodle with various techniques and tips.
- An article about positive reinforcement methods for Border Collies.
- A blog post discussing effective housebreaking strategies for puppies of different breeds.
In this semantic search example, the search engine understands that “dog training tips” involves practical advice and techniques tailored to different dog breeds. It provides results that are contextually relevant and helpful, even though the exact phrase “dog training tips” may not be present in every result.
This demonstrates how semantic search can offer more meaningful and contextually appropriate results by understanding the intent and nuances behind the query, rather than just matching keywords.
Advantages of Semantic Search:
- Understanding Context:
- Semantic search understands the context and meaning behind the words, rather than just matching keywords. For example, if you search for “heart attack symptoms,” a semantic search engine can return results about “myocardial infarction” symptoms as well.
- Handling Synonyms and Related Terms:
- It recognizes synonyms and related terms, which means it can return relevant results even if the exact keywords are not present in the documents. For instance, searching for “automobile” will also find documents containing “car” (see the sketch after this list).
- Improved Relevance:
- Because it understands the meaning of the query, semantic search can provide more relevant and accurate results. This can improve the user experience significantly.
- Natural Language Processing:
- Semantic search can handle natural language queries better. Users can ask questions in a conversational manner, and the search engine can understand and provide appropriate responses.
- Ranking by Meaning:
- Results are ranked based on their relevance to the meaning of the query, rather than just the presence of keywords. This helps in finding the most meaningful and useful information quickly.
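As a rough illustration of the context and synonym handling described above, we can compare embedding vectors directly. This sketch assumes a local Ollama install with the `nomic-embed-text` model already pulled (both are set up later in this article); the `cosine` helper is written by hand here rather than taken from any library.

```python
import math

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

car = embeddings.embed_query("car")
automobile = embeddings.embed_query("automobile")
banana = embeddings.embed_query("banana")

# "car" vs "automobile" should score noticeably higher than "car" vs
# "banana", even though the words share no characters.
print(f"car vs automobile: {cosine(car, automobile):.3f}")
print(f"car vs banana:     {cosine(car, banana):.3f}")
```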
Advantages of Traditional Lexical Search:
- Speed:
- Lexical searches are often faster because they rely on simple keyword matching and indexing.
- Simplicity:
- It’s straightforward and easy to implement, especially for small datasets or simple search tasks.
- Exact Match:
- If the exact term or phrase is important, lexical search can be more precise in returning exact matches.
Both methods have their own strengths, but semantic search is particularly powerful for understanding and retrieving information based on meaning and context. This can be extremely useful for complex queries and large datasets.
Creating a Semantic Search Tool
Let’s create a new Jupyter Notebook to demonstrate how to do text-to-text semantic search using Python3, Langchain, and Milvus. This setup is a good way to explore the nuances of semantic search and how it can significantly improve the way we retrieve information.
Ollama setup
Before starting this semantic search project, you need to set up Ollama on your local system. First, download and install Ollama from the official website, which is available for macOS, Windows, and Linux. After installation, you’ll need to download the `nomic-embed-text` embedding model:
ollama pull nomic-embed-text
Downloading A Book For Semantic Search
To test our semantic search, we’ll use “The Fellowship of the Ring” as our sample text. You can download a text version of this book from the Internet Archive. Once downloaded, extract the contents of the file if it is compressed.
macOS
Using Archive Utility:
- Right-click the `.gz` file.
- Select Open With > Archive Utility.
Using Terminal:
- Open Terminal from Applications > Utilities.
- Navigate to the directory containing the `.gz` file using the `cd` command.
- Run `gunzip j-r-r-tolkien-lord-of-the-rings-01-the-fellowship-of-the-ring-retail-pdf_hocr_searchtext.txt.gz` to extract the file.
Linux
Using Terminal:
- Open Terminal.
- Navigate to the directory containing the `.gz` file using the `cd` command.
- Run `gunzip j-r-r-tolkien-lord-of-the-rings-01-the-fellowship-of-the-ring-retail-pdf_hocr_searchtext.txt.gz` to extract the file.
Using GUI:
- Right-click the `.gz` file.
- Select Extract Here or Extract to.
Windows
Using 7-Zip:
- Download and install 7-Zip.
- Right-click the `.gz` file.
- Select 7-Zip > Extract Here.
Once the file has been extracted, move it to the same directory as your Jupyter Notebook and rename it to `fellowship-of-the-ring.txt`.
Setting up the environment with required libraries
Next, we need to install the required packages.
!pip install -qU langchain langchain_milvus langchain-ollama ollama
This command ensures that the latest versions of these packages are installed in your Python environment, which are necessary for the rest of the code in the notebook to run properly.
Import Necessary Libraries
# Import the libraries used throughout the notebook
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
Loading The Document
loader = TextLoader("./fellowship-of-the-ring.txt")
text_documents = loader.load()
This code snippet loads the content of a text file named “fellowship-of-the-ring.txt” into memory for further processing. It uses the `TextLoader` class from the Langchain library to create a loader object specifically designed to handle text files. The `load()` method is then called on this loader object, which reads the entire contents of the specified file and returns it as a list of document objects. Each document object contains the text content along with any metadata associated with the file. In this case, the loaded text, presumably the content of J.R.R. Tolkien’s “The Fellowship of the Ring,” is stored in the `text_documents` variable, making it available for subsequent operations such as text splitting, embedding generation, or semantic analysis.
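To sanity-check the load, you can inspect the returned list. The exact metadata keys depend on the loader, but a `TextLoader` document carries its source path; the snippet below is just a quick inspection, and its output will vary with your copy of the file.

```python
# Quick sanity check on the loaded document
print(len(text_documents))                    # TextLoader returns a single Document
print(text_documents[0].metadata)             # e.g. {'source': './fellowship-of-the-ring.txt'}
print(text_documents[0].page_content[:200])   # first 200 characters of the book
```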
Splitting The Text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(text_documents)
A text splitter is like a document chopper – it breaks down long texts into smaller, manageable pieces (chunks). The `RecursiveCharacterTextSplitter` creates chunks of text with these properties:
- Each chunk is approximately 1000 characters long (`chunk_size=1000`)
- Each chunk overlaps with the next chunk by 200 characters (`chunk_overlap=200`)
Why Use Overlapping Chunks?
The overlap ensures that sentences or concepts that might be split between chunks aren’t lost. For example:
Original text: "Gandalf the Grey was a powerful wizard who helped the hobbits."
Chunk 1: "Gandalf the Grey was a powerful"
Chunk 2: "was a powerful wizard who helped"
Chunk 3: "wizard who helped the hobbits."
The overlap (shown in the repeated words) helps maintain context and prevents important information from being cut off at chunk boundaries. This is particularly important for semantic search, as it ensures that related concepts stay together even when the text is split.
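You can see this behavior directly by running the splitter on the example sentence with deliberately tiny settings. The sizes below are illustrative only, not what we use for the book, and the exact chunk boundaries may differ slightly from the hand-drawn example above.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sentence = "Gandalf the Grey was a powerful wizard who helped the hobbits."
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=30, chunk_overlap=12)

# Each printed chunk should repeat the tail end of the previous one
for chunk in demo_splitter.split_text(sentence):
    print(repr(chunk))
```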
The final line, `documents = text_splitter.split_documents(text_documents)`, applies this splitting process to your loaded text, creating a list of smaller document chunks that are easier to process and analyze.
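After splitting the full book, it is worth confirming the result; the exact chunk count will vary with the copy of the text you downloaded.

```python
# Inspect the split -- the exact count depends on your text file
print(f"Number of chunks: {len(documents)}")
print(f"First chunk starts with:\n{documents[0].page_content[:300]}")
```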
Vector Database Setup
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./semantic-searches.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
This code sets up the core components for turning text into searchable vectors. Let’s break it down:
Embedding Creation
embeddings = OllamaEmbeddings(model="nomic-embed-text")
This line creates an embedding function using the “nomic-embed-text” model from Ollama. Think of this as a translator that converts words and sentences into numbers (vectors) that a computer can understand and compare.
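A quick way to demystify this “translator” is to embed a short string and look at the raw output: it is just a fixed-length list of floating-point numbers, with the length determined by the model. Print it rather than taking any particular dimensionality on faith.

```python
# Embed a single string and inspect the raw vector
vector = embeddings.embed_query("Gandalf the Grey")
print(len(vector))    # vector dimensionality, fixed by the embedding model
print(vector[:5])     # first few components -- plain floats
```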
Database Setup
URI="./semantic-searches.db"
This specifies where the vector database will be stored on your computer. Pointing the URI at a local file like this runs Milvus Lite, an embedded, serverless version of Milvus, so no separate database server is required.
Vector Store Initialization
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
This creates a new vector store using Milvus, which is like a special filing cabinet for vectors. It’s set up with:
- The embedding function we created earlier to convert text to vectors
- The location where it should store its data (the URI we defined)
- `auto_id=True`, which means it will automatically assign unique IDs to each vector it stores
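If you rerun the notebook, you may want a clean slate rather than appending duplicate chunks. The sketch below assumes the `collection_name` and `drop_old` parameters of `langchain_milvus.Milvus`; the collection name is a hypothetical choice, and you should check your installed version if these arguments raise an error.

```python
# Optional variant: name the collection and rebuild it from scratch on rerun
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
    collection_name="fellowship_demo",  # hypothetical name for this demo
    drop_old=True,                      # drop any previous copy of the collection
    auto_id=True,
)
```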
Upload The Documents To The Vector Store
vector_store.add_documents(documents)
This line takes all the previously split text chunks (documents) and adds them to the Milvus vector store by converting each chunk into vector embeddings and storing them in the database for later searching.
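Embedding a whole book can take a little while, since every chunk passes through the embedding model. As a small variant of the call above, you can capture the return value; `add_documents` returns one generated ID per stored chunk, which doubles as a quick sanity check.

```python
# add_documents returns one auto-generated ID per stored chunk
ids = vector_store.add_documents(documents)
print(f"Stored {len(ids)} chunks in Milvus")
```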
Semantic Search
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"Wizard", k=3,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    print('-----------')
This code performs a semantic search for the word “Wizard” in the stored text, retrieving the 3 most similar passages (k=3), and then prints each result with its similarity score. The loop displays each matching text passage along with a numerical score showing how closely it matches the search term:
The interpretation of similarity scores depends on the specific similarity metric being used, but generally:
For cosine similarity and dot product:
- A score closer to 1.0 indicates higher similarity
- A score closer to 0 indicates lower similarity
For distance-based metrics (like Euclidean distance):
- A lower score indicates higher similarity (the vectors are closer together)
When creating a collection in Milvus, the default metric type is `"COSINE"` if not explicitly specified.
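Because the match is semantic, the query does not have to be a single keyword; a full natural-language question works too, and you can filter results by score. The question and the 0.5 cutoff below are arbitrary illustrations, not recommended values.

```python
# Ask a natural-language question instead of a single keyword
results = vector_store.similarity_search_with_score(
    "Who helped the hobbits on their journey?", k=5,
)

# With the COSINE metric, higher scores mean closer matches, so keep only
# results above an (arbitrary, illustrative) threshold.
for doc, score in results:
    if score > 0.5:
        print(f"[{score:.3f}] {doc.page_content[:120]}")
```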
Bringing It All Together
Semantic search takes text analysis to a whole new level, moving beyond simple keyword matching. With just a few lines of code using Langchain, Ollama, and Milvus, we’ve built a powerful search engine that understands the meaning behind words. Our implementation can process entire books, break them into manageable chunks, and find relevant passages based on their semantic meaning rather than exact word matches. Semantic search has numerous practical applications across various industries, such as:
Financial Applications
- Real-time analysis of market reports and financial documents
- Cross-referencing financial metrics across multiple sources
- Identifying market trends and patterns in news articles
Enterprise Knowledge Management
- Centralized access to corporate documentation
- Intelligent document retrieval across departments
- Automated knowledge base organization and search
AI and Machine Learning
- Enhanced RAG implementations for chatbots
- Contextual recommendation engines
- Improved document summarization systems
Full Code
# Import the required libraries
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("./fellowship-of-the-ring.txt")
text_documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(text_documents)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Local Milvus Lite Instance
URI="./semantic-searches.db"
# Init vector store
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri":URI},
auto_id=True,
)
vector_store.add_documents(documents)
# The following code performs similarity search
results = vector_store.similarity_search_with_score(
"Wizard", k=3,
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content}")
    print('-----------')