AI Research: Text To Audio Semantic Search

In this article, we will walk through the creation of a robust text-to-audio search tool using Jupyter Notebook. We’ll leverage state-of-the-art technologies such as LangChain, Whisper, and Milvus to efficiently transcribe audio files and search through them based on text queries.

The text-to-audio search tool operates through a straightforward three-step process:

  1. Audio Transcription: The system first converts audio files into text using OpenAI’s Whisper model, breaking down the audio into timestamped segments.
  2. Vector Embedding: Each transcribed text segment is then converted into numerical vectors using Ollama’s embedding model (nomic-embed-text), which captures the semantic meaning of the content. The numerical vectors are then stored in the Milvus vector database.
  3. Semantic Search: When a user enters a text query, the system converts it to the same vector format and uses Milvus to find similar vectors in the database, returning the most relevant audio segments along with their timestamps and similarity scores.
Prerequisite Knowledge
Ollama Setup

First, install Ollama on your computer. It’s available for MacOS, Windows, and Linux through the Ollama website. After installation, you’ll need to download the nomic-embed-text model, which will handle the text embedding process essential for the search functionality.

ollama pull nomic-embed-text
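
Once the model is downloaded, you can run a quick sanity check from Python using the ollama package (a minimal sketch; it assumes the ollama Python client is installed and the Ollama server is running):

# Quick check (sketch): embed a short sentence and inspect the vector length.
import ollama

response = ollama.embeddings(model="nomic-embed-text", prompt="Hello, audio search!")
print(len(response["embedding"]))  # nomic-embed-text produces 768-dimensional vectors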
Install FFMPEG

The core functionality of our audio processing relies on OpenAI's Whisper, a powerful speech recognition model that converts spoken words into text with high accuracy across multiple languages and accents. Before using Whisper, you’ll need to install ffmpeg, a command-line utility that enables Whisper to process different audio file formats by converting them into compatible formats for transcription.

Installation Methods

You can install ffmpeg using package managers depending on your operating system:

Linux (Ubuntu/Debian)

sudo apt update && sudo apt install ffmpeg

MacOS

brew install ffmpeg

Windows

choco install ffmpeg

Important Notes

  • FFmpeg must be installed before using Whisper. Without it, you’ll encounter errors when trying to process audio.
  • The MacOS installation command requires Homebrew, and the Windows command requires Chocolatey.
  • After installing ffmpeg, you may need to restart your development environment or terminal for the changes to take effect; a quick way to confirm the installation from Python is shown below.
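
A minimal sketch to confirm the installation from within your notebook (it only checks that ffmpeg is visible on the PATH, which is what Whisper needs):

# Check that ffmpeg is visible to the Python process that will run Whisper.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path:
    print("ffmpeg found at:", ffmpeg_path)
else:
    print("ffmpeg not found - install it and restart your terminal or kernel")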
Installation of Required Libraries

First, we need to install the necessary libraries. Make sure you have the following libraries installed:

%pip install -U langchain langchain-community langchain_milvus langchain-ollama
%pip install git+https://github.com/openai/whisper.git
Importing Libraries

Next, we import the required libraries:

import tempfile 
import whisper 
import ollama 
from langchain_ollama import OllamaEmbeddings 
from langchain_milvus import Milvus 
import os 
from langchain.schema import Document
Transcribing Audio Files with Whisper

We then use Whisper, a powerful automatic speech recognition (ASR) tool by OpenAI, to transcribe audio files into text segments:

def process_audio(audio_path):
    try:
        whisper_model = whisper.load_model("base")
        result = whisper_model.transcribe(audio_path)
        audio_transcriptions = result['segments']

        combined_documents = []
        for segment in audio_transcriptions:
            content = f"Transcript: {segment['text']}\n"
            combined_documents.append(Document(
                page_content=content,
                metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
            ))
        return combined_documents
    except FileNotFoundError:
        print(f"Error: The file {audio_path} was not found.")
        return
    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return

The process_audio function uses Whisper to transcribe an audio file, splits the transcription into segments, and converts them into Document objects with relevant metadata. It also includes error handling for cases where the file is not found, as well as for other issues that may arise during transcription.
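
If you want to inspect the output on its own, a quick sketch like the following works (it assumes the sample file condo-tour.mp3, introduced later in this article, sits next to the notebook):

# Sketch: transcribe a sample file and look at the first resulting Document.
docs = process_audio("condo-tour.mp3")
if docs:
    print(docs[0].page_content)  # "Transcript: ..."
    print(docs[0].metadata)      # timestamp, filename, and duration of the segment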

Loading and Transcribing Audio with Whisper

        whisper_model = whisper.load_model("base")
        result = whisper_model.transcribe(audio_path)
        audio_transcriptions = result['segments']

whisper_model = whisper.load_model("base"): Loads the Whisper model (the base version in this case) to perform automatic speech recognition (ASR).

result = whisper_model.transcribe(audio_path): Transcribes the audio file at the given audio_path.

audio_transcriptions = result['segments']: Extracts the transcription segments from the result. Each segment represents a portion of the audio along with its corresponding transcription.
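
For reference, each entry in result['segments'] is a dictionary. A trimmed-down, illustrative sketch of the keys this article relies on (Whisper also returns additional fields such as token and confidence data) looks like this:

# Illustrative shape of a single Whisper segment (the values here are made up):
segment = {
    "start": 0.0,    # start time of the segment in seconds
    "end": 4.2,      # end time of the segment in seconds
    "text": " Welcome to the condo tour.",  # transcribed text for this span
}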

Creating Document Objects

        combined_documents = []
        for segment in audio_transcriptions:
            content = f"Transcript: {segment['text']}\n"
            combined_documents.append(Document(
                page_content=content,
                metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
            ))
        return combined_documents

combined_documents = []: Initializes an empty list to store the Document objects.

for segment in audio_transcriptions: Iterates over each transcription segment.

content = f"Transcript: {segment['text']}\n": Formats the transcription text into a string.

combined_documents.append(Document(...)): Creates a Document object for each segment and appends it to the combined_documents list. Each Document includes:

  • page_content: The transcription text.
  • metadata: Metadata such as the timestamp, filename, and duration of the segment.

Exception Handling

    except FileNotFoundError:
        print(f"Error: The file {audio_path} was not found.")
        return
    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return

except FileNotFoundError: Catches the specific error if the file at audio_path is not found and prints an error message.

except Exception as e: Catches any other exceptions that might occur during the process and prints an error message.

Initializing the Vector Store

We initialize a vector store using Milvus, a high-performance vector database for scalable similarity search, with embeddings generated using OllamaEmbeddings. Because the connection URI points to a local file (./audio-search.db), this uses Milvus Lite, the lightweight embedded version of Milvus, so no separate database server is required:

def initialize_vector_store():
    try:
        embeddings = OllamaEmbeddings(model="nomic-embed-text")
        vector_store = Milvus(
            embedding_function=embeddings,
            connection_args={"uri": "./audio-search.db"},
            auto_id=True,
        )
        return vector_store
    except Exception as e:
        print(f"Error initializing vector store: {e}")
        return None
Main Function

The main function combines the audio processing and vector store initialization, and adds the processed audio documents to the vector store:

def main(audio_path):
    vector_store = initialize_vector_store()
    if not vector_store:
        return

    audio_documents = process_audio(audio_path)
    if audio_documents:
        try:
            vector_store.add_documents(audio_documents)
            print("Successfully processed audio and stored embeddings")
        except Exception as e:
            print(f"Error storing documents in vector store: {e}")
Finding Audio Files

Before diving into the functionality of your text-to-audio search tool, you’ll need an audio file to work with. This will allow you to test the transcription and search capabilities effectively.

Search for Audio Content: Use search engines or audio libraries like SoundCloud, Free Music Archive, or even YouTube to find the audio files you need.

Convert Video to Audio: If you find a video with the audio content you need, you can use an online mp4 to mp3 converter to extract the audio, or extract it locally with ffmpeg, as sketched below.
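
Since ffmpeg is already installed for Whisper, you can also do this extraction locally. Here is a minimal sketch using Python's subprocess module (input.mp4 and output.mp3 are placeholder filenames):

# Sketch: strip the video stream and save the audio track as an mp3 with ffmpeg.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vn", "-q:a", "2", "output.mp3"],
    check=True,  # raise an error if ffmpeg exits with a non-zero status
)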

Organize Files in Jupyter Notebook Directory: Once you have your audio file (whether it’s directly downloaded or converted from a video), place it in the same directory as your Jupyter notebook. This makes it easier to access and process the file within your notebook.

Rename File: Make sure the audio file has a name that matches the filename used in your code. For example, if your code refers to condo-tour.mp3, rename your file accordingly to avoid any errors during processing.

Calling Main

Now we can call the main function to transcribe and embed our audio files:

main("condo-tour.mp3") 
main("news.wav")
Performing A Similarity Search

Finally, we perform a similarity search using the text query “Inside” to find the most relevant segments in the audio files:

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "./audio-search.db"},
    auto_id=True,
)

# Search with similarity scores
results = vector_store.similarity_search_with_score(
    "Inside",
    k=5
)

for doc, score in results:
    print(f"Similarity Score: {score:.3f}")
    print(f"---Content---\n")
    print(f"Filename: {doc.metadata['filename']}")
    print(f"Duration: {doc.metadata['duration']} seconds")
    print(f"Timestamp: {doc.metadata['timestamp']} seconds")
    print(f"{doc.page_content}")
    print("-------------------")

The provided code establishes a process to locate and rank relevant audio segments based on a text query. Initially, we set up the embeddings and vector store by creating an OllamaEmbeddings object using the nomic-embed-text model. This model transforms text data into numerical vectors, facilitating similarity searches. The Milvus object is then initialized with the embedding function and connection specifics, including the URI for the database and automatic ID generation for stored documents.

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "./audio-search.db"},
    auto_id=True,
)

Next, the similarity search process is initiated. We use the similarity_search_with_score method to search for documents that are similar to the query “Inside.” This method retrieves the top five most similar documents, where the query term is “Inside” and the parameter k=5 specifies that the top five matches should be returned.

# Search with similarity scores
results = vector_store.similarity_search_with_score(
    "Inside",
    k=5
)

Finally, the results are displayed in a loop that iterates over the search results. For each result, the similarity score and document content are printed. The similarity score indicates how closely the document matches the query. The document’s metadata and transcribed content are also displayed, including the filename, duration, timestamp, and the transcribed text of the audio segment.

for doc, score in results:
    print(f"Similarity Score: {score:.3f}")
    print(f"---Content---\n")
    print(f"Filename: {doc.metadata['filename']}")
    print(f"Duration: {doc.metadata['duration']} seconds")
    print(f"Timestamp: {doc.metadata['timestamp']} seconds")
    print(f"{doc.page_content}")
    print("-------------------")

This setup effectively supports transcription, storage, and search of audio content based on text queries, making it a robust tool for handling large sets of audio data.
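
If you plan to run several queries, you can wrap the steps above in a small helper. This is just a convenience sketch built from the same OllamaEmbeddings and Milvus calls shown earlier, not part of the original pipeline:

# Convenience sketch: a reusable query helper around the search flow above.
def search_audio(query, k=5):
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vector_store = Milvus(
        embedding_function=embeddings,
        connection_args={"uri": "./audio-search.db"},
        auto_id=True,
    )
    for doc, score in vector_store.similarity_search_with_score(query, k=k):
        print(f"{score:.3f}  {doc.metadata['filename']} @ {doc.metadata['timestamp']}s")
        print(doc.page_content)

search_audio("Inside", k=3)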

Conclusion

In this article, we explored the creation of a powerful text-to-audio search tool using Jupyter Notebook. We delved into the integration of advanced technologies such as LangChain, Whisper, OllamaEmbeddings, and Milvus to transcribe audio files and perform efficient similarity searches based on text queries.

We began by installing the necessary libraries and importing the required modules. Next, we discussed the process of transcribing audio files using Whisper and converting the transcriptions into documents. We then set up a vector store using Milvus to store the embeddings generated by OllamaEmbeddings, ensuring that our data was organized and searchable.

To demonstrate the capabilities of our tool, we walked through the process of performing a similarity search, retrieving and displaying relevant audio segments based on a given text query. This comprehensive setup highlights the practical applications of combining speech recognition, embeddings, and vector databases to manage and search large sets of audio data effectively.

By following the steps outlined in this article, you can implement your own text-to-audio search tool, tailored to your specific needs. This approach provides a scalable and efficient solution for organizing and retrieving information from audio content, making it an invaluable asset for various applications.

Full Code
import tempfile 
import whisper 
import ollama 
from langchain_ollama import OllamaEmbeddings 
from langchain_milvus import Milvus 
import os 
from langchain.schema import Document

def process_audio(audio_path):
    try:
        whisper_model = whisper.load_model("base")
        result = whisper_model.transcribe(audio_path)
        audio_transcriptions = result['segments']

        combined_documents = []
        for segment in audio_transcriptions:
            content = f"Transcript: {segment['text']}\n"
            combined_documents.append(Document(
                page_content=content,
                metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
            ))
        return combined_documents
    except FileNotFoundError:
        print(f"Error: The file {audio_path} was not found.")
        return
    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return

def initialize_vector_store():
    try:
        embeddings = OllamaEmbeddings(model="nomic-embed-text")
        vector_store = Milvus(
            embedding_function=embeddings,
            connection_args={"uri": "./audio-search.db"},
            auto_id=True,
        )
        return vector_store
    except Exception as e:
        print(f"Error initializing vector store: {e}")
        return None

def main(audio_path):
    vector_store = initialize_vector_store()
    if not vector_store:
        return

    audio_documents = process_audio(audio_path)
    if audio_documents:
        try:
            vector_store.add_documents(audio_documents)
            print("Successfully processed audio and stored embeddings")
        except Exception as e:
            print(f"Error storing documents in vector store: {e}")

main("condo-tour.mp3")
main("news.wav")

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "./audio-search.db"},
    auto_id=True,
)

# Search with similarity scores
results = vector_store.similarity_search_with_score(
    "Inside",
    k=5
)

for doc, score in results:
    print(f"Similarity Score: {score:.3f}")
    print(f"---Content---\n")
    print(f"Filename: {doc.metadata['filename']}")
    print(f"Duration: {doc.metadata['duration']} seconds")
    print(f"Timestamp: {doc.metadata['timestamp']} seconds")
    print(f"{doc.page_content}")
    print("-------------------")
Full Code Without Langchain
import tempfile
import whisper
import ollama
from pymilvus import MilvusClient
import os
from typing import List, Dict

def process_audio(audio_path: str) -> List[Dict]:
    try:
        whisper_model = whisper.load_model("base")
        result = whisper_model.transcribe(audio_path)
        audio_transcriptions = result['segments']

        combined_documents = []
        for segment in audio_transcriptions:
            content = f"Transcript: {segment['text']}\n"
            combined_documents.append({
                'content': content,
                'metadata': {'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
            })
        return combined_documents
    except FileNotFoundError:
        print(f"Error: The file {audio_path} was not found.")
        return []
    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return []

def initialize_vector_store():
    try:
        milvus_client = MilvusClient(uri="./no-lang-audio.db")
        collection_name = "audio_collection"

        # Drop existing collection if it exists
        if milvus_client.has_collection(collection_name):
            milvus_client.drop_collection(collection_name)

        # Create new collection with auto_id enabled
        milvus_client.create_collection(
            collection_name=collection_name,
            dimension=768,
            metric_type="IP",
            consistency_level="Strong",
            auto_id=True  # Enable auto_id
        )
        return milvus_client
    except Exception as e:
        print(f"Error initializing vector store: {e}")
        return None

def add_documents_to_vector_store(milvus_client, documents: List[Dict]):
    try:
        data = []
        for doc in documents:  # no manual ids needed - the collection uses auto_id
            response = ollama.embeddings(
                model='nomic-embed-text',
                prompt=doc['content']
            )
            data.append({
                "vector": response['embedding'],  # Changed from 'embedding' to 'vector'
                "content": doc['content'],
                "timestamp": doc['metadata']['timestamp'], 
                "filename": doc['metadata']['filename'],
                "duration": doc['metadata']['duration']
            })
        if len(data) > 0:
            milvus_client.insert(
                collection_name="audio_collection",
                data=data
            )
            print("Successfully processed audio and stored embeddings")
    except Exception as e:
        print(f"Error storing documents in vector store: {e}")

def main(audio_path: str):
    milvus_client = initialize_vector_store()
    if milvus_client is None:
        return
    
    audio_documents = process_audio(audio_path)
    if audio_documents:
        add_documents_to_vector_store(milvus_client, audio_documents)

def search_similar(query: str, k: int = 5):
    milvus_client = MilvusClient(uri="./no-lang-audio.db")
    
    response = ollama.embeddings(
        model='nomic-embed-text',
        prompt=query
    )
    query_embedding = response['embedding']
    
    results = milvus_client.search(
        collection_name="audio_collection",
        data=[query_embedding],
        limit=k,
        search_params={
            "metric_type": "IP",
            "params": {}
        },
        output_fields=["content", "timestamp", "filename", "duration"]
    )
    
    for hit in results[0]:
        print(f"Similarity Score: {hit['distance']:.3f}")
        print(f"---Content---\n")
        print(f"Filename: {hit['entity']['filename']}")
        print(f"Duration: {hit['entity']['duration']} seconds")
        print(f"Timestamp: {hit['entity']['timestamp']} seconds")
        print(f"{hit['entity']['content']}")
        print("-------------------")

# Embed and store documents
main("condo-tour.mp3")
main("news.wav")

# Example search
search_similar("Fireworks", k=5)
