In this article, we will walk through the creation of a robust text-to-audio search tool using Jupyter Notebook. We’ll leverage state-of-the-art technologies such as LangChain, Whisper, and Milvus to efficiently transcribe audio files and search through them based on text queries.
The text-to-audio search tool operates through a straightforward three-step process:
- Audio Transcription: The system first converts audio files into text using OpenAI’s Whisper model, breaking down the audio into timestamped segments.
- Vector Embedding: Each transcribed text segment is then converted into numerical vectors using the nomic-embed-text embedding model served through Ollama, which captures the semantic meaning of the content. The numerical vectors are then stored in the Milvus vector database.
- Semantic Search: When a user enters a text query, the system converts it to the same vector format and uses Milvus to find similar vectors in the database, returning the most relevant audio segments along with their timestamps and similarity scores.
Prerequisite Knowledge
- LangChain -> https://python.langchain.com/docs/tutorials/
- Milvus -> https://milvus.io/docs/quickstart.md
- Ollama -> https://ollama.com/
- Python3/Jupyter Notebooks -> https://docs.python.org/3/tutorial/index.html
- Vector Embeddings -> https://milvus.io/intro
Ollama Setup
First, install Ollama on your computer. It’s available for macOS, Windows, and Linux through the Ollama website. After installation, you’ll need to download the nomic-embed-text model, which will handle the text embedding process essential for the search functionality.
ollama pull nomic-embed-text
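To confirm the model is available, you can run a quick sanity check with the Ollama Python client (install it with pip install ollama if you don’t already have it). This snippet is only an illustrative check, not part of the final tool:
import ollama

# Embed a short test string and confirm a vector comes back
response = ollama.embeddings(model="nomic-embed-text", prompt="hello world")
print(len(response["embedding"]))  # nomic-embed-text produces 768-dimensional vectors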
Install FFmpeg
The core functionality of our audio processing relies on OpenAI’s Whisper, a powerful speech recognition model that converts spoken words into text with high accuracy across multiple languages and accents. Before using Whisper, you’ll need to install ffmpeg, a command-line utility that enables Whisper to process different audio file formats by converting them into compatible formats for transcription.
Installation Methods
You can install ffmpeg using package managers depending on your operating system:
Linux (Ubuntu/Debian)
sudo apt update && sudo apt install ffmpeg
macOS
brew install ffmpeg
Windows
choco install ffmpeg
Important Notes
- FFmpeg must be installed before using Whisper. Without it, you’ll encounter errors when trying to process audio.
- The ffmpeg installation command for macOS requires Homebrew, and the Windows command requires Chocolatey.
- After installing ffmpeg, you may need to restart your development environment or terminal for the changes to take effect. A quick verification command is shown below.
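To confirm that ffmpeg is installed and on your PATH, print its version:
ffmpeg -version
If this prints version details rather than a “command not found” error, Whisper will be able to use it.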
Installation of Required Libraries
First, we need to install the necessary libraries. Make sure you have the following libraries installed:
%pip install -U langchain langchain-community langchain_milvus langchain-ollama
%pip install git+https://github.com/openai/whisper.git
Importing Libraries
Next, we import the required libraries:
import tempfile
import whisper
import ollama
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
import os
from langchain.schema import Document
Transcribing Audio Files with Whisper
We then use Whisper, a powerful automatic speech recognition (ASR) tool by OpenAI, to transcribe audio files into text segments:
def process_audio(audio_path):
try:
whisper_model = whisper.load_model("base")
result = whisper_model.transcribe(audio_path)
audio_transcriptions = result['segments']
combined_documents = []
for segment in audio_transcriptions:
content = f"Transcript: {segment['text']}\n"
combined_documents.append(Document(
page_content=content,
metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
))
return combined_documents
except FileNotFoundError:
print(f"Error: The file {audio_path} was not found.")
return
except Exception as e:
print(f"An error occurred during transcription: {e}")
return
The process_audio function uses Whisper to transcribe an audio file, processes the transcription into segments, and converts them into Document objects with relevant metadata. It also includes error handling for cases where the file is not found or other issues arise during transcription.
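As a quick check, you can call the function directly on an audio file (here condo-tour.mp3, the sample file used later in this article) and inspect the first document:
docs = process_audio("condo-tour.mp3")
if docs:
    print(f"Created {len(docs)} documents")  # one Document per transcribed segment
    print(docs[0].page_content)              # "Transcript: ..."
    print(docs[0].metadata)                  # timestamp, filename, duration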
Loading and Transcribing Audio with Whisper
whisper_model = whisper.load_model("base")
result = whisper_model.transcribe(audio_path)
audio_transcriptions = result['segments']
- whisper_model = whisper.load_model("base"): Loads the Whisper model (the base version in this case) to perform automatic speech recognition (ASR).
- result = whisper_model.transcribe(audio_path): Transcribes the audio file at the given audio_path.
- audio_transcriptions = result['segments']: Extracts the transcription segments from the result. Each segment represents a portion of the audio along with its corresponding transcription and timing information, as illustrated below.
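For reference, each segment is a dictionary. A single segment looks roughly like the following (the values are made up for illustration, and Whisper also returns fields such as id and tokens that we don’t use here):
{
    'start': 0.0,                           # segment start time in seconds
    'end': 3.2,                             # segment end time in seconds
    'text': ' Welcome to the condo tour.'   # transcribed text for this segment
}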
Creating Document Objects
combined_documents = []
for segment in audio_transcriptions:
content = f"Transcript: {segment['text']}\n"
combined_documents.append(Document(
page_content=content,
metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
))
return combined_documents
- combined_documents = []: Initializes an empty list to store the Document objects.
- for segment in audio_transcriptions: Iterates over each transcription segment.
- content = f"Transcript: {segment['text']}\n": Formats the transcription text into a string.
- combined_documents.append(Document(...)): Creates a Document object for each segment and appends it to the combined_documents list. Each Document includes page_content (the transcription text) and metadata (the timestamp, filename, and duration of the segment).
Exception Handling
except FileNotFoundError:
print(f"Error: The file {audio_path} was not found.")
return
except Exception as e:
print(f"An error occurred during transcription: {e}")
return
- except FileNotFoundError: Catches the specific error raised if the file at audio_path is not found and prints an error message.
- except Exception as e: Catches any other exceptions that occur during the process and prints an error message.
Initializing the Vector Store
We initialize a vector store using Milvus, a high-performance vector database for scalable similarity search, with embeddings generated using OllamaEmbeddings:
def initialize_vector_store():
try:
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./audio-search.db"},
auto_id=True,
)
return vector_store
except Exception as e:
print(f"Error initializing vector store: {e}")
return None
Main Function
The main function combines the audio processing and vector store initialization, and adds the processed audio documents to the vector store:
def main(audio_path):
vector_store = initialize_vector_store()
if not vector_store:
return
audio_documents = process_audio(audio_path)
if audio_documents:
try:
vector_store.add_documents(audio_documents)
print("Successfully processed audio and stored embeddings")
except Exception as e:
print(f"Error storing documents in vector store: {e}")
Finding Audio Files
Before diving into the functionality of your text-to-audio search tool, you’ll need an audio file to work with. This will allow you to test the transcription and search capabilities effectively.
Search for Audio Content: Use search engines or audio libraries like SoundCloud, Free Music Archive, or even YouTube to find the audio files you need.
Convert Video to Audio: If you find a video with the audio content you need, you can use an online mp4 to mp3 converter, or run ffmpeg directly (see the command after this list), to extract the audio.
Organize Files in Jupyter Notebook Directory: Once you have your audio file (whether it’s directly downloaded or converted from a video), place it in the same directory as your Jupyter notebook. This makes it easier to access and process the file within your notebook.
Rename File: Make sure the audio file has a name that matches the filename used in your code. For example, if your code refers to condo-tour.mp3, rename your file accordingly to avoid any errors during processing.
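As mentioned above, since ffmpeg is already installed for Whisper, you can extract the audio track locally instead of using an online converter. The filenames here are placeholders; substitute your own:
ffmpeg -i input.mp4 -vn -q:a 2 output.mp3
The -vn flag drops the video stream, and -q:a 2 selects a high-quality variable bitrate for the MP3 output.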
Calling Main
Now we can call the main function to transcribe and embed our audio files:
main("condo-tour.mp3")
main("news.wav")
Performing a Similarity Search
Finally, we perform a similarity search using the text query “Inside” to find the most relevant segments in the audio files:
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./audio-search.db"},
auto_id=True,
)
# Search with similarity scores
results = vector_store.similarity_search_with_score(
"Inside",
k=5
)
for doc, score in results:
print(f"Similarity Score: {score:.3f}")
print(f"---Content---\n")
print(f"Filename: {doc.metadata['filename']}")
print(f"Duration: {doc.metadata['duration']} seconds")
print(f"Timestamp: {doc.metadata['timestamp']} seconds")
print(f"{doc.page_content}")
print("-------------------")
The provided code establishes a process to locate and rank relevant audio segments based on a text query. Initially, we set up the embeddings and vector store by creating an OllamaEmbeddings object using the nomic-embed-text model. This model transforms text data into numerical vectors, facilitating similarity searches. The Milvus object is then initialized with the embedding function and connection specifics, including the URI for the database and automatic ID generation for stored documents.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./audio-search.db"},
auto_id=True,
)
Next, the similarity search is performed. We use the similarity_search_with_score method to find documents similar to the query “Inside,” and the parameter k=5 specifies that the top five matches should be returned.
# Search with similarity scores
results = vector_store.similarity_search_with_score(
"Inside",
k=5
)
Finally, the results are displayed in a loop that iterates over the search results. For each result, the similarity score and document content are printed. The similarity score indicates how closely the document matches the query. The document’s metadata and transcribed content are also displayed, including the filename, duration, timestamp, and the transcribed text of the audio segment.
for doc, score in results:
print(f"Similarity Score: {score:.3f}")
print(f"---Content---\n")
print(f"Filename: {doc.metadata['filename']}")
print(f"Duration: {doc.metadata['duration']} seconds")
print(f"Timestamp: {doc.metadata['timestamp']} seconds")
print(f"{doc.page_content}")
print("-------------------")
This setup effectively supports transcription, storage, and search of audio content based on text queries, making it a robust tool for handling large sets of audio data.
Conclusion
In this article, we explored the creation of a powerful text-to-audio search tool using Jupyter Notebook. We delved into the integration of advanced technologies such as LangChain, Whisper, OllamaEmbeddings, and Milvus to transcribe audio files and perform efficient similarity searches based on text queries.
We began by installing the necessary libraries and importing the required modules. Next, we discussed the process of transcribing audio files using Whisper and converting the transcriptions into documents. We then set up a vector store using Milvus to store the embeddings generated by OllamaEmbeddings, ensuring that our data was organized and searchable.
To demonstrate the capabilities of our tool, we walked through the process of performing a similarity search, retrieving and displaying relevant audio segments based on a given text query. This comprehensive setup highlights the practical applications of combining speech recognition, embeddings, and vector databases to manage and search large sets of audio data effectively.
By following the steps outlined in this article, you can implement your own text-to-audio search tool, tailored to your specific needs. This approach provides a scalable and efficient solution for organizing and retrieving information from audio content, making it an invaluable asset for various applications.
Full Code
import tempfile
import whisper
import ollama
from langchain_ollama import OllamaEmbeddings
from langchain_milvus import Milvus
import os
from langchain.schema import Document
def process_audio(audio_path):
try:
whisper_model = whisper.load_model("base")
result = whisper_model.transcribe(audio_path)
audio_transcriptions = result['segments']
combined_documents = []
for segment in audio_transcriptions:
content = f"Transcript: {segment['text']}\n"
combined_documents.append(Document(
page_content=content,
metadata={'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
))
return combined_documents
except FileNotFoundError:
print(f"Error: The file {audio_path} was not found.")
return
except Exception as e:
print(f"An error occurred during transcription: {e}")
return
def initialize_vector_store():
try:
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./audio-search.db"},
auto_id=True,
)
return vector_store
except Exception as e:
print(f"Error initializing vector store: {e}")
return None
def main(audio_path):
vector_store = initialize_vector_store()
if not vector_store:
return
audio_documents = process_audio(audio_path)
if audio_documents:
try:
vector_store.add_documents(audio_documents)
print("Successfully processed audio and stored embeddings")
except Exception as e:
print(f"Error storing documents in vector store: {e}")
main("condo-tour.mp3")
main("news.wav")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector_store = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./audio-search.db"},
auto_id=True,
)
# Search with similarity scores
results = vector_store.similarity_search_with_score(
"Inside",
k=5
)
for doc, score in results:
print(f"Similarity Score: {score:.3f}")
print(f"---Content---\n")
print(f"Filename: {doc.metadata['filename']}")
print(f"Duration: {doc.metadata['duration']} seconds")
print(f"Timestamp: {doc.metadata['timestamp']} seconds")
print(f"{doc.page_content}")
print("-------------------")
Full Code Without Langchain
import tempfile
import whisper
import ollama
from pymilvus import MilvusClient
import os
from typing import List, Dict
def process_audio(audio_path: str) -> List[Dict]:
try:
whisper_model = whisper.load_model("base")
result = whisper_model.transcribe(audio_path)
audio_transcriptions = result['segments']
combined_documents = []
for segment in audio_transcriptions:
content = f"Transcript: {segment['text']}\n"
combined_documents.append({
'content': content,
'metadata': {'timestamp': segment['start'], 'filename': os.path.basename(audio_path), 'duration': segment['end'] - segment['start']}
})
return combined_documents
except FileNotFoundError:
print(f"Error: The file {audio_path} was not found.")
return []
except Exception as e:
print(f"An error occurred during transcription: {e}")
return []
def initialize_vector_store():
    try:
        milvus_client = MilvusClient(uri="./no-lang-audio.db")
        collection_name = "audio_collection"
        # Create the collection only if it does not already exist, so processing
        # a second audio file does not drop the embeddings from the first one
        if not milvus_client.has_collection(collection_name):
            milvus_client.create_collection(
                collection_name=collection_name,
                dimension=768,  # nomic-embed-text produces 768-dimensional vectors
                metric_type="IP",
                consistency_level="Strong",
                auto_id=True  # Let Milvus generate primary keys
            )
        return milvus_client
    except Exception as e:
        print(f"Error initializing vector store: {e}")
        return None
def add_documents_to_vector_store(milvus_client, documents: List[Dict]):
try:
data = []
for doc in documents: # Removed the enumerate since we don't need manual ids
response = ollama.embeddings(
model='nomic-embed-text',
prompt=doc['content']
)
data.append({
"vector": response['embedding'], # Changed from 'embedding' to 'vector'
"content": doc['content'],
"timestamp": doc['metadata']['timestamp'],
"filename": doc['metadata']['filename'],
"duration": doc['metadata']['duration']
})
if len(data) > 0:
milvus_client.insert(
collection_name="audio_collection",
data=data
)
print("Successfully processed audio and stored embeddings")
except Exception as e:
print(f"Error storing documents in vector store: {e}")
def main(audio_path: str):
milvus_client = initialize_vector_store()
if milvus_client is None:
return
audio_documents = process_audio(audio_path)
if audio_documents:
add_documents_to_vector_store(milvus_client, audio_documents)
def search_similar(query: str, k: int = 5):
milvus_client = MilvusClient(uri="./no-lang-audio.db")
response = ollama.embeddings(
model='nomic-embed-text',
prompt=query
)
query_embedding = response['embedding']
results = milvus_client.search(
collection_name="audio_collection",
data=[query_embedding],
limit=k,
search_params={
"metric_type": "IP",
"params": {}
},
output_fields=["content", "timestamp", "filename", "duration"]
)
for hit in results[0]:
print(f"Similarity Score: {hit['distance']:.3f}")
print(f"---Content---\n")
print(f"Filename: {hit['entity']['filename']}")
print(f"Duration: {hit['entity']['duration']} seconds")
print(f"Timestamp: {hit['entity']['timestamp']} seconds")
print(f"{hit['entity']['content']}")
print("-------------------")
# Embed and store documents
main("condo-tour.mp3")
main("news.wav")
# Example search
search_similar("Fireworks", k=5)