AI is shifting fast. The best systems today aren’t just text-based anymore: they’re multimodal. They combine vision, speech, and language into one seamless interface that feels far more natural to humans.
Think about it. OpenAI’s GPT-4o made headlines with real-time voice conversations and image understanding. Google’s Gemini 2.5 added configurable thinking time and topped multimodal benchmarks. Even Claude 3.5 now interprets images and diagrams.
Multimodal AI isn’t an experiment on the fringes anymore. It’s quickly becoming the baseline. If you’re building AI products, tools, or agents, you can’t just think in text. You need to start designing systems that can see, listen, and respond across multiple channels.
Of course, there’s a catch. High-end commercial models are often closed source and expensive. But here’s the good news: the open-source ecosystem has exploded. You can now build your own multimodal AI assistant using free, widely available tools.
That’s exactly what this tutorial is about. I’ll walk you through creating a lightweight assistant in Python that can:
- Listen to a voice recording and transcribe it with open-source speech recognition (faster-whisper).
- Look at an image and describe it with image captioning AI (BLIP).
- Combine both inputs and generate a natural-sounding reply with a small language model.
We’re keeping this project lean, practical, and modular, powered by the open-source ecosystem at Hugging Face.
By the end of this guide, you’ll have a working prototype of an AI that sees and hears. It runs locally, can be integrated into apps, and gives you hands-on experience with the same building blocks powering the multimodal revolution.
Step 1. Set up your environment
Let's get the basics out of the way (using PowerShell).
```powershell
# Create a new folder for the project
mkdir multimodal-assistant

# Move into the project folder
cd multimodal-assistant

# Create a Python virtual environment in a folder called ".env"
python -m venv .env

# Activate the virtual environment (this makes sure pip installs go inside .env)
.env\Scripts\Activate.ps1

# Install PyTorch and Torchvision (core libraries for deep learning in Python)
pip install torch torchvision

# Install Hugging Face Transformers (for BLIP, GPT-2, and other models)
pip install transformers

# Install faster-whisper (for speech-to-text transcription)
pip install faster-whisper

# Install Pillow (Python Imaging Library, required for handling images with BLIP)
pip install pillow

# Create a new empty file called assistant.py where we'll add our code
New-Item assistant.py -ItemType File
```
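Before moving on, it can be worth a quick sanity check that PyTorch installed correctly and whether it can see a CUDA GPU. This check is optional and not part of the original setup:

```python
# Optional sanity check: confirm PyTorch imports and report whether a CUDA GPU is visible
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```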
Step 2. Make your assistant listen
Here’s a function that loads Whisper and transcribes any audio file you give it.
```python
# Import the optimized Whisper implementation
from faster_whisper import WhisperModel


def transcribe_audio(audio_path: str, model_name: str = "base") -> str:
    """
    Transcribe speech from an audio file into text using faster-whisper.

    :param audio_path: Path to the audio file (e.g., .wav, .mp3, .m4a)
    :param model_name: Which Whisper model size to use (tiny, base, small, medium, large-v3)
    :return: A string containing the transcribed text
    """
    # Load the Whisper model
    # "device" can be "cpu" for most systems, or "cuda" if you have an NVIDIA GPU configured
    model = WhisperModel(model_name, device="cpu")

    # Run the transcription process
    # segments = generator that yields text chunks with timestamps
    # info = metadata about the transcription (language, duration, etc.)
    segments, info = model.transcribe(audio_path)

    # Collect all the text segments into a single string
    # Strip removes extra spaces around each chunk
    return " ".join(s.text.strip() for s in segments).strip()
```
Try it out:
```python
# Example
transcript = transcribe_audio("files/Recording.m4a")  # Replace with the location of your audio file.
print("Transcript:", transcript)
```
It supports WAV, MP3, M4A and a bunch of other formats. It even figures out the spoken language automatically.
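If you're curious what it detected, the `info` object returned by `model.transcribe` exposes that metadata, and each segment carries timestamps. A minimal standalone sketch, reusing the example recording path from above:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")
segments, info = model.transcribe("files/Recording.m4a")

# info holds transcription metadata, including the detected language
print("Detected language:", info.language, f"(probability {info.language_probability:.2f})")

# Each segment exposes start/end timestamps alongside its text
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text.strip()}")
```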
Step 3. Make it see
Now we give our assistant eyes. BLIP generates smart captions for any image.
```python
# Import BLIP (Bootstrapped Language-Image Pretraining) tools from Hugging Face
# BlipProcessor handles preprocessing (turning images into tensors + tokenization)
# BlipForConditionalGeneration is the actual image captioning model
from transformers import BlipProcessor, BlipForConditionalGeneration

# PIL (Python Imaging Library) is used to load and handle images
from PIL import Image


def caption_image(image_path: str) -> str:
    """
    Generate a descriptive caption for an image using BLIP.

    :param image_path: Path to the image file (jpg, png, etc.)
    :return: A string caption describing the image
    """
    # Load the BLIP processor
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

    # Load the BLIP model trained for image captioning
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    # Open the image file and ensure it's in RGB mode (3 color channels)
    image = Image.open(image_path).convert("RGB")

    # Preprocess the image and prepare it as model input
    inputs = processor(images=image, return_tensors="pt")

    # Generate a caption (model outputs tokens representing words)
    outputs = model.generate(**inputs)

    # Decode the token IDs back into a human-readable string
    caption = processor.decode(outputs[0], skip_special_tokens=True)

    # Remove extra spaces and return the final caption
    return caption.strip()
```
Try it out:
```python
# Example
caption = caption_image("photo.jpg")  # Replace with the location of your image.
print("Caption:", caption)
```
You can swap in the larger blip-image-captioning-large checkpoint later if you want more detailed captions.
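If you try that, the only change is the checkpoint name. Here's a sketch of a hypothetical `caption_image_large` variant, same pipeline but pointing at `Salesforce/blip-image-captioning-large`:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image


def caption_image_large(image_path: str) -> str:
    # Same steps as caption_image, but loading the larger BLIP captioning checkpoint
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    outputs = model.generate(**inputs)
    return processor.decode(outputs[0], skip_special_tokens=True).strip()
```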
Step 4. Generate a response
Time to make it talk. This function combines the image and audio inputs and generates a friendly reply using GPT-2.
```python
# Import Hugging Face tools for tokenization and causal language models
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch  # PyTorch is the backend for running the model


def generate_reply(caption: str, transcript: str, model_name: str = "gpt2") -> str:
    """
    Generate a natural language reply that combines an image caption and an audio transcript.

    :param caption: Text describing the image
    :param transcript: Text transcribed from audio
    :param model_name: Hugging Face model name (default = "gpt2")
    :return: A generated reply string
    """
    # Load the tokenizer (converts text <-> tokens)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load the pretrained language model (causal LM like GPT-2)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Put the model in evaluation mode (disables dropout, ensures consistency)
    model.eval()

    # Build the prompt that the model will use to generate a reply
    # We feed both the image description and the audio transcript
    prompt = (
        "You are a creative AI assistant that understands both pictures and voices. "
        "Look at the image and listen to the audio. Then, craft a thoughtful reply that blends both together. "
        "Keep it engaging, friendly, and easy to follow.\n\n"
        f"Image provided: {caption}\n"
        f"Audio provided: {transcript}\n\n"
        "Assistant's response:"
    )

    # Convert the prompt into token IDs and prepare it as PyTorch tensors
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text without tracking gradients (saves memory and speeds up inference)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,  # Limit reply length
            do_sample=True,      # Enable sampling for more variety
            temperature=0.7      # Controls randomness (lower = more deterministic)
        )

    # Decode only the newly generated tokens (skip the original prompt)
    reply = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )

    # Return the final reply with whitespace trimmed
    return reply.strip()
```
The prompt is simple but works surprisingly well. Adjust it depending on the tone or format you want.
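For instance, here's a hypothetical variant of the prompt string that asks for a terse, factual summary instead of a friendly reply; everything else in `generate_reply` stays the same (the caption and transcript values below are just placeholders):

```python
# Placeholder inputs, standing in for the BLIP caption and Whisper transcript
caption = "a dog playing fetch in a park"
transcript = "Can you tell me what's happening here?"

# Hypothetical alternative prompt for a terse, factual tone
prompt = (
    "You are a concise AI assistant. In two short sentences, summarize what the image shows "
    "and what the speaker asked, without embellishment.\n\n"
    f"Image provided: {caption}\n"
    f"Audio provided: {transcript}\n\n"
    "Assistant's response:"
)
```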
Step 5. Put it all together
Here’s the main function that accepts an image and audio file, runs everything, and prints out a natural response.
```python
# Import argparse to handle command-line arguments
import argparse


def main():
    # Create a parser for command-line arguments
    # The description shows up when running: python assistant.py --help
    parser = argparse.ArgumentParser(description="Multimodal assistant that sees and hears.")

    # Add a required argument for the image file path
    parser.add_argument("--image", required=True)

    # Add a required argument for the audio file path
    parser.add_argument("--audio", required=True)

    # Add an optional argument to choose which Whisper model size to use
    # Default = "base", but you can pass tiny, small, medium, or large-v3
    parser.add_argument("--whisper_model", default="base")

    # Add an optional argument to choose the language model
    # Default = "gpt2", but you can use mistral, llama, etc.
    parser.add_argument("--lm_model", default="gpt2")

    # Parse the arguments from the command line
    args = parser.parse_args()

    # Generate a caption for the image using BLIP
    caption = caption_image(args.image)

    # Transcribe the audio file using faster-whisper (or Whisper)
    transcript = transcribe_audio(args.audio, model_name=args.whisper_model)

    # Generate a conversational reply combining caption + transcript
    reply = generate_reply(caption, transcript, model_name=args.lm_model)

    # Print all results in a readable format
    print("Caption:", caption)
    print("Transcript:", transcript)
    print("\nAssistant reply:\n", reply)


# This ensures the main() function runs only when the file is executed directly
if __name__ == "__main__":
    main()
```
Run it like this:
```powershell
python assistant.py --image files/image.jpg --audio files/Recording.m4a
```
And that’s it! Your assistant sees, hears, and responds.
This is just a prototype, but if you're interested in going deeper into multimodal AI, you can simply keep building on top of this assistant. For example, you can:
- Add a frontend with Streamlit or Gradio (see the sketch after this list).
- Use LLaVA to replace BLIP + GPT-2 with a single model.
- Add video or live mic input.
- Run everything on-device for privacy.
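To make the frontend idea concrete, here's a minimal Gradio sketch. It assumes gradio is installed (pip install gradio) and that the three functions above live in assistant.py so they can be imported:

```python
import gradio as gr

# Assumes caption_image, transcribe_audio, and generate_reply are defined in assistant.py
from assistant import caption_image, transcribe_audio, generate_reply


def respond(image_path, audio_path):
    # Run the same three-step pipeline as the command-line version
    caption = caption_image(image_path)
    transcript = transcribe_audio(audio_path)
    return generate_reply(caption, transcript)


# The image and audio widgets hand the function file paths, matching our helpers
demo = gr.Interface(
    fn=respond,
    inputs=[gr.Image(type="filepath"), gr.Audio(type="filepath")],
    outputs="text",
    title="Multimodal assistant that sees and hears",
)

demo.launch()
```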
The sky is the limit… :)
The Tools Powering Our Multimodal Assistant
Now that you’ve seen the code in action, let’s take a step back and look at the building blocks that made this possible. Each tool handles a different “sense”: listening, looking, and responding. And together they form the core of our multimodal assistant.
Listening with Faster-Whisper
For audio, we used faster-whisper, an optimized version of OpenAI’s Whisper model. It’s designed to run efficiently on both CPUs and GPUs, and it can handle formats like MP3, WAV, and M4A. Because it was trained on hundreds of thousands of hours of real audio, it’s robust against background noise and accents, making it a reliable choice for speech-to-text.
Looking with BLIP
On the vision side, we relied on BLIP (Bootstrapped Language-Image Pretraining). Available on Hugging Face, BLIP generates natural captions for images. Instead of simply recognizing objects, it describes scenes in human-like language, which gives your assistant the ability to “see” and explain what’s happening in a picture.
Responding with a Language Model
Once we had a transcript from the audio and a caption from the image, we used a language model to tie everything together into a natural reply. In this tutorial we kept it light with GPT-2, but Hugging Face makes it easy to plug in more advanced models like Mistral 7B or LLaMA if you need more depth or fluency.
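Because generate_reply takes the model name as a parameter, swapping models is a one-line change. As an illustration, distilgpt2 below is just an example of a drop-in causal LM; larger instruction-tuned models work the same way but need far more memory, and some are gated behind a Hugging Face access token:

```python
from assistant import generate_reply

# Plug in a different causal LM by its Hugging Face name; the inputs here are placeholders
reply = generate_reply(
    caption="a cat sleeping on a windowsill",
    transcript="What do you see in this picture?",
    model_name="distilgpt2",
)
print(reply)
```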
Why Hugging Face Matters
Behind the scenes, Hugging Face is what makes this whole pipeline smooth. The Transformers library provides a consistent way to load and run these models without juggling multiple toolkits or APIs. It’s the reason we can go from “I want an assistant that sees and hears” to “here’s a working prototype” in just a few dozen lines of Python.
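To give a flavor of that consistency, the same high-level pipeline() helper covers both modalities. A small sketch: the BLIP checkpoint is the one used above, the Whisper checkpoint here is the openai/whisper-base weights from the Hub rather than faster-whisper, and decoding .m4a audio requires ffmpeg on your system.

```python
from transformers import pipeline

# One consistent API across very different modalities
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

print(captioner("photo.jpg"))              # e.g. [{'generated_text': '...'}]
print(transcriber("files/Recording.m4a"))  # e.g. {'text': '...'}
```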
A Few Things to Keep in Mind
Before you get too excited showing off your new assistant, here are a couple of realities worth knowing:
- It won’t feel instant. Remember, you’re running three separate models back-to-back. There’s a little lag, and that’s normal. If you need speed, you might want to try end-to-end multimodal models like LLaVA.
- Your hardware sets the limits. Got a powerful GPU? Great, you’ll get smoother and more accurate results. Running on a modest laptop? Stick to the smaller models unless you’re in the mood for coffee breaks.
- The models can still get it wrong. Whisper and BLIP are surprisingly good, but they’re not perfect. Sometimes they’ll mishear a word or misdescribe an image. If accuracy matters, always double-check the output.
- Don’t forget privacy. Audio and images are personal data. The nice part about this setup is you can run it locally, but it’s still smart to think about how you handle and store user input.
Final Thoughts
The Hugging Face ecosystem has made it incredibly easy to tinker with multimodal AI. In just a few lines of Python, you can connect speech, vision, and language models and watch them work together like pieces of a puzzle.
And here’s the exciting part: multimodal AI isn’t just for big tech anymore. If you’re building AI systems, it’s time to think beyond text. People don’t only type. We talk, we share photos, we gesture, we record video. AI should meet us there.
Thanks to open-source models, you can now build assistants that see, hear, and respond in natural language, without paying for expensive APIs or being locked into closed platforms.
So here’s my challenge to you: experiment. Try swapping in different models, add new modalities, or even build your own assistant. And when you do, I’d love to hear about it. Share your experience with multimodal tools in the comments.
If this guide helped spark an idea, give it a few claps so more people can discover it, and let's learn from each other's multimodal experiments.