AI is shifting fast. The best systems today aren’t just text-based anymore: they’re multimodal. They combine vision, speech, and language into one seamless interface that feels far more natural to humans.
Think about it. OpenAI’s GPT-4o made headlines with real-time voice conversations and image understanding. Google’s Gemini 2.5 added configurable thinking time and topped multimodal benchmarks. Even Claude 3.5 now interprets images and diagrams.
Multimodal AI isn’t an experiment on the fringes anymore. It’s quickly becoming the baseline. If you’re building AI products, tools, or agents, you can’t just think in text. You need to start designing systems that can see, listen, and respond across multiple channels.
Of course, there’s a catch. High-end commercial models are often closed source and expensive. But here’s the good news: the open-source ecosystem has exploded. You can now build your own multimodal AI assistant using free, widely available tools.
That’s exactly what this tutorial is about. I’ll walk you through creating a lightweight assistant in Python that can:
- Listen to a voice recording and transcribe it with open-source speech recognition (faster-whisper).
- Look at an image and describe it with image captioning AI (BLIP).
- Combine both inputs and generate a natural-sounding reply with a small language model.
We’re keeping this project lean, practical, and modular, powered by the open-source ecosystem at Hugging Face.
By the end of this guide, you’ll have a working prototype of an AI that sees and hears. It runs locally, can be integrated into apps, and gives you hands-on experience with the same building blocks powering the multimodal revolution.
Step 1. Set up your environment
Let's get the basics out of the way (using PowerShell).
```powershell
# Create a new folder for the project
mkdir multimodal-assistant

# Move into the project folder
cd multimodal-assistant

# Create a Python virtual environment in a folder called ".env"
python -m venv .env

# Activate the virtual environment (this makes sure pip installs go inside .env)
.env\Scripts\Activate.ps1

# Install PyTorch and Torchvision (core libraries for deep learning in Python)
pip install torch torchvision

# Install Hugging Face Transformers (for BLIP, GPT-2, and other models)
pip install transformers

# Install faster-whisper (for speech-to-text transcription)
pip install faster-whisper

# Install Pillow (Python Imaging Library, required for handling images with BLIP)
pip install pillow

# Create a new empty file called assistant.py where we'll add our code
New-Item assistant.py -ItemType File
```
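Before moving on, it can be worth a quick sanity check that PyTorch installed correctly and whether it can see a CUDA GPU. This check is optional and not part of the original setup:

```python
# Optional sanity check: confirm PyTorch imports and report whether a CUDA GPU is visible
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```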
Step 2. Make your assistant listen
Here’s a function that loads Whisper and transcribes any audio file you give it.
```python
# Import the optimized Whisper implementation
from faster_whisper import WhisperModel


def transcribe_audio(audio_path: str, model_name: str = "base") -> str:
    """
    Transcribe speech from an audio file into text using faster-whisper.

    :param audio_path: Path to the audio file (e.g., .wav, .mp3, .m4a)
    :param model_name: Which Whisper model size to use (tiny, base, small, medium, large-v3)
    :return: A string containing the transcribed text
    """
    # Load the Whisper model
    # "device" can be "cpu" for most systems, or "cuda" if you have an NVIDIA GPU configured
    model = WhisperModel(model_name, device="cpu")

    # Run the transcription process
    # segments = generator that yields text chunks with timestamps
    # info = metadata about the transcription (language, duration, etc.)
    segments, info = model.transcribe(audio_path)

    # Collect all the text segments into a single string
    # Strip removes extra spaces around each chunk
    return " ".join(s.text.strip() for s in segments).strip()
```
Try it out:
```python
# Example
transcript = transcribe_audio("files/Recording.m4a")  # Replace with the location of your audio file.
print("Transcript:", transcript)
```
It supports WAV, MP3, M4A and a bunch of other formats. It even figures out the spoken language automatically.
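If you're curious what it detected, the `info` object returned by `model.transcribe` exposes that metadata, and each segment carries timestamps. A minimal standalone sketch, reusing the example recording path from above:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")
segments, info = model.transcribe("files/Recording.m4a")

# info holds transcription metadata, including the detected language
print("Detected language:", info.language, f"(probability {info.language_probability:.2f})")

# Each segment exposes start/end timestamps alongside its text
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text.strip()}")
```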
Step 3. Make it see
Now we give our assistant eyes. BLIP generates smart captions for any image.
```python
# Import BLIP (Bootstrapped Language-Image Pretraining) tools from Hugging Face
# BlipProcessor handles preprocessing (turning images into tensors + tokenization)
# BlipForConditionalGeneration is the actual image captioning model
from transformers import BlipProcessor, BlipForConditionalGeneration

# PIL (Python Imaging Library) is used to load and handle images
from PIL import Image


def caption_image(image_path: str) -> str:
    """
    Generate a descriptive caption for an image using BLIP.

    :param image_path: Path to the image file (jpg, png, etc.)
    :return: A string caption describing the image
    """
    # Load the BLIP processor
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

    # Load the BLIP model trained for image captioning
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    # Open the image file and ensure it's in RGB mode (3 color channels)
    image = Image.open(image_path).convert("RGB")

    # Preprocess the image and prepare it as model input
    inputs = processor(images=image, return_tensors="pt")

    # Generate a caption (model outputs tokens representing words)
    outputs = model.generate(**inputs)

    # Decode the token IDs back into a human-readable string
    caption = processor.decode(outputs[0], skip_special_tokens=True)

    # Remove extra spaces and return the final caption
    return caption.strip()
```
Try it out:
```python
# Example
caption = caption_image("photo.jpg")  # Replace with the location of your image.
print("Caption:", caption)
```
You can swap in the larger blip-image-captioning-large checkpoint later if you want more detailed captions.
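If you try that, the only change is the checkpoint name. Here's a sketch of a hypothetical `caption_image_large` variant, same pipeline but pointing at `Salesforce/blip-image-captioning-large`:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image


def caption_image_large(image_path: str) -> str:
    # Same steps as caption_image, but loading the larger BLIP captioning checkpoint
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    outputs = model.generate(**inputs)
    return processor.decode(outputs[0], skip_special_tokens=True).strip()
```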
Step 4. Generate a response
Time to make it talk. This function combines the image and audio inputs and generates a friendly reply using GPT-2.
```python
# Import Hugging Face tools for tokenization and causal language models
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch  # PyTorch is the backend for running the model


def generate_reply(caption: str, transcript: str, model_name: str = "gpt2") -> str:
    """
    Generate a natural language reply that combines an image caption and an audio transcript.

    :param caption: Text describing the image
    :param transcript: Text transcribed from audio
    :param model_name: Hugging Face model name (default = "gpt2")
    :return: A generated reply string
    """
    # Load the tokenizer (converts text <-> tokens)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load the pretrained language model (causal LM like GPT-2)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Put the model in evaluation mode (disables dropout, ensures consistency)
    model.eval()

    # Build the prompt that the model will use to generate a reply
    # We feed both the image description and the audio transcript
    prompt = (
        "You are a creative AI assistant that understands both pictures and voices. "
        "Look at the image and listen to the audio. Then, craft a thoughtful reply that blends both together. "
        "Keep it engaging, friendly, and easy to follow.\n\n"
        f"Image provided: {caption}\n"
        f"Audio provided: {transcript}\n\n"
        "Assistant's response:"
    )

    # Convert the prompt into token IDs and prepare it as PyTorch tensors
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text without tracking gradients (saves memory and speeds up inference)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,  # Limit reply length
            do_sample=True,      # Enable sampling for more variety
            temperature=0.7      # Controls randomness (lower = more deterministic)
        )

    # Decode only the newly generated tokens (skip the original prompt)
    reply = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )

    # Return the final reply with whitespace trimmed
    return reply.strip()
```
The prompt is simple but works surprisingly well. Adjust it depending on the tone or format you want.
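For instance, here's a hypothetical variant of the prompt string that asks for a terse, factual summary instead of a friendly reply; everything else in `generate_reply` stays the same (the caption and transcript values below are just placeholders):

```python
# Placeholder inputs, standing in for the BLIP caption and Whisper transcript
caption = "a dog playing fetch in a park"
transcript = "Can you tell me what's happening here?"

# Hypothetical alternative prompt for a terse, factual tone
prompt = (
    "You are a concise AI assistant. In two short sentences, summarize what the image shows "
    "and what the speaker asked, without embellishment.\n\n"
    f"Image provided: {caption}\n"
    f"Audio provided: {transcript}\n\n"
    "Assistant's response:"
)
```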
Step 5. Put it all together
Here’s the main function that accepts an image and audio file, runs everything, and prints out a natural response.
```python
# Import argparse to handle command-line arguments
import argparse


def main():
    # Create a parser for command-line arguments
    # The description shows up when running: python assistant.py --help
    parser = argparse.ArgumentParser(description="Multimodal assistant that sees and hears.")

    # Add a required argument for the image file path
    parser.add_argument("--image", required=True)

    # Add a required argument for the audio file path
    parser.add_argument("--audio", required=True)

    # Add an optional argument to choose which Whisper model size to use
    # Default = "base", but you can pass tiny, small, medium, or large-v3
    parser.add_argument("--whisper_model", default="base")

    # Add an optional argument to choose the language model
    # Default = "gpt2", but you can use mistral, llama, etc.
    parser.add_argument("--lm_model", default="gpt2")

    # Parse the arguments from the command line
    args = parser.parse_args()

    # Generate a caption for the image using BLIP
    caption = caption_image(args.image)

    # Transcribe the audio file using faster-whisper (or Whisper)
    transcript = transcribe_audio(args.audio, model_name=args.whisper_model)

    # Generate a conversational reply combining caption + transcript
    reply = generate_reply(caption, transcript, model_name=args.lm_model)

    # Print all results in a readable format
    print("Caption:", caption)
    print("Transcript:", transcript)
    print("\nAssistant reply:\n", reply)


# This ensures the main() function runs only when the file is executed directly
if __name__ == "__main__":
    main()
```
Run it like this:
```powershell
python assistant.py --image files/image.jpg --audio files/Recording.m4a
```
And that’s it! Your assistant sees, hears, and responds.
This is just a prototype, but if you're interested in going deeper into multimodal AI, you can simply keep building on top of this assistant. For example, you can:
- Add a frontend with Streamlit or Gradio (see the sketch after this list).
- Use LLaVA to replace BLIP + GPT-2 with a single model.
- Add video or live mic input.
- Run everything on-device for privacy.
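To make the frontend idea concrete, here's a minimal Gradio sketch. It assumes gradio is installed (pip install gradio) and that the three functions above live in assistant.py so they can be imported:

```python
import gradio as gr

# Assumes caption_image, transcribe_audio, and generate_reply are defined in assistant.py
from assistant import caption_image, transcribe_audio, generate_reply


def respond(image_path, audio_path):
    # Run the same three-step pipeline as the command-line version
    caption = caption_image(image_path)
    transcript = transcribe_audio(audio_path)
    return generate_reply(caption, transcript)


# The image and audio widgets hand the function file paths, matching our helpers
demo = gr.Interface(
    fn=respond,
    inputs=[gr.Image(type="filepath"), gr.Audio(type="filepath")],
    outputs="text",
    title="Multimodal assistant that sees and hears",
)

demo.launch()
```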
The sky is the limit… :)
The Tools Powering Our Multimodal Assistant
Now that you’ve seen the code in action, let’s take a step back and look at the building blocks that made this possible. Each tool handles a different “sense”: listening, looking, and responding. And together they form the core of our multimodal assistant.
Listening with Faster-Whisper
For audio, we used faster-whisper, an optimized version of OpenAI’s Whisper model. It’s designed to run efficiently on both CPUs and GPUs, and it can handle formats like MP3, WAV, and M4A. Because it was trained on hundreds of thousands of hours of real audio, it’s robust against background noise and accents, making it a reliable choice for speech-to-text.
Looking with BLIP
On the vision side, we relied on BLIP (Bootstrapped Language-Image Pretraining). Available on Hugging Face, BLIP generates natural captions for images. Instead of simply recognizing objects, it describes scenes in human-like language, which gives your assistant the ability to “see” and explain what’s happening in a picture.
Responding with a Language Model
Once we had a transcript from the audio and a caption from the image, we used a language model to tie everything together into a natural reply. In this tutorial we kept it light with GPT-2, but Hugging Face makes it easy to plug in more advanced models like Mistral 7B or LLaMA if you need more depth or fluency.
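Because generate_reply takes the model name as a parameter, swapping models is a one-line change. As an illustration, distilgpt2 below is just an example of a drop-in causal LM; larger instruction-tuned models work the same way but need far more memory, and some are gated behind a Hugging Face access token:

```python
from assistant import generate_reply

# Plug in a different causal LM by its Hugging Face name; the inputs here are placeholders
reply = generate_reply(
    caption="a cat sleeping on a windowsill",
    transcript="What do you see in this picture?",
    model_name="distilgpt2",
)
print(reply)
```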
Why Hugging Face Matters
Behind the scenes, Hugging Face is what makes this whole pipeline smooth. The Transformers library provides a consistent way to load and run these models without juggling multiple toolkits or APIs. It’s the reason we can go from “I want an assistant that sees and hears” to “here’s a working prototype” in just a few dozen lines of Python.
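To give a flavor of that consistency, the same high-level pipeline() helper covers both modalities. A small sketch: the BLIP checkpoint is the one used above, the Whisper checkpoint here is the openai/whisper-base weights from the Hub rather than faster-whisper, and decoding .m4a audio requires ffmpeg on your system.

```python
from transformers import pipeline

# One consistent API across very different modalities
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

print(captioner("photo.jpg"))              # e.g. [{'generated_text': '...'}]
print(transcriber("files/Recording.m4a"))  # e.g. {'text': '...'}
```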
A Few Things to Keep in Mind
Before you get too excited showing off your new assistant, here are a couple of realities worth knowing:
- It won’t feel instant. Remember, you’re running three separate models back-to-back. There’s a little lag, and that’s normal. If you need speed, you might want to try end-to-end multimodal models like LLaVA.
- Your hardware sets the limits. Got a powerful GPU? Great, you’ll get smoother and more accurate results. Running on a modest laptop? Stick to the smaller models unless you’re in the mood for coffee breaks.
- The models can still get it wrong. Whisper and BLIP are surprisingly good, but they’re not perfect. Sometimes they’ll mishear a word or misdescribe an image. If accuracy matters, always double-check the output.
- Don’t forget privacy. Audio and images are personal data. The nice part about this setup is you can run it locally, but it’s still smart to think about how you handle and store user input.
Final Thoughts
The Hugging Face ecosystem has made it incredibly easy to tinker with multimodal AI. In just a few lines of Python, you can connect speech, vision, and language models and watch them work together like pieces of a puzzle.
And here’s the exciting part: multimodal AI isn’t just for big tech anymore. If you’re building AI systems, it’s time to think beyond text. People don’t only type. We talk, we share photos, we gesture, we record video. AI should meet us there.
Thanks to open-source models, you can now build assistants that see, hear, and respond in natural language, without paying for expensive APIs or being locked into closed platforms.
So here’s my challenge to you: experiment. Try swapping in different models, add new modalities, or even build your own assistant. And when you do, I’d love to hear about it. Share your experience with multimodal tools in the comments.
If this guide helped spark an idea, give it a few claps so more people can discover it, and let's learn from each other's multimodal experiments.