KOSMOS-2 Explained: Microsoft’s Multimodal Marvel

For years, vision language models have struggled to replicate the human ability to truly “see” and comprehend the spatial relationships between objects and language.

In 2023, Microsoft researchers developed a new model that directly tackles this, introducing grounding capabilities. This innovation empowers the AI to create direct links between textual descriptions and specific visual regions in an image.

This marks a significant leap from traditional models, which only provide general descriptions, to a system that can pinpoint exactly what it is talking about.

Therefore, I decided to put this new approach through rigorous, real-world experimentation. In this blog, I will show you precisely how Microsoft’s 2023 innovation performs and evaluate whether it truly delivers on its promi…

For years, vision language models have struggled to replicate the human ability to truly “see” and comprehend the spatial relationships between objects and language.

This marks a significant leap from traditional models, which only provide general descriptions, to a system that can pinpoint exactly what it is talking about.

What is Vision Language Model or VLM?

For years, AI development ran on parallel tracks. We had computer vision models that became incredibly adept at identifying objects in images, and we had natural language processing (NLP) models that learned to write, translate, and summarize text with stunning fluency.

A Vision-Language Model, or VLM, is the ambitious fusion of these two domains. It is an AI system designed to understand the intricate relationships between what we see and what we say.

How is VLM different from LLM or Large Language Model?

The most profound difference between a traditional Large Language Model (LLM) and a VLM lies in their knowledge. An LLM like the early versions of ChatGPT is a master of abstract concepts derived purely from text.

It can write a beautiful poem about a “sunset over a beach” without ever having “seen” one; its understanding is built entirely on the statistical patterns of how humans have written about sunsets.

VLMs, in contrast, use “the visual reality”. They are trained on massive datasets containing millions, or even billions, of image-text pairs.

This process forces the model to connect the word “sunset” not just to other words, but to the actual pixels, colors, and shapes that constitute a sunset in a photograph. This grounding in the visual world makes their understanding richer, more concrete, and fundamentally closer to how humans perceive the world.

What is KOSMOS-2?

Now, let me introduce you to KOSMOS-2, Microsoft’s ambitious step toward creating a model that can truly understand and interact with the visual data in ways that were previously impossible.

KOSMOS-2 is a Multimodal Large Language Model (MLLM) that enables new capabilities of perceiving object descriptions and grounding text to the visual world.

This model represents a significant advancement in multimodal AI by introducing what researchers call “grounding capability”, the ability to link textual descriptions directly to specific regions or objects in images.

Instead of designing a complex, separate module for object detection, the researchers reframed grounding as a language modeling problem. They accomplished this by creating a special data format that looks like a hyperlink in Markdown: [text span](bounding boxes).

When KOSMOS-2 generates a description, it looks something like this:

An image of <phrase>a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming himself by <phrase>a fire</phrase><object><patch_index_0005><patch_index_0911></object>.

About the architecture of KOSMOS-2

Let me walk you through the technical architecture that makes KOSMOS-2’s grounding capabilities possible.

Core Architecture Components

KOSMOS-2 is built on a Transformer-based causal language model that combines the power of GPT-style language modeling with CLIP-inspired vision processing. The architecture consists of several key components:

1. Vision Encoder The model uses a vision transformer to process images, extracting meaningful visual features that can be understood by the language model. This encoder converts raw pixel values into a sequence of visual tokens.

2. Text Encoder Based on XLMRoberta, the text encoder processes both regular text and special location tokens that represent spatial coordinates.

3. Multimodal Fusion Layer This crucial component combines visual and textual representations, enabling the model to understand the relationships between what it sees and what it reads.

Training Process

The model is trained using next-word prediction on the GrIT dataset, where the training loss considers both text tokens and location tokens equally. This unified approach allows the model to learn the associations between textual descriptions and spatial locations seamlessly.

During my implementation experiments, I found that the model’s text format looks like this:

<grounding> An image of a snowman wearing a hat and standing in front of a fire.

Experimenting with KOSMOS-2

KOSMOS-2 prompt uses the <grounding> tag to also add the location of objects in the result.

snowman image with prompt

Note: This result’s “cleanup & extract” is set to true before the result is shown. You can change it to false if you want the location address of objects in the result.


Question: what is the color of the snowman?
Answer: what is the color of the snowman? The color of snowman is white.
[]
--------------------------------------------------
Question: <grounding> what is the color of the snowman's hat?
Answer: what is the color of the snowman's hat? The snowman''s cap is blue.
[("The snowman''s cap", (40, 58), [(0.546875, 0.171875, 0.703125, 0.234375)])]
--------------------------------------------------
Question: what is snowman doing in the image?
Answer: what is snowman doing in the image? The snowman is sitting by the fire in the snow.
[]
--------------------------------------------------
Question: is something bad going to happen in the image?
Answer: is something bad going to happen in the image? yes
[]
--------------------------------------------------
Question: what will happen to snowman?
Answer: what will happen to snowman? Snowman will melt away in the heat of the fire.
[]
--------------------------------------------------
Question: what will happen if the snowman melts?
Answer: what will happen if the snowman melts? If the snowman melting, it will cause a lot of damage to the environment. Snowmen are known for their ability to melt snow, which is why they are often used as a symbol of winter. If the melting snowman is not controlled, it can lead to flooding, erosion, and other environmental issues. It is important to keep the snowman in a safe and controlled environment to prevent any damage to its surroundings.
[]
--------------------------------------------------

Result based on prompt

As you see, when I added <grounding>, it also provides the location of the objects.

The result of the KOSMOS-2 is both good and bad when asking a simple question like the colour of his hat; it gave the right answer with proper object location.

But when questions became complicated, like “what will happen if the snowman melts?”, it started hallucinating with gibberish, unrelated sentences.

In simple questions, it performed well, which shows it has not great but still better understanding of image than most of the model of its time.

two elephant image with prompt


Question: <grounding> An image of
Answer: An image of two lions standing next to each other in a zoo enclosure.
[('two lions', (12, 21), [(0.015625, 0.234375, 0.453125, 0.859375)])]
--------------------------------------------------
Question: Which animals are in this image?
Answer: Which animals are in this image? In the image, there are two animals.
[]
--------------------------------------------------
Question: What are the elephants doing in this image?
Answer: What are the elephants doing in this image? In the image, two elephants are standing close to each other, possibly in a zoo enclosure. They are both facing the same direction, with one elephant slightly to the left and the other slightly to its right.
[]
--------------------------------------------------

Result based on prompt

Here, in a simple question of describing the image, the model failed to understand and even provided us with the wrong answer, like “two lions “ which are not even in the image.

A simple question like “Which animals are in this image?” results are not meaningful.

KOSMOS failed here miserably.

telephone booth infront of big ben image with prompts


Question: <grounding> An image of
Answer: An image of a tall building and a clock tower in London, England.
[('a tall building', (12, 27), [(0.484375, 0.078125, 0.734375, 0.796875)]), ('a clock tower', (32, 45), [(0.546875, 0.078125, 0.640625, 0.671875)])]
--------------------------------------------------
Question: Where is this image taken?
Answer: Where is this image taken? This image is taken in London, England.
[]
--------------------------------------------------
Question: What is the tallest object in the image?
Answer: What is the tallest object in the image? The tallest object is a clock tower.
[]
--------------------------------------------------
Question: Any person in the image?
Answer: Any person in the image? Yes
[]
--------------------------------------------------
Question: <grounding> where is person in the image?
Answer: where is person in the image? The person in the image is a cityscape with a blue sky and a building with a clock tower.
[('the image', (44, 53), [(0.015625, 0.015625, 0.984375, 0.984375)])]
--------------------------------------------------

Result based on prompt

Again model works well here when asked with a simple question and even provided the accurate location, but when the question became complicated, it failed to understand the context of the language and object in the image.

The model failed to locate any person in the image and also provided us with wrong unrelated answer.

How to Run KOSMOS-2 on Your Local Device?

To run KOSMOS locally, follow these steps:

Step 1: Install the required libraries.


%pip install torch transformers pillow
%pip install accelerate pillow

Required Libraries

Step 2: Import the following libraries.


from PIL import Image
import requests
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration
from io import BytesIO
import os

Importing the Libraries

Step 3: (Optional) Create a helper function to visualize the image using a URL.


def show_image(source):
"""
Display an image from a URL or a local file path.

Args:
source (str): The URL or local file path of the image.
"""
try:
if source.startswith("http://") or source.startswith("https://"):
# Load image from URL
response = requests.get(source)
response.raise_for_status()  # Raise exception for bad response
img = Image.open(BytesIO(response.content))
elif os.path.exists(source):
# Load image from local file path
img = Image.open(source)
else:
raise ValueError("Invalid source. Provide a valid URL or local file path.")

display(img)

except Exception as e:
print(f"Error displaying image: {e}")

Function to visualize the image

Step 4: Creating a function to implement KOSMOS-2 using HuggingFace.

Here, I am using the KOSMOS-2 model called microsoft/kosmos-2-patch14-224 . You can choose any model from the various available models in Huggingface for various other tasks.

You start by loading the model and the processor.


checkpoint = "microsoft/kosmos-2-patch14-224"
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

Loading Model and Processor

Then create a function to implement KOSMOS-2.


def kosmos2_generate(image, prompt="<grounding> Write your prompt", cleanup_and_extract=False):
"""
Generate text from an image using the Kosmos-2 model.

Args:
image (PIL.Image): The input image.
prompt (str): The text prompt to guide the generation.
max_new_tokens (int): Maximum number of tokens to generate.

Returns:
str: The generated text.
"""
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
pixel_values=inputs["pixel_values"],
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
image_embeds=None,
image_embeds_position_mask=inputs["image_embeds_position_mask"],
use_cache=True,
max_new_tokens=256
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=cleanup_and_extract)

return processed_text

Function to implement KOSMOS-2

Step 5: Implementing KOSMOS-2.


kosmos2_generate(image1, prompt="<grounding> An image of")

Provide the function with image and a prompt

Conclusion

From my hands-on experimentation with KOSMOS-2, it’s clear that Microsoft’s approach to grounding marks a step forward in bridging vision and language.

The model demonstrates impressive capabilities in associating textual phrases with visual regions, especially for straightforward prompts and well-lit, distinct images.

However, it still struggles with contextual complexity and logical reasoning, often producing hallucinated or incorrect outputs when faced with open-ended or abstract questions.

While KOSMOS-2 is not yet perfect, it signals a shift toward a more grounded and human-like understanding in multimodal AI. Its performance shows promise, particularly as a foundational model for downstream applications in captioning, VQA, and image-region alignment.

FAQs

Q1: What makes KOSMOS-2 different from other vision-language models?

A: KOSMOS-2 introduces grounding capabilities, enabling the model to link textual descriptions directly to image regions, unlike traditional VLMs that only generate general descriptions.

Q2: Why does KOSMOS-2 sometimes hallucinate or give wrong answers?

A: While strong at grounding basic prompts, KOSMOS-2 still lacks deep reasoning and context understanding, leading to hallucinations or irrelevant responses in complex scenarios.

Q3: Can I run KOSMOS-2 on my local device?

A: Yes, you can run KOSMOS-2 locally using Hugging Face Transformers with minimal setup and visualize image-grounded responses using a few lines of Python code.

References

Notebook

What is Vision Language Model or VLM?

How is VLM different from LLM or Large Language Model?

What is KOSMOS-2?

About the architecture of KOSMOS-2

Core Architecture Components

Training Process

Experimenting with KOSMOS-2

How to Run KOSMOS-2 on Your Local Device?

Conclusion

FAQs

Q1: What makes KOSMOS-2 different from other vision-language models?

Q2: Why does KOSMOS-2 sometimes hallucinate or give wrong answers?

Q3: Can I run KOSMOS-2 on my local device?

References

Similar Posts