- 07 Nov, 2025 *
Trying to make the AI understand an image from the first person perspective.
Proposed a novel task called Egocentric Interaction Reasoning and pixel-Grounding (Ego-IRG) to facilitate the study of comprehensive egocentric interaction.
Methodology:
The main goal of this task (Ego-IRG) is to have an AI respond to a user’s question about a first-person (egocentric) image by providing both a text answer and a visual one (a pixel-level highlight).
The Three Steps: The paper then explains that this process happens in three “progressive” sub-tasks. The word “progressive” here implies that the tasks build on one another in a logical sequence:
Analyzing: First, the AI looks at the image and gets a general understanding of the interaction happening. It’s the “big picture” view.
Answering: Next, it focuses on the user’s specific question and formulates a direct text answer.
Pixel Grounding: Finally, it visually proves its answer by creating a precise mask or outline over the specific object(s) it just talked about in its answer.
Inputs given to the AI:
An egocentric image - I
A text query - T
Response (Output) given by the AI:
R_D -> Response Description: a general description that the AI generates of the interaction in the scene presented to it.
R_A -> Response Answer: the direct answer given in response to the user's text query (T).
R_M -> Response Mask: visual proof - a mask/outline over the relevant object(s) in the image.
The ANNEXE Model:
First model of its kind which is able to produce both text and visual answers as the output!
Text Generation Module: handles the text side of the architecture - responsible for generating R_D and R_A.
Image Generation Module: handles the visual side of the architecture - produces R_M, the mask highlighting the part of the input image relevant to the user's text query.
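A rough sketch of how the two modules could compose into a single forward pass. This is my own simplification, not the authors' code; in particular, I am assuming the mask module is conditioned on the text output, and the exact wiring in the paper may differ:

```python
class ANNEXESketch:
    """Simplified composition of ANNEXE's two modules (names are placeholders)."""

    def __init__(self, text_module, image_module):
        self.text_module = text_module    # MLLM-based: produces R_D and R_A from (I, T)
        self.image_module = image_module  # mask generator: produces R_M

    def forward(self, image, query):
        description, answer = self.text_module(image, query)  # analyze + answer
        mask = self.image_module(image, query, answer)         # ground the answer in pixels
        return description, answer, mask
```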
Text Module - in depth
So the role of this module involves two important parts:
Analyze: Look at the image and then provide a general description of what is happening
Answer: Understand the user’s question and then give an answer
So what to use? The authors realised that MLLMs are the best fit for this job, as they are trained on lots of image and text data and have an intuitive understanding of the relationship between words and the images they describe.
Now the thing is, we have an image as an input, and the language model at the core of an MLLM cannot consume raw pixels. Think of it like this: pixels are not a language it can grasp, so we need a translator that converts the image into numbers/features that can be given to the MLLM to work with.
So for this we have an image encoder that takes in the image and converts it to features (F_img_enc). This list of numbers captures the essence of the image: the objects, their positions, colors, and textures.
For the user query text we use another encoder, the Prompt Encoder, which reads the text of the query (T) and translates it into a different list of numbers (F_que_enc). This list captures the semantic meaning of the question.
Now we have a proper representation of both of our inputs - the image as well as the user query!
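Something like the snippet below is what I picture for the two encoders. The paper does not say CLIP is what ANNEXE uses; it just stands in here to show the "image features + query features" idea, and the file path and query are made-up examples:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP is only a stand-in encoder pair; ANNEXE's actual encoders may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("egocentric_frame.jpg")          # placeholder path
query = "Which object is my right hand touching?"   # example text query T

inputs = processor(text=[query], images=image, return_tensors="pt")
with torch.no_grad():
    F_img_enc = model.get_image_features(pixel_values=inputs["pixel_values"])
    F_que_enc = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])

print(F_img_enc.shape, F_que_enc.shape)  # e.g. torch.Size([1, 512]) torch.Size([1, 512])
```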
One doubt that I myself had while reading the paper up to this point was: “I get the idea that we can generate an answer to the user query, after all we have the query as an input, right? But how can we generate a general description of the image scene?
The answering part is easy, but how the hell do we carry out the analyzing part of the response?”
And the paper answers it beautifully by introducing a hidden interaction description T_a to guide the MLLM to predict the interaction description R_D precisely.
This is a piece of text that the researchers give to the model behind the scenes, which essentially says, “For this task, please just describe the main interaction in the image.”
So considering this, the MLLM receives three inputs!
The image’s features (F_img_enc).
The hidden instruction’s features (F_ins_enc).
The user’s query’s features (F_que_enc).
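My mental model of how those three feature streams might be fed in. The real fusion mechanism in the paper may be more involved (learned projections, cross-attention, etc.); plain concatenation with made-up tensor shapes is just the simplest way to picture it:

```python
import torch

hidden_instruction = "Describe the main interaction happening in this image."  # paraphrase of T_a

# Dummy feature tensors with made-up sequence lengths; last dim = LLM hidden size.
F_img_enc = torch.randn(1, 256, 4096)  # encoded image features
F_ins_enc = torch.randn(1, 16, 4096)   # encoded hidden-instruction features
F_que_enc = torch.randn(1, 12, 4096)   # encoded user-query features

# One plausible fusion: concatenate along the sequence dimension and let the
# MLLM decode R_D (and, guided by the query, R_A) from this combined sequence.
mllm_input = torch.cat([F_img_enc, F_ins_enc, F_que_enc], dim=1)
print(mllm_input.shape)  # torch.Size([1, 284, 4096])
```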