It’s likely that you’ve used a vision model to generate an image recently, but ended up with somewhat questionable results. You might have blamed this on the model not working correctly (and maybe that’s true), but it could also be because you didn’t give it the proper instructions.
A vision model will only create what it’s asked to, and how you ask matters. Prompting isn’t just about describing what you see; it’s about guiding the model so it interprets your request correctly. Sometimes a single well-chosen word can dramatically change the result.
In this blog, we’ll cover the key principles for prompting your vision models more effectively, from good practices to the nuances of different use cases. Whether you’re a developer, designer, marketer, or beginner, this guide will help you achieve the results you’re looking for.
Where to Test Your Prompts
Before diving into how vision prompting works, let’s first look at where we can put it to the test. In this case, we’ll be using several endpoints available on Replicate, which we’ve optimized with Pruna to make them cheaper, faster, and more efficient. All of Pruna’s models are available here.
Prompting Good Practices
While there are nuances that can be applied to each use case, there are also several key principles that should always be kept in mind when prompting a model:
- Give direction: State the goal, task, context, or desired style.
- Be clear: Use precise, unambiguous language. You don’t need to describe every detail, just select the key words that matter most.
- Split the work: If the goal is complex, break the prompt down into several chained steps.
- Provide examples: If possible, include an example and reference it in your prompt.
- Tune your prompts: Always review the output and refine your prompts based on the results to get better responses. Using a grid can be helpful.
- Know your model: Review the model’s documentation or description. Some models support tags, parameters, or specific input formats that can significantly improve performance.
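The "tune your prompts" tip above mentions that a grid can be helpful. A minimal sketch of that idea: hold a base prompt fixed, vary one or two elements, and compare the outputs side by side. The base prompt and variant lists here are illustrative placeholders.

```python
from itertools import product

# Base prompt plus a few style and lighting variants to compare.
base = "a lighthouse on a cliff"
styles = ["oil painting", "pixel art"]
lightings = ["golden hour", "neon glow"]

# Build every combination, one prompt per grid cell.
grid = [f"{base}, {style}, {light}" for style, light in product(styles, lightings)]
for prompt in grid:
    print(prompt)
```

Generating one image per grid cell makes it easy to see which element is actually driving the changes in the output.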
Prompting in Practice
From Words to Pictures
For image generation, you can craft the perfect prompt following a default structure: Subject + Subject’s Action + Style + Context.
- Subject: Who or what is the focus of your image? It should be the main element (person, object, animal, or scene).
- Subject’s Action: What’s the subject doing? It should describe the action or how the subject interacts with the environment.
- Style: How is the image presented? It should specify the artistic direction or medium.
- Context: How and where is it happening? It should include the background, lighting, atmosphere, mood, point of view, or colors.
When writing the prompt, make sure each element is descriptive and focused only on the specific element you want to generate, avoiding contradictions. If the prompt is abstract or vague, it can lead to unpredictable results. For example, a prompt like “The best thing you can draw” is too ambiguous and might not produce anything appealing or coherent. Similarly, simply copying and pasting random text from the internet won’t work well — the model will struggle to extract a clear meaning or visual direction from it.
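The four-part structure above can be sketched as a small helper that joins the elements into one descriptive prompt. The function name and the example values are illustrative, not part of any model’s API.

```python
def build_image_prompt(subject: str, action: str, style: str, context: str) -> str:
    """Join the four elements into a single descriptive prompt string."""
    return f"{subject} {action}, {style}, {context}"

# Example: each argument maps to one element of the structure.
prompt = build_image_prompt(
    subject="a red fox",
    action="leaping over a frozen stream",
    style="watercolor illustration",
    context="soft morning light, misty pine forest",
)
print(prompt)
```

Keeping the elements as separate arguments makes it easy to swap one out (say, the style) while leaving the rest of the prompt untouched.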
From Text or Image to Video
For video generation, we can use a structure similar to the one for image generation. However, a few extra elements should be considered: Subject + Subject’s Action + Environment + Shot Type + Style + Context.
- Subject: Who or what is the main focus of your video? It should be the main element of your scene (person, object, animal).
- Subject’s Action: What’s the subject doing? It should describe the action or how the subject interacts with the environment.
- Environment: Where is it happening? It should include the scene details surrounding the subject.
- Shot Type: What’s the camera’s perspective or movement? It should describe the angle, trajectory, movement, and speed of the camera.
- Style: How is the video presented? It should specify the artistic direction or medium.
- Context: How is it happening? It should include the background, lighting, atmosphere, mood, point of view, or colors.
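Since the video structure has six elements, a small dataclass keeps them organized and renders them in the order described above. The class and field names are illustrative assumptions, not a real video-model API.

```python
from dataclasses import dataclass

@dataclass
class VideoPrompt:
    subject: str
    action: str
    environment: str
    shot_type: str
    style: str
    context: str

    def render(self) -> str:
        # Order follows the structure above: subject and action first,
        # then environment, camera work, style, and context.
        return (f"{self.subject} {self.action} in {self.environment}, "
                f"{self.shot_type}, {self.style}, {self.context}")

clip = VideoPrompt(
    subject="a sailboat",
    action="gliding across calm water",
    environment="a fjord at dawn",
    shot_type="slow aerial drone shot pulling back",
    style="cinematic, shallow depth of field",
    context="pastel sky, gentle mist, serene mood",
)
print(clip.render())
```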
Editing Images
For image editing, we should introduce a new prompt structure: Task + Target + Edit Type + Preservation.
- Task: What do you want to accomplish? It should define the main goal of the edit.
- Target: What specific element should be edited? It should identify the subject or area to modify.
- Edit Type: How should the change be applied? It should describe the method, intensity, or style of the edit.
- Preservation: What should remain unchanged? It should specify which parts of the image must stay the same.
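The editing structure can be sketched the same way: a helper that turns the four elements into an explicit instruction, including the preservation clause that editing prompts often omit. The function name and example values are illustrative.

```python
def build_edit_prompt(task: str, target: str, edit_type: str, preservation: str) -> str:
    """Compose an editing instruction from the four elements."""
    return (f"{task}: {target}. Apply {edit_type}. "
            f"Keep {preservation} unchanged.")

print(build_edit_prompt(
    task="Change the color",
    target="the car in the foreground",
    edit_type="a subtle matte red finish",
    preservation="the background, lighting, and reflections",
))
```

Stating the preservation clause explicitly helps the model avoid drifting away from the parts of the image you didn’t ask it to touch.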
More Considerations
On one hand, even though most vision models have recently improved — with greater care taken in training data and design — different biases can persist. That’s why, when prompting, it’s important not to reinforce them. You can mitigate this by evaluating the outputs to ensure diversity and representation, and by providing more context and detail.
On the other hand, prompting in vision models raises a range of ethical questions that go beyond bias. Therefore, it’s essential to consider factors such as consent, authorship, data protection, and manipulation when using them.
What’s Next
In conclusion, this blog post gives you a structured, straightforward guide to prompting vision models, so you can generate an image or video, or edit an existing one to suit your needs.
Enjoy the Quality and Efficiency!
Want to take it further?
- Compress your own models with Pruna and give us a ⭐ to show your support!
- Stay up to date with the latest AI efficiency research on our blog, explore our materials collection, or dive into our courses.
- Join the conversation and stay updated in our Discord community.