# 🚀 Introducing ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI

## Model Highlights
Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. 🧠✨ Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model’s representation power while deepening the semantic alignment between visual and language modalities—unlocking unprecedented capabilities in nuanced visual-textual reasoning. 📊
The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling for exceptional learning efficiency. ⚡ Responding to strong community demand, we’ve significantly strengthened the model’s grounding performance and instruction following, making visual grounding functions more accessible than ever. 🎯 Additionally, our innovative “Thinking with Images” feature, when paired with tools like image zooming and image search, dramatically elevates the model’s ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️
Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what’s possible in visual-language understanding. 🤖🌟
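To give a concrete feel for the dynamic difficulty sampling idea mentioned above, here is a minimal, purely illustrative sketch: in group-based RL schemes such as GSPO, prompts the current policy always solves or always fails contribute almost no learning signal, so sampling concentrates on mid-difficulty prompts. The function names and thresholds below are our own assumptions, not the published training recipe.

```python
import random

def informative(pass_rate: float, lo: float = 0.1, hi: float = 0.9) -> bool:
    # Prompts the policy always fails (rate ~0) or always solves (rate ~1)
    # yield near-zero advantage in group-based RL, so skip them.
    return lo <= pass_rate <= hi

def select_batch(pool, batch_size: int):
    """pool: list of (prompt, recent_pass_rate) pairs.
    Keep only mid-difficulty prompts, then sample a training batch."""
    candidates = [prompt for prompt, rate in pool if informative(rate)]
    return random.sample(candidates, min(batch_size, len(candidates)))

# Example: only the mid-difficulty prompt survives the filter.
pool = [("easy q", 1.0), ("medium q", 0.5), ("impossible q", 0.0)]
print(select_batch(pool, batch_size=2))  # -> ['medium q']
```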
## Key Capabilities
As a lightweight mixture-of-experts model that activates only 3B of its 28B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry’s top flagship models across various benchmarks. 🚀
- Visual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨
- STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡
- Visual Grounding 📍🎨: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost! ⚙️💪
- Thinking with Images 🤔🔍: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information. 🖼️✨
- Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚
- Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟
## Quickstart

### Using transformers Library
Here is an example of how to use the transformers library for inference:
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load the model with remote code enabled (required for this architecture).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Render the chat template, then pack the text and vision inputs together.
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move all tensors to the same device as the model.
device = next(model.parameters()).device
inputs = inputs.to(device)

# **inputs already supplies input_ids, so no separate inputs= argument is needed.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)
# Decode only the newly generated tokens, skipping the prompt.
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
```
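The same pipeline handles video inputs. The sketch below is hedged: it assumes a `video_url` content type symmetrical to the `image_url` schema above, and the URL is a placeholder you should replace with a real, reachable video file.

```python
# Hedged sketch: assumes a "video_url" content type mirroring "image_url" above.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            # Placeholder URL: substitute your own video file.
            {"type": "video_url", "video_url": {"url": "https://example.com/sample.mp4"}},
        ]
    },
]
# From here the flow is identical to the image example:
# apply_chat_template -> process_vision_info -> processor(...) -> model.generate(...)
```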
### vLLM Inference

Install the vLLM main branch:
```bash
pip install uv
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
```
Run the vLLM server:
```bash
# Requires a single 80GB GPU; if an error occurs, add --gpu-memory-utilization 0.95 and retry
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
```
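Once the server is up, vLLM exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch, assuming the `openai` Python package and the default endpoint:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; no real API key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What color clothes is the girl in the picture wearing?"},
            {"type": "image_url", "image_url": {
                "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
            }},
        ],
    }],
)
print(response.choices[0].message.content)
```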
Run vLLM with the reasoning parser and tool-call parser enabled:
```bash
# Requires a single 80GB GPU; if an error occurs, add --gpu-memory-utilization 0.95 and retry
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
    --reasoning-parser ernie45 \
    --tool-call-parser ernie45 \
    --enable-auto-tool-choice
```
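With these parsers enabled, the server separates the model's thinking trace from its final answer and can emit structured tool calls. A hedged client sketch: the `image_search` tool schema below is hypothetical (wire it to a real backend yourself), and `reasoning_content` is the extra field vLLM's reasoning parsers attach to the response message.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for images matching a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{"role": "user", "content": "Find a picture of the Baidu headquarters."}],
    tools=tools,
)
message = response.choices[0].message
# The reasoning parser exposes the thinking trace separately from the answer.
print(getattr(message, "reasoning_content", None))
print(message.tool_calls or message.content)
```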
### FastDeploy Inference
Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.
Note: For single-card deployment, at least 80GB of GPU memory is required.
```bash
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --port 8180 \
    --quantization wint8 \
    --reasoning-parser ernie-45-vl-thinking \
    --tool-call-parser ernie-45-vl-thinking \
    --mm-processor-kwargs '{"image_max_pixels": 12845056}'
```
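FastDeploy likewise serves an OpenAI-compatible API, here on the port 8180 set above. A minimal request sketch, assuming the default `/v1` route:

```python
from openai import OpenAI

# Point the client at the FastDeploy server started above (port 8180).
client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response.choices[0].message.content)
```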
## Finetuning with ERNIEKit
ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.
Usage Examples:
```bash
# Download the model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking

# SFT (LoRA)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml

# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml
```
For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.
## License
The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
## Citation

If you find ERNIE 4.5 useful or wish to use it in your projects, please cite our technical report:
```bibtex
@misc{ernie2025technicalreport,
  title={ERNIE 4.5 Technical Report},
  author={Baidu-ERNIE-Team},
  year={2025},
  primaryClass={cs.CL},
  howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}
```