VLMs

Personal AI Agent for Camera Roll VQA

 🕵️AI Agents  Content type: Academic
arxiv.org·
Less-relevant results

World Model Self-Distillation: Training World Models to Solve General Tasks

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

 🎭Multimodal AI  Content type: Academic
arxiv.org·

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

 🎭Multimodal AI  Content type: Academic
arxiv.org·

GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

 🔧Tool Use  Content type: Academic
arxiv.org·

SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

 🎭Multimodal AI  Content type: Academic
arxiv.org·

VQA for Dynamic Portfolio Optimization: Sampling Strategies, Optimizer Scheduling, and Hardware-Aware Ansatz Design

 Quantization  Content type: Academic
arxiv.org·

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

 🎭Multimodal AI  Content type: Academic
arxiv.org·

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

 🎭Multimodal AI  Content type: Academic
arxiv.org·

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

 💡AI Reasoning  Content type: Academic
arxiv.org·

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

 🎭Multimodal AI  Content type: Academic
arxiv.org·

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

 🎭Multimodal AI  Content type: Academic
arxiv.org·

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

 🎭Multimodal AI  Content type: Academic
arxiv.org·

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

 🖥️Inference Compute  Content type: Academic
arxiv.org·

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

 🧠LLMs  Content type: Academic
arxiv.org·

UNIVID: Unified Vision-Language Model for Video Moderation

 🎭Multimodal AI  Content type: Academic
arxiv.org·
