Multimodal AI

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

 👁️Computer Vision  Content type: Academic
arxiv.org

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

 🛡️AI Safety  Content type: Academic
arxiv.org

Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

 👁️Computer Vision  Content type: Academic
arxiv.org

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

 🧠LLM Research  Content type: Academic
arxiv.org

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

 🧠LLM Research  Content type: Academic
arxiv.org

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

 🛡️AI Safety  Content type: Academic
arxiv.org

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

 🤖Robotics  Content type: Academic
arxiv.org

Harnessing Streaming Video in the Wild

 🤖AI Engineering  Content type: Academic
arxiv.org

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

 🧠LLM Research  Content type: Academic
arxiv.org

A Dataset for Dynamic Human Preferences for Vision Language Models

 🧠LLM Research  Content type: Academic
arxiv.org

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

 👁️Computer Vision  Content type: Academic
arxiv.org

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

 👁️Computer Vision  Content type: Academic
arxiv.org

UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

 👁️Computer Vision  Content type: Academic
arxiv.org

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

 🤖AI Engineering  Content type: Academic
arxiv.org

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

 🧠LLM Research  Content type: Academic
arxiv.org

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

 👁️Computer Vision  Content type: Academic
arxiv.org

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

 🎯Reinforcement Learning  Content type: Academic
arxiv.org

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

 🧠LLM Research  Content type: Academic
arxiv.org

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

 🧠LLM Research  Content type: Academic
arxiv.org

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

 👁️Computer Vision  Content type: Academic
arxiv.org