🖼️ Multimodal AI - hop1.ng.1357 · Scour

Can Multimodal Large Language Models Truly Understand Small Objects? ✨Gemini

Building with Gemini Embedding 2: Agentic multimodal RAG and beyond ✨Gemini

developers.googleblog.com·11h

SketchVLM: Vision-Language Models Can Annotate Images to Explain Thoughts and Guide Users ✨Gemini

sketchvlm.github.io·2d·Hacker News

A benchmark multimodal oro-dental dataset for large vision-language models ✨Gemini

nature.com·16h

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents ✨Gemini

huggingface.co·2d·Hacker News

Kronk AI: Hugging Face & Vision Model File Formats ✨Gemini

youtube.com·3d

Nvidia combines speech, vision, and text in new AI model ✨Gemini

techzine.eu·19h

Zorq AI – Multimodal workspace for video, image, and voice generation ✨Gemini

zorq-ai.io·1d·Hacker News

Mm – Unix tools (find/cat/grep) rebuilt for the multimodal era 🕷️Web Crawling

vlm.run·18h·Hacker News

snorcack/CharacterGeneration: A project to generate character portraits from book text using RAG based book search and image generation using local diffusion models. 🎖Text Quality Models

github.com·2d·r/StableDiffusion

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills ✨Gemini

Qwen3.6–35B-A3B: The Most Practical Open-Source AI Model Yet? ⚡Edge AI

·19h

Xiaomi releases open-weight MiMo-V2.5 AI model, claims "frontier-level agentic capability" 🇨🇳Chinese AI

gsmarena.com·2d

Multimodal data integration in orthopedic regenerative medicine: bridging imaging, omics, and clinical data ✨Gemini

frontiersin.org·22h

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment 🎖Text Quality Models

gdm-tipsv2.github.io·6d·Hacker News

Microsoft World-R1 for 3D-Consistent Video Generation (4 minute read) ✨Gemini

microsoft.github.io·1d

‘The whale can now see’: DeepSeek adds AI vision in major move 🇨🇳Chinese AI

·1d·r/SCMPauto

AI State of the Union | Human-Centered Change and Innovation 🎭Claude

bradenkelley.com·4d

Delineating Knowledge Boundaries for Honest Large Vision-Language Models ✨Gemini

typomonster/parlor-jarvis: On-device, real-time multimodal AI. Multilingual voice + vision (en/ko/es/pt/fr) with camera, screen, PDF, and video — runs entirely locally. ✨Gemini

github.com·4d·Hacker News
