AI Agent Benchmark Compendium
philschmid.de·2d
Performance Mythology
Classic Demo Effects, Fire
4rknova.com·2d
📟Terminal Physics
Cracking the CRISPR code to find the 'passwords' that unlock its full potential
phys.org·2d
🧬Palindrome Codes
HiCoTraj: Zero-Shot Demographic Reasoning via Hierarchical Chain-of-Thought Prompting from Trajectory
arxiv.org·2d
🌍Cultural Algorithms
Scheming Ability in LLM-to-LLM Strategic Interactions
arxiv.org·1d
🔲Cellular Automata
Benchmarking Deep Learning Models for Laryngeal Cancer Staging Using the LaryngealCT Dataset
arxiv.org·3d
🎵Audio ML
Krish Naik: OpenAI Agentkit Is Lit- Automate Workflows With Ease
dev.to·2h
Proof Automation
Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
arxiv.org·2d
Formal Methods
Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
arxiv.org·4d
🎙️Whisper
Automated Retrospective Analysis & Predictive Maintenance of Polymer Degradation Using Multi-Modal Deep Learning
dev.to·3d
🔍Vector Forensics
Protein as a Second Language for LLMs
arxiv.org·3d
🔢Denotational Semantics
Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations
arxiv.org·3d
📊Learned Metrics
Generative Latent Video Compression
arxiv.org·3d
🧠Learned Codecs
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
arxiv.org·3d
📐Projective Geometry
Skill-Targeted Adaptive Training
arxiv.org·3d
📊Learned Metrics
Reasoning Pattern Matters: Learning to Reason without Human Rationales
arxiv.org·2d
🧮Theorem Proving
Tech With Tim: Why 1M People Tried This AI Coding Tool (Full Vibe Coding Tutorial)
dev.to·1d
🌀Brotli Internals
Krish Naik: OpenAI Agentkit Is Lit- Automate Workflows With Ease
dev.to·14h
Proof Automation
Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues
arxiv.org·2h
🤖Grammar Induction
MTMD: A Multi-Task Multi-Domain Framework for Unified Ad Lightweight Ranking at Pinterest
arxiv.org·3d
🔍BitFunnel