📊 Model Evaluation - tarokuriyama · Scour

Beyond ATE: Multi-Criteria Design for A/B Testing

arxiv.org·1d

AI dev tool power rankings & comparison [Feb. 2026]

blog.logrocket.com·1d

LLM Optimization: From Research to Production

dev.to·4h

Discuss: DEV

guard0-ai/TrustVector: Independent, evidence-based trust evaluations for 100+ AI models, agents, and tools.

github.com·20h

Discuss: Hacker News

Studying Quality Improvements Recommended via Manual and Automated Code Review

arxiv.org·1d

🔧Functional Programming

You are probably overpaying for intelligence

residuals.bearblog.dev·21h

BalatroBench Benchmarks Large Language Models Playing Balatro

balatrobench.com·1d

Discuss: Hacker News

🔧Functional Programming

jmduke.com·18h

🔧Functional Programming

SWE-rebench Jan 2026: GLM-5, MiniMax M2.5, Qwen3-Coder-Next, Opus 4.6, Codex Performance

swe-rebench.com·1d

Discuss: r/LocalLLaMA

Why 90% of Backtests Lie: Introducing Kiploks Data Quality Guard (DQG)

dev.to·3h

Discuss: DEV

🔧Functional Programming

Website Performance Audits for Kirby CMS

audit.bnomei.com·3h

The case for industrial evals

lesswrong.com·1d

🔗 Better tests, zero drama: smarter LiveIsolatedComponent patterns

yellowduck.be·4h

Joint optimization of maintenance and spare parts management in upstream–downstream systems under quality control

sciencedirect.com·1d

🔧Functional Programming

AI Study Platforms

trendhunter.com·6h

BinaryAudit: Can AI find backdoors in raw machine code?

quesma.com·1d

Discuss: Hacker News

Industrial Automation Platform

autonomylogic.com·3h

MiniMax-AI/MiniMax-M2.5

github.com·5h

Table of Contents - Data Engineering for Large Models: Architecture, Algorithms & Projects

datascale-ai.github.io·14h

Discuss: Lobsters

Beyond the Prompt - Why and How to Fine-tune Your Own Models

devblogs.microsoft.com·3d
