📊 Model Evaluation - tarokuriyama · Scour

Beyond ATE: Multi-Criteria Design for A/B Testing

arxiv.org·21h

AI dev tool power rankings & comparison [Feb. 2026]

blog.logrocket.com·9h

guard0-ai/TrustVector: Independent, evidence-based trust evaluations for 100+ AI models, agents, and tools.

github.com·3h·Discuss: Hacker News

BalatroBench Benchmarks Large Language Models Playing Balatro

balatrobench.com·15h·Discuss: Hacker News

🔧Functional Programming

Constructing Industrial-Scale Optimization Modeling Benchmark

arxiv.org·1d

🔧Functional Programming

You are probably overpaying for intelligence

residuals.bearblog.dev·5h

Clean Architecture in .NET 10: Testing What Matters

dev.to·21h·Discuss: DEV

🔧Functional Programming

SWE-rebench Jan 2026: GLM-5, MiniMax M2.5, Qwen3-Coder-Next, Opus 4.6, Codex Performance

swe-rebench.com·8h·Discuss: r/LocalLLaMA

The case for industrial evals

lesswrong.com·1d

Joint optimization of maintenance and spare parts management in upstream–downstream systems under quality control

sciencedirect.com·10h

🔧Functional Programming

Data Engineering for Large Models: Architecture, Algorithms & Projects

github.com·1h

🔧Functional Programming

BinaryAudit: Can AI find backdoors in raw machine code?

quesma.com·11h·Discuss: Hacker News

Quality Assurance in AI Assisted Software Development: Risks and Implications

dev.to·1d·Discuss: DEV

Beyond the Prompt - Why and How to Fine-tune Your Own Models

devblogs.microsoft.com·2d

Reflections on prototyping a sysadmin benchmark

samek.fyi·6h

🔧Functional Programming

My Skill Makes Claude Code GREAT At TDD

aihero.dev·10h

🔧Functional Programming

MiniMaxAI MiniMax-M2.5 has 230b parameters and 10b active parameters

openhands.dev·1d·Discuss: r/LocalLLaMA

5 Days, One GPU Gameboy Swarm

bkase.io·12h·Discuss: Hacker News

Olmix: A framework for data mixing throughout LM development

allenai.org·10h

🔧Functional Programming

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

machinelearning.apple.com·1d

🔧Functional Programming
