📊 Model Evaluation - tarokuriyama · Scour

Constructing Industrial-Scale Optimization Modeling Benchmark

arxiv.org·2d

🔧Functional Programming

Why 90% of Backtests Lie: Introducing Kiploks Data Quality Guard (DQG)

dev.to·5h·

Discuss: DEV

🔧Functional Programming

Beyond the Prompt - Why and How to Fine-tune Your Own Models

devblogs.microsoft.com·3d

Reflections on prototyping a sysadmin benchmark

samek.fyi·1d

🔧Functional Programming

Comprehensively Strengthening a Team's Testing Skills to Drive Shift Left / Shifting Left with Software Testing Improvements

speakerdeck.com·14h

Website Performance Audits for Kirby CMS

audit.bnomei.com·5h

Leaning Into the Coding Interview: Lean 4 vs Dafny cage-match

ntaylor.ca·1h·

Discuss: Lobsters, Hacker News

🔧Functional Programming

AI’s Quantum Knowledge Tested: Models Fail 77% Of Core Concept Questions

quantumzeitgeist.com·1d

MiniMaxAI MiniMax-M2.5 has 230b parameters and 10b active parameters

openhands.dev·1d·

Discuss: r/LocalLLaMA

.NET Checker 1.5

majorgeeks.com·20h

🔧Functional Programming

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

arxiv.org·1d

Running an experiment with Claude Code overnight

blog.nolank.ca·3h

Show HN: We built AI to help bidding teams – mentioned by Kunal Bahl on ET Now

contravault.com·4h·

Discuss: Hacker News

snapllm/snapllm: 🔥 🔥 Alternative to Ollama 🔥 🔥 multi-model <1ms LLM switching

github.com·2h·

Discuss: Hacker News

Olmix: A framework for data mixing throughout LM development

allenai.org·1d

🔧Functional Programming

5 Days, One GPU Gameboy Swarm

bkase.io·1d·

Discuss: Hacker News

Design Decision: Technical Debt in BillaBear

iain.rocks·1d·

Discuss: Hacker News, r/programming

🔧Functional Programming

My Skill Makes Claude Code GREAT At TDD

aihero.dev·1d

🔧Functional Programming

CCBench: How do agents perform on codebases that aren't part of training data?

ccbench.org·1d·

Discuss: Hacker News

Conductor Update: Introducing Automated Reviews

developers.googleblog.com·1d·

Discuss: Hacker News
