📊 LLM Evaluation - amy_yunduo · Scour

⚙️ Backend Engineering · wowhow.cloud

Claude Opus 4.8 vs Gemini 3.5 Pro vs GPT-5.6: Developer Model Selection Guide (June 2026)

Discussed on DEV

🧠 LLMs · arXiv

Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

🔄 MLOps · arXiv

You Don't Need to Run Every Eval

🏗️ AI Infra · arXiv

Uncertainty-based Debiasing and Unlearning for Decontamination

🎯 Post-training · arXiv

Weight-Space Geometry of Offline Reasoning Training

🧠 LLMs · arXiv

In LLM Reasoning, there is Irrationality on top of Value Misalignment

🎯 Post-training · arXiv

The Geometry of Sequential Learning: Lie-Bracket Prediction of Transfer Order

🎯 Post-training · arXiv

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

🧠 LLMs · arXiv

Beyond Fixed Budgets: Characterizing the Inelasticity and Limitations of Tree-of-Thought Reasoning Strategies

🎯 Post-training · arXiv

L20-Edu-135M: An Auditable Single-GPU Study of Data-Efficient Small Language Modeling

🧠 LLMs · arXiv

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models
