📊 LLM Evaluation - gilesr · Scour

When LLMs get significantly worse: A statistical approach to detect model degradations

arxiv.org·2d

PELLI: Framework to effectively integrate LLMs for quality software generation

arxiv.org·2d

How Today’s AI Models Are Leaving Enterprises in the Dark

modernghana.com·21h

Leetcode for ML

pixelbank.dev·4h

Here’s Our First Gemini Deep Think LLM-Assisted Hardware Design

blog.adafruit.com·4h

Ask HN: What explains the recent surge in LLM coding capabilities?

news.ycombinator.com·3h·

Discuss: Hacker News

Data Engineering for Large Models: Architecture, Algorithms & Projects

github.com·1d·

Discuss: Hacker News

Challenges of revision control in the LLM era

gist.github.com·9h·

Discuss: Hacker News

Reflections on prototyping a sysadmin benchmark

samek.fyi·1d

AgentRE-Bench: Can LLM Agents Reverse Engineer Malware?

agentre-bench.ai·1d·

Discuss: Hacker News

Comprehensive Code Review

agenticoding.ai·5h

Securing LLM Applications: Using LLM-as-a-Judge to Block Prompt Injection Attacks

infosecwriteups.com·1d

Are Multi-Agent LLM Workflows Quietly Amplifying Mistakes?

medium.com·1d·

Discuss: DEV

Study: Platforms that rank the latest LLMs can be unreliable

digitalinformationworld.com·3d

A New LLM System for Synthesis Planning

science.org·1d

LLMs will either be the best or worst thing to happen to software engineering. They will free us from whittling programs by hand. But will we use that freedom t...

bsky.app·1d·

Discuss: Bluesky

The role of large language models in emergency care: a comprehensive benchmarking study

nature.com·1d

Building an ARC-2 Solver — From Socratic Panels to a Single Oracle

pub.towardsai.net·1d

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

machinelearning.apple.com·2d

🤖AI Agents Weekly: GPT-5.3-Codex-Spark, GLM-5, MiniMax M2.5, Recursive Language Models, Harness Engineering, Agentica, and More

nlp.elvissaravia.com·12h
