Quantization, Attention Mechanisms, Batch Processing, KV Caching
Shrinking LLMs With Self-Compression
semiengineering.com·7h
Economics of Claude 3 Inference
lesswrong.com·20h
A Conversation with Val Bercovici about Disaggregated Prefill / Decode
fabricatedknowledge.com·18h
Using a Framework Desktop for local AI
frame.work·19h
AI cloud infrastructure gets faster and greener: NPU core improves inference performance by over 60%
techxplore.com·16h
Efficient MultiModal Data Pipeline
huggingface.co·14h
How I use LLMs to learn new subjects
seangoedecke.com·14h
Detailed Study of Performance Modeling For LLM Implementations At Scale (imec)
semiengineering.com·18h
Energy-Based Transformers are Scalable Learners and Thinkers
lesswrong.com·24m