🧠 LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 20,690 posts in 438.6 ms
Understanding LLM Inference Engines: Inside Nano-vLLM (Part 2) · neutree.ai · 1d · Discuss: Hacker News · 🏗️ LLM Infrastructure
Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers · arxiv.org · 1d · 🏗️ LLM Infrastructure
Mechanistic Interpretability: Peeking Inside an LLM · towardsdatascience.com · 2d · 🏗️ LLM Infrastructure
Optimized LLM Inference Engines · rishirajacharya.com · 3d · 🏗️ LLM Infrastructure
Finding the needle in the logstack: Reducing LLM context with TF-IDF · eliseomartelli.it · 1d · 🏗️ LLM Infrastructure
Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference · arxiv.org · 2d · 🏗️ LLM Infrastructure
AI attention span so good it shouldn’t be legal · stackoverflow.blog · 1d · 🏗️ LLM Infrastructure
Sequential Attention: Making AI models leaner and faster without sacrificing accuracy · research.google · 3d · Discuss: Hacker News, r/LocalLLaMA · 📦 Batch Embeddings
Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning · lesswrong.com · 2h · 🛡️ AI Security
ML-LIB: Machine Learning Library Proposed For The Linux Kernel · phoronix.com · 20h · Discuss: Hacker News · 🏗️ LLM Infrastructure
Continual learning and the post-monolith AI era · baseten.co · 17h · Discuss: Hacker News · 🏗️ LLM Infrastructure
Crafting the Eyes for Thinking Machines: Rewiring the Retina - The Anatomy of ViTStruct · pub.towardsai.net · 13h · 📦 Batch Embeddings
How we cut Vertex AI latency by 35% with GKE Inference Gateway · cloud.google.com · 22h · 🧠 Inference Serving
A Ralph Loop for Reading: Beating GPT 5.2 with a 4k Context Window (and 4 GPUs) · stevehanov.ca · 1d · 🏗️ LLM Infrastructure
New AI system pushes the time limits of generative video · techxplore.com · 1d · ✨ Gemini
How I squeezed a BERT sentiment analyzer into 1GB RAM on a $5 VPS · mohammedeabdelaziz.github.io · 44m · Discuss: Hacker News · 🏗️ LLM Infrastructure
Building Highly Efficient Inference System for Recommenders Using PyTorch · pytorch.org · 1d · Discuss: Hacker News · 🕯️ Candle
A Neuro-Symbolic Architecture for Induced Epistemic Agency and System 2 Reasoning in Quantized Large Language Models · papers.ssrn.com · 1d · Discuss: Hacker News · 🎭 Claude
How Transformers Architecture Powers Modern LLMs · blog.bytebytego.com · 4d · 🔤 Tokenization
Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation · infoworld.com · 2d · 🏆 LLM Benchmarking