High-performance LLM inference
modal.com

This high-level guide documents the key techniques used to achieve high performance when running LLM inference on Modal.

Open-weights models and open-source inference engines have closed much of the gap with their proprietary counterparts, and they continue to improve as they attract work from a broad community. It is now economical, and will only become more so, to run many generative AI applications in-house rather than relying on external providers.

Achieving competitive performance and cost is not instantaneous, however; it takes some thought and tuning. And LLM inference is in many ways quite different from the web serving and database workloads that engineers are used to deploying and optimizing.

This guide collects techniques we have seen work in production inference deployments.
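
As a minimal sketch of what such an in-house deployment looks like, the snippet below serves an open-weights model with vLLM on Modal. The model name, GPU type, and per-call model load are illustrative assumptions, not code from the guide; a tuned deployment would load the model once per container and keep it warm.

```python
import modal

# Container image with vLLM installed.
image = modal.Image.debian_slim(python_version="3.12").pip_install("vllm")

app = modal.App("llm-inference-sketch", image=image)

@app.function(gpu="H100", timeout=600)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    # Loading the model inside the call keeps the sketch short; a production
    # deployment would load once per container (e.g., via a class with a
    # startup hook) so the weights are not reloaded on every request.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # hypothetical model choice
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text

@app.local_entrypoint()
def main():
    print(generate.remote("Explain KV caching in one paragraph."))
```

Running this file with `modal run` executes `generate` on a rented GPU. Bringing inference in-house is only a few dozen lines; making it fast and cheap is what the rest of the guide is about.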
