Model optimizations in LLMs

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

 📊AI Performance Profiling  Content type: Academic
arxiv.org·

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

 🧠Large Language Models (LLMs)  Content type: News  Content type: Blog
developer.nvidia.com·

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

 🔧Systems-level optimizations for LLM serving  Content type: Academic
arxiv.org·

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

ASH: Asymmetric Scalar Hashing With Learned Dimensionality Reduction for High-Fidelity Vector Quantization

 🔍Retrieval-augmented generation  Content type: Academic
arxiv.org·

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization

 🔍Retrieval-augmented generation  Content type: Academic
arxiv.org·

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

FQA: A Full-Space Quantization-Driven Architecture for Hardware-Efficient Piecewise Approximation of Nonlinear Activation Functions

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

 🔧Systems-level optimizations for LLM serving  Content type: Academic
arxiv.org·

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

 📊AI Performance Profiling  Content type: Academic
arxiv.org·
