Hardware Acceleration
Less-relevant results
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
🖥️GPU Programming Content type: Academic
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis
🤖AI Content type: Academic
Modeling, Optimizing and Exploring Multi-Die FPGA Routing Architectures
💾Computer Architecture Content type: Academic
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss.
💬LLMs Content type: Code
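The batching idea in the packed-twin-inference description (weights loaded once per step, several predictions produced together) can be illustrated with a toy sketch. This is not code from that repository: the function names, the single small "layer" standing in for a model, and the read counter are all hypothetical, chosen only to show why one batched step touches the weights once where separate steps touch them once per stream.

```python
# Toy illustration only: one tiny weight matrix stands in for the whole model.
# All names here are hypothetical and not taken from packed-twin-inference.

def decode_step_batched(weights, hidden_states):
    """Run one decode step for several streams, reading each weight row once."""
    outputs = [[0.0] * len(weights) for _ in hidden_states]
    weight_rows_read = 0
    for i, row in enumerate(weights):
        weight_rows_read += 1                  # the row is fetched once...
        for s, h in enumerate(hidden_states):  # ...and reused by every stream
            outputs[s][i] = sum(w * x for w, x in zip(row, h))
    return outputs, weight_rows_read


def decode_step_single(weights, hidden_state):
    """Run one decode step for a single stream (a batch of size one)."""
    outputs, reads = decode_step_batched(weights, [hidden_state])
    return outputs[0], reads


weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
streams = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # 4 streams

# One batched call for all 4 streams.
batched_out, batched_reads = decode_step_batched(weights, streams)

# Four separate calls, one per stream.
separate = [decode_step_single(weights, h) for h in streams]
separate_out = [out for out, _ in separate]
separate_reads = sum(reads for _, reads in separate)

print("same outputs:", batched_out == separate_out)
print("weight rows read (batched):", batched_reads)
print("weight rows read (separate):", separate_reads)
```

The sketch counts weight reads only. It does not model KV cache traffic, quantization, or compute cost, so it says nothing about the actual speedup a real GPU would see.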