Overview of the many research areas in AI inference optimization. The goal is to speed up running an AI model, a step known as inference, so that users see faster response times and model owners cut the GPU and other resource costs of serving models online. Well-known optimization techniques include quantization and pruning, but there are many others.
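To make one of the named techniques concrete, here is a minimal sketch of post-training symmetric int8 quantization: weights are mapped from 32-bit floats to 8-bit integers plus a single scale factor, shrinking memory and bandwidth at a small accuracy cost. The function names and the NumPy-based approach are illustrative, not taken from any particular framework.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric quantization: map float weights onto the int8 range [-127, 127].

    Returns the quantized tensor and the scale needed to recover
    approximate float values later.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Toy weight vector standing in for a model layer.
w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

The quantized tensor uses a quarter of the memory of the float32 original, and the round-trip error is bounded by half a quantization step (scale / 2) per element.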