Significant speedup for local models

Static Functions Can Approximate Deep Attention Layers

This repository contains the code accompanying the paper "Static Functions Can Approximate Deep Attention Layers". It demonstrates that small, fixed MLPs can replace trained Transformer attention blocks with minimal accuracy loss while improving runtime speed.
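The core idea can be pictured with a minimal sketch: a small MLP that maps each position's hidden state to a replacement for the attention sublayer's output. This is an illustrative reading, not the repository's code; the module, class, and dimension names below are assumptions.

```python
# Minimal sketch: a small, fixed MLP standing in for a trained attention sublayer.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class StaticApproximator(nn.Module):
    """Small per-position MLP used as a drop-in stand-in for an attention
    sublayer's output (one plausible reading of the paper's setup)."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) hidden states entering the attention block
        return self.net(x)

# In a hybrid model, selected attention sublayers would be swapped for frozen
# approximators (hypothetical attribute names):
# block.attn = StaticApproximator(d_model=384)
# for p in block.attn.parameters():
#     p.requires_grad = False
```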


🧩 Project Overview

The repository provides:

  • A minimal GPT-like baseline model (train_base_model.py)
  • A mechanism for extracting intermediate representations and training static approximator functions (train_approximators.py); a hedged training sketch follows this list
  • A hybrid model that combines trained Transformer layers with frozen approximators (train_end_to_end.py)
  • A benchmark script comparing runtime performance (benchmark.py); a rough timing sketch appears further below
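The extract-and-approximate step referenced above can be sketched as follows: record the inputs and outputs of one attention sublayer with a forward hook, then regress a small MLP onto that mapping. This is not the repository's API; `model`, `attn_module`, and the helper names are assumptions, and the sketch assumes the attention module returns a single tensor.

```python
# Hedged sketch of extracting intermediate representations and fitting an
# approximator to them; names and training details are assumptions.
import torch
import torch.nn as nn

def collect_block_io(attn_module: nn.Module, model: nn.Module, batches):
    """Run the frozen base model and record (input, output) pairs for one attention sublayer."""
    records = []
    def hook(_module, inputs, output):
        records.append((inputs[0].detach().cpu(), output.detach().cpu()))
    handle = attn_module.register_forward_hook(hook)
    with torch.no_grad():
        for x in batches:
            model(x)
    handle.remove()
    return records

def train_approximator(approx: nn.Module, records, epochs: int = 5, lr: float = 1e-3):
    """Regress the approximator onto the recorded attention outputs with an MSE loss."""
    opt = torch.optim.AdamW(approx.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for inp, out in records:
            opt.zero_grad()
            loss = loss_fn(approx(inp), out)
            loss.backward()
            opt.step()
    return approx
```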

All experiments use the Tiny Shakespeare dataset.
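The timing comparison mentioned in the list can be approximated with a simple forward-pass latency measurement. This is a rough sketch in the spirit of benchmark.py, not its actual contents; the modules and shapes in the usage comment are hypothetical.

```python
# Rough timing sketch: compare forward-pass latency of the original attention
# sublayer against the frozen approximator on identical inputs.
import time
import torch

def time_module(module, x, iters: int = 100) -> float:
    """Average forward latency in milliseconds over `iters` runs."""
    with torch.no_grad():
        module(x)                          # warm-up pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

# Example usage (hypothetical modules and shapes):
# x = torch.randn(8, 256, 384)
# print("attention:    ", time_module(original_attn, x), "ms")
# print("approximator: ", time_module(static_approx, x), "ms")
```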
