NoiseFormer -- Noise Diffused Symmetric Attention Transformer
arXiv:2601.11619v1 Announce Type: cross Abstract: The Transformer architecture has been a very successful long runner in the fields of Deep Learning (DL) and Large Language Models (LLMs) because of its powerful attention-based learning and parallel-natured architecture. As models grow gigantic in terms of memory footprint, the difficulty of fitting a model on a single device such as a GPU or an AI accelerator gives rise to the need for multiple computing devices, thereby escalating the computing cost. This ...