The economics of speculative decoding
Two underexplored axes: what MoE routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.
Read the original article