[2401.04088] Mixtral of Experts

Covered by 3 sources including vettedconsumer.com, DEV Community

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, b...
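The routing step described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the expert here is a hypothetical single linear map with ReLU standing in for a real feedforward block, the weights are random, the dimension is tiny, and the gate weights are assumed to come from a softmax over the two selected router scores. Only the counts (8 experts, 2 selected per token) are taken from the abstract.

```python
import math
import random

NUM_EXPERTS = 8   # from the abstract: 8 feedforward blocks per layer
TOP_K = 2         # from the abstract: the router selects two experts per token


def make_expert(dim, rng):
    """Hypothetical stand-in for a feedforward block: one linear map plus ReLU."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return expert


def route(x, router_weights, k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: softmax over only the selected logits, so the gates sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_weights):
    """Run only the selected experts and sum their gate-weighted outputs."""
    out = [0.0] * len(x)
    for idx, gate in route(x, router_weights):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


rng = random.Random(0)
DIM = 4  # toy hidden size, chosen only for illustration
experts = [make_expert(DIM, rng) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token_state = [0.5, -1.0, 0.25, 2.0]
selected = route(token_state, router_weights)
output = moe_layer(token_state, experts, router_weights)
```

Calling `route` on a different `token_state` may return a different pair of experts, which is the point the abstract makes: each token runs through only two experts, but over many tokens all eight can be used.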

Covered in 3 articles

Mixture-of-Experts (MoE), Explained: Why “Active Parameters” Decide What Runs on Your Machine

Gemma 4 dense by default: why your local agent doesn't want the MoE

How LLMs Actually Work: A Friendly Map for Humans • oreoro