New Mixture-of-Experts (MoE) Models. MS Phi-2 2.7B Small Model. StripedHyena 7B Models. DeepMind Imagen 2. Diffusion Models + XGBoost. promptbase. Automated Continual Learning. CogAgent V-L Model.
Is there a good primer for understanding mixture-of-experts and how it differs from the original Transformer architecture?
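As a quick orientation (not a substitute for a full primer): in the original Transformer, every token passes through the same dense feed-forward network inside each block, whereas an MoE layer keeps several such feed-forward "experts" and a small learned router sends each token to only a few of them. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the names (`Expert`, `MoELayer`, `top_k`, etc.) are illustrative and not taken from any particular model or library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert: the same two-layer FFN used in a standard Transformer block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Replaces the single dense FFN with several experts plus a learned router."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                     # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)

# Usage: a drop-in replacement for the FFN sub-layer. Only top_k experts run per
# token, so total parameters grow with num_experts while per-token compute stays
# roughly that of a single dense FFN.
layer = MoELayer(d_model=64, d_ff=256)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

The attention sub-layers are unchanged; the MoE swap happens only in the feed-forward position, which is why MoE models can have far more parameters than a dense Transformer at similar inference cost.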