
Is there a good primer for understanding mixture-of-experts and how it differs from the original Transformer architecture?
