Read a white paper from DeepMind on sparse MoE models and it clicked

I was skimming a DeepMind paper from last month about sparse mixture-of-experts models and something finally made sense. They showed a 1.2-trillion-parameter model that uses only about 10% of its parameters per token. It changed how I think about why these big models don't just grind to a halt. Has anyone else had a lightbulb moment reading research papers like this?

2 comments


grant.margaret 5d ago

You ever notice how routing in MoE works like a smart thermostat? It only fires up the zones that need heating, not the whole house. The model learns which "expert" handles nouns or verbs or logic, and sends the token to the right one. That 90% idle is just sitting there ready for the next token, like spare capacity in an HVAC system waiting for a heat wave. Made me realize half the battle is just keeping the right parts active. Saves compute like crazy.

blakem82 5d ago

Respectfully, I gotta push back on that thermostat comparison. MoE routers don't really "choose" the right expert the way a thermostat picks a zone; they just guess based on patterns and sometimes totally misfire. And that 90% idle isn't really spare capacity waiting for load; it's just wasted memory sitting there doing nothing until the router happens to need it.
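
For anyone who wants to poke at the routing idea, here is a rough sketch of top-k gating in plain Python. It is not taken from the paper: the names `route_token`, `router_weights`, `NUM_EXPERTS`, and `DIM` are made up for illustration, the weights are random rather than learned, and real models also have always-active parts (attention, embeddings) that this toy ignores. With 10 experts and k=1 it just happens to land on the "about 10% active" figure from the post.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token_vec, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    Hypothetical helper: the router is a single linear layer, one row per expert.
    """
    # One score per expert: dot product of the token with that expert's row.
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(scores)
    # Keep only the k most probable experts; the others do no compute for
    # this token, but their weights still have to sit in memory.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the gate weights over the chosen experts so they sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


random.seed(0)
NUM_EXPERTS, DIM = 10, 4  # toy sizes, chosen only for the example
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token_vec = [random.gauss(0, 1) for _ in range(DIM)]

chosen = route_token(token_vec, router_weights, k=1)
active_fraction = len(chosen) / NUM_EXPERTS
print(chosen, active_fraction)
```

The router here is only scoring and picking, so a badly trained `router_weights` would happily send tokens to a poor expert, which is the "misfire" case. And the unchosen experts cost nothing in compute for that token but are still fully loaded, which is where the memory complaint comes from.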