Recently, a leak has surfaced regarding the architecture of GPT-4. This leak provides some intriguing details that shed light on the advancements and improvements made in GPT-4. In this article, we will delve into the leaked information and explore the key features of GPT-4's architecture. Here’s an archive link (the original Twitter thread was deleted).
TLDR:
GPT-4 boasts ~1.8 trillion parameters, more than 10 times the size of GPT-3
A mixture-of-experts (MoE) model with 16 experts was used (scroll down for an explanation of MoE)
During inference, each token is routed to only a subset of the experts
As a result, each forward pass uses only ~280B parameters
Training data comprised a staggering 13 trillion tokens
2 epochs for text-based data
4 epochs for code-based data
Training costs reportedly came to roughly $63 million
An intriguing, albeit unconfirmed, theory suggests that a hand-curated dataset of college textbooks was converted into text
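Taken at face value, these numbers mean that only about 280B / 1,800B ≈ 16% of the model's parameters are active on any single forward pass, which is exactly the trade-off the mixture-of-experts design below is built around.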
Unpacking mixture of experts
MoE is not a novel technique in itself. For background, I suggest reading Google's blog posts on GLaM (a trillion-weight model with sparse MoE) and on expert choice routing.
The parameter count of an LLM is a tricky balancing act: larger models tend to perform better, but cost favors smaller ones. In an MoE architecture, individual subsets of the transformer are trained on specific kinds of input, letting each subnetwork specialize in a particular task, such as coding problems (e.g., GPT-4 training on code from GitHub). This not only conserves compute but can also sharpen results. The concept can be likened to a multilingual translator with experts trained on different languages:
To dive deeper, here's how inference could potentially operate:
Each input token passes through an encoder block with self-attention and mixture-of-experts layers
The MoE layer has a gating module that decides which experts contribute during that forward pass (e.g., coding experts are selected for coding problems).
The selected experts receive the representation from the self-attention layer and feed it through their feed-forward expert networks.
Google paints an excellent picture of this basic inference architecture. The routing algorithm selects the suitable expert (as denoted by the 'FFN #' below) and pushes the representation through said network.
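To make this concrete, here is a minimal sketch, in PyTorch, of a sparse MoE feed-forward layer with a top-k gating module. The layer sizes, the choice of top_k, and all class and variable names are illustrative assumptions for this sketch, not details from the leak:

```python
# A minimal sketch of a sparse mixture-of-experts layer with top-k gating.
# Sizes and names below are illustrative, not taken from the GPT-4 leak.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating module: produces a routing score for every expert, per token.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an independent feed-forward network (the "FFN #" boxes in Google's figure).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model), e.g. the output of the self-attention layer
        scores = F.softmax(self.gate(x), dim=-1)           # routing probabilities per token
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

In GPT-4's case, the leak points to 16 such experts, with the gate activating only a small fraction of them per token, which is how roughly 280B of the ~1.8T parameters end up being used on a given forward pass.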
Here are a couple of key points:
Sparse routing employs one or a few experts for each input token
Optimizing the gating network to direct tokens to the most fitting expert(s) is crucial. Previous implementations of sparse routing include: