This AI Paper Presents a Direct Experimental Comparison between 8B-Parameter Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer Models Trained on Up to 3.5T Tokens

Transformer-based Large Language Models (LLMs) have emerged as the backbone of Natural Language Processing (NLP), showing remarkable performance across a wide variety of tasks. Their success is driven largely by the self-attention mechanism, which enables effective all-to-all communication between the tokens in a sequence. This mechanism, together with the ability to scale with both model and dataset size, has made Transformers the dominant architecture in NLP research.

However, self-attention layers are not without limitations, especially when working with long sequences. During training, the computational cost of self-attention grows quadratically with sequence length, and at inference time the memory demand grows linearly with the number of previous tokens, since a large key-value cache must be kept to hold that state. Numerous attempts have been made to make self-attention more efficient in response to these difficulties, but these approaches have yet to match the language modeling quality of conventional self-attention.
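To make these scaling trends concrete, the back-of-the-envelope sketch below estimates attention compute and key-value-cache memory as sequence length grows. The hyperparameters (layer count, hidden size, head configuration, fp16 storage) are illustrative assumptions, not figures taken from the paper.

```python
# Rough sketch of why self-attention becomes expensive at long sequence lengths.
# All hyperparameters below are illustrative only, not taken from the paper.

def attention_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    """Approximate FLOPs for the attention matmuls per forward pass:
    roughly 2 * L^2 * d per layer (QK^T plus the weighted sum over values)."""
    return 2 * (seq_len ** 2) * d_model * n_layers

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values during autoregressive inference:
    2 (K and V) * layers * heads * head_dim * tokens * bytes (fp16 here)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

if __name__ == "__main__":
    for L in (4_096, 32_768, 131_072):
        flops = attention_flops(L, d_model=4096, n_layers=32)
        cache_gb = kv_cache_bytes(L, n_layers=32, n_kv_heads=32, head_dim=128) / 1e9
        print(f"L={L:>7}: attention FLOPs ~{flops:.2e}, KV cache ~{cache_gb:.1f} GB")
```

The quadratic term dominates compute as the context lengthens, while the key-value cache grows linearly per generated token, which is exactly the pressure the paper's efficiency arguments are about.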

Selective state-space models (SSMs) such as Mamba address some of these fundamental limitations. Whereas Transformers incur quadratic computational complexity in sequence length and, because of the key-value cache, high memory requirements during inference, SSMs process sequences in linear time and carry only a fixed-size state, making them markedly more efficient. Recent studies have shown that SSMs can match or even outperform Transformers on language modeling tasks, making them a credible alternative.
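For intuition, here is a toy sketch of a diagonal selective state-space recurrence, illustrating why such models run in time linear in the sequence length with a constant-size state. It is a simplification for exposition, not the Mamba or Mamba-2 implementation; the shapes, projections, and discretization choice are assumptions.

```python
# Toy sketch of a (diagonal) selective state-space recurrence. One pass over the
# sequence gives O(L) time with a hidden state whose size does not grow with L.
import numpy as np

def selective_ssm_scan(x, A, B_proj, C_proj):
    """x: (L, d) input sequence; A: (d, n) negative decay rates;
    B_proj, C_proj: (d, n) per-channel projections (hypothetical layout)."""
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))          # fixed-size hidden state, independent of L
    y = np.zeros((L, d))
    for t in range(L):            # single linear scan over the sequence
        # "Selective" part: the discretization step depends on the current input.
        delta = np.log1p(np.exp(x[t]))[:, None]      # softplus, shape (d, 1)
        A_bar = np.exp(delta * A)                    # input-dependent decay
        B_bar = delta * B_proj * x[t][:, None]       # input-dependent write
        h = A_bar * h + B_bar                        # state update
        y[t] = (h * C_proj).sum(axis=-1)             # readout per channel
    return y

L, d, n = 64, 8, 16
rng = np.random.default_rng(0)
y = selective_ssm_scan(rng.normal(size=(L, d)),
                       -np.abs(rng.normal(size=(d, n))),
                       rng.normal(size=(d, n)), rng.normal(size=(d, n)))
print(y.shape)  # (64, 8)
```

At inference time only `h` needs to be kept between tokens, which is why SSMs avoid the growing key-value cache that Transformers require.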

Despite these promising results, previous comparisons between SSMs and Transformers have mostly been small-scale, using models with fewer than 3 billion parameters trained on datasets smaller than 1 trillion tokens. To properly understand how these architectures behave at larger scale, a team of researchers has recently performed a thorough comparison of 8-billion-parameter Mamba, Mamba-2, and Transformer models, all trained on datasets of up to 3.5 trillion tokens.

The team also trained an 8-billion-parameter hybrid model, called Mamba-2-Hybrid, composed of 43% Mamba-2, 7% self-attention, and 50% MLP layers. To determine whether Mamba models could compete with Transformers when given more training resources, the team evaluated them across a wide range of natural language tasks. The results showed that on many tasks, the pure SSM models, Mamba and Mamba-2, either matched or outperformed Transformers.
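To illustrate what such a mix of layer types could look like, the sketch below lays out a hypothetical 56-layer stack with roughly the reported proportions. The layer count and interleaving pattern are assumptions for illustration only; the authors' actual configuration should be taken from the released Megatron-LM code.

```python
# Hypothetical layer layout with roughly the reported mix:
# ~43% Mamba-2, ~7% self-attention, 50% MLP. Illustrative, not the paper's recipe.
from collections import Counter

def hybrid_layer_pattern(n_layers: int = 56, attn_every: int = 7) -> list[str]:
    """Alternate sequence-mixing blocks (Mamba-2 or attention) with MLP blocks,
    swapping in a self-attention block at every `attn_every`-th mixer slot."""
    pattern = []
    for i in range(n_layers // 2):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba2"
        pattern.extend([mixer, "mlp"])
    return pattern

layers = hybrid_layer_pattern()
for kind, count in Counter(layers).items():
    print(f"{kind:>9}: {count:2d} layers ({100 * count / len(layers):.0f}%)")
# mamba2: 24 (43%), mlp: 28 (50%), attention: 4 (7%)
```

Keeping only a small fraction of attention layers preserves most of the SSM efficiency gains while giving the model some exact-recall capacity.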

However, the pure SSM models fell short on tasks that demand strong copying or in-context learning abilities, such as five-shot MMLU and Phonebook Lookup, and on tasks requiring substantial long-context reasoning. In contrast, the 8-billion-parameter Mamba-2-Hybrid model outperformed the 8-billion-parameter Transformer on all 12 evaluated standard tasks, with an average improvement of 2.65 points. During inference, the hybrid model could generate tokens up to eight times faster.

To evaluate long-context capabilities further, the team extended the study to variants of the Mamba-2-Hybrid and Transformer models supporting sequence lengths of 16K, 32K, and 128K tokens. Across 23 additional long-context tasks, the hybrid model continued to perform on par with or better than the Transformer on average. The team has released its code as part of NVIDIA's Megatron-LM project.


Check out the Paper and Code. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.
