‘MPT-30B’ LLM – Chatbot-like model for dialogue generation

MPT-30B-Chat is a chatbot-like model for dialogue generation. It was built by fine-tuning MPT-30B on the ShareGPT-Vicuna, Camel-AI, GPTeacher, Guanaco, and Baize datasets, plus some generated datasets. License: CC-By-NC-SA-4.0 (non-commercial use only). A demo is available on Hugging Face Spaces. The model was trained by MosaicML and follows a modified decoder-only transformer architecture.

Since the launch of MPT-7B in May, the ML community has eagerly embraced open-source MosaicML Foundation Series models. The MPT-7B base, -Instruct, -Chat, and -StoryWriter models have collectively been downloaded over 3M times!

We’ve been overwhelmed by what the community has built with MPT-7B. To highlight a few: LLaVA-MPT adds vision understanding to MPT, GGML optimizes MPT on Apple Silicon and CPUs, and GPT4All lets you run a GPT4-like chatbot on your laptop using MPT as a backend model.

We are now expanding the MosaicML Foundation Series with MPT-30B, a new, open-source model licensed for commercial use that is significantly more powerful than MPT-7B and outperforms the original GPT-3. In addition, we are releasing two fine-tuned variants, MPT-30B-Instruct and MPT-30B-Chat, that are built on top of MPT-30B and excel at single-turn instruction following and multi-turn conversations, respectively.

All MPT-30B models come with special features that differentiate them from other LLMs, including an 8k-token context window at training time, support for even longer contexts via ALiBi, and efficient inference and training performance via FlashAttention. The MPT-30B family also has strong coding abilities thanks to its pre-training data mixture. MPT-30B was extended to an 8k context window on NVIDIA H100 GPUs, making it (to the best of our knowledge) the first LLM trained on H100 GPUs, which are now available to MosaicML customers!

The size of MPT-30B was also specifically chosen to make it easy to deploy on a single GPU: either 1x NVIDIA A100-80GB in 16-bit precision or 1x NVIDIA A100-40GB in 8-bit precision. Other comparable LLMs such as Falcon-40B have larger parameter counts and cannot currently be served on a single datacenter GPU; they need 2+ GPUs, which raises the minimum inference system cost.
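The single-GPU claim can be checked with back-of-the-envelope arithmetic. The sketch below is only a weight-memory estimate: it assumes round parameter counts of 30B and 40B, counts decimal gigabytes, and ignores activations, the KV cache, and framework overhead, all of which need extra headroom in practice. The function names are illustrative.

```python
# Weight-only memory estimate. Assumptions (not from the post): round
# parameter counts, decimal GB, and no allowance for activations or KV cache.
GB = 1e9


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / GB


def fits_on_gpu(n_params: float, bytes_per_param: float, gpu_memory_gb: float) -> bool:
    """True only if the weights leave at least some memory free on the GPU."""
    return weight_memory_gb(n_params, bytes_per_param) < gpu_memory_gb


MODELS = {"MPT-30B": 30e9, "Falcon-40B": 40e9}  # nominal parameter counts
SETUPS = [("16-bit on A100-80GB", 2, 80), ("8-bit on A100-40GB", 1, 40)]

for name, n_params in MODELS.items():
    for label, bytes_per_param, gpu_gb in SETUPS:
        needed = weight_memory_gb(n_params, bytes_per_param)
        verdict = "fits" if fits_on_gpu(n_params, bytes_per_param, gpu_gb) else "does not fit"
        print(f"{name}, {label}: {needed:.0f} GB of weights, {verdict}")
```

Under these assumptions MPT-30B needs about 60 GB of weights in 16-bit and 30 GB in 8-bit, leaving room on an 80 GB or 40 GB card, while a 40B-parameter model already fills the card with weights alone.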

If you want to start using MPT-30B in production, there are several ways to customize and deploy it using the MosaicML Platform.

MosaicML Training. Customize MPT-30B using your private data via fine-tuning, domain-specific pre-training, or training from scratch. You always own the final model weights, and your data is never stored on our platform. Pricing is per GPU-minute.
MosaicML Inference. Talk to our hosted endpoints for MPT-30B-Instruct (and MPT-7B-Instruct) using our Python API, with standard pricing per 1K tokens.

We are so excited to see what our community and customers build next with MPT-30B. To learn more about the models and how you can customize them using the MosaicML platform, read on!

MPT-30B Family
Mosaic Pretrained Transformer (MPT) models are GPT-style decoder-only transformers with several improvements including higher speed, greater stability, and longer context lengths. Thanks to these improvements, customers can train MPT models efficiently (40-60% model FLOPs utilization, or MFU) without diverging from loss spikes, and can serve them with both standard HuggingFace pipelines and FasterTransformer.

MPT-30B (Base)
MPT-30B is an open-source foundation model, licensed under Apache 2.0 for commercial use, that exceeds the quality of GPT-3 (as reported in the original paper) and is competitive with other open-source models such as LLaMa-30B and Falcon-40B.

Using our publicly available LLM Foundry codebase, we trained MPT-30B over the course of 2 months, transitioning between multiple different NVIDIA A100 clusters as hardware availability changed, with an average MFU of >46%. In mid-June, after we received our first batch of 256 NVIDIA H100 GPUs from CoreWeave, we seamlessly moved MPT-30B to the new cluster to resume training on H100s with an average MFU of >35%. To the best of our knowledge, MPT-30B is the first public model to be (partially) trained on H100 GPUs! We found that throughput increased by 2.44x per GPU, and we expect this speedup to increase as software matures for the H100.
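As a rough sanity check on these numbers: MFU (model FLOPs utilization) is the share of a GPU's peak FLOP/s that goes to useful model computation. The sketch below combines the A100 MFU with the 2.44x per-GPU speedup to see what H100 MFU they imply. The peak throughput figures (312 and 989 TFLOP/s for dense BF16) are assumptions taken from vendor datasheets, not from this post, and the function names are illustrative.

```python
# Assumed peak dense BF16 throughput per GPU in TFLOP/s. These are datasheet
# figures (A100 and H100 SXM), not values stated in the post.
A100_PEAK_TFLOPS = 312.0
H100_PEAK_TFLOPS = 989.0


def achieved_tflops(mfu: float, peak_tflops: float) -> float:
    """Useful model throughput per GPU implied by a given MFU."""
    return mfu * peak_tflops


def implied_mfu(base_mfu: float, base_peak: float, speedup: float, new_peak: float) -> float:
    """MFU on new hardware, given the per-GPU speedup over the baseline hardware."""
    return achieved_tflops(base_mfu, base_peak) * speedup / new_peak


a100_tflops = achieved_tflops(0.46, A100_PEAK_TFLOPS)
h100_mfu = implied_mfu(0.46, A100_PEAK_TFLOPS, 2.44, H100_PEAK_TFLOPS)

print(f"A100 achieved throughput: {a100_tflops:.1f} TFLOP/s per GPU")
print(f"Implied H100 MFU: {h100_mfu:.1%}")
```

Under those assumptions the implied H100 MFU comes out at roughly 35%, which is consistent with the lower H100 MFU reported alongside the higher per-GPU throughput.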

As mentioned earlier, MPT-30B was trained with a long context window of 8k tokens (vs. 2k for LLaMa and Falcon) and can handle arbitrarily long context windows via ALiBi or with fine-tuning. To build 8k support into MPT-30B efficiently, we first pre-trained on 1T tokens using sequences that were 2k tokens long, and continued training for an additional 50B tokens using sequences that were 8k tokens long.
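To illustrate why ALiBi is not tied to a fixed context length, here is a minimal sketch of the bias it adds to attention scores, following the formulation in the ALiBi paper rather than MPT's actual implementation. It assumes a power-of-two number of heads, and the function names are illustrative.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    # This sketch only handles the power-of-two case.
    assert n_heads > 0 and n_heads & (n_heads - 1) == 0, "n_heads must be a power of two"
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(n_heads: int, seq_len: int) -> List[List[List[float]]]:
    """bias[h][i][j] = -slope_h * (i - j) for keys j <= i; future keys get -inf."""
    neg_inf = float("-inf")
    return [
        [
            [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for slope in alibi_slopes(n_heads)
    ]


# The bias depends only on the query-key distance, so the same function works
# for 2k, 8k, or longer sequences without any learned position embeddings.
bias = alibi_bias(n_heads=8, seq_len=4)
print(alibi_slopes(8)[:2])
print(bias[0][3])
```

Because the penalty is a fixed linear function of distance, nothing in the model has to be resized when the sequence gets longer. Quality at lengths far beyond those seen in training is a separate question, which is where fine-tuning on longer sequences comes in.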

The data mix used for MPT-30B pre-training is very similar to that of MPT-7B (see the MPT-7B blog post for details). For the 2k context window pre-training, we used 1T tokens from the same 10 data subsets as the MPT-7B model (Table 1), but in slightly different proportions.