BLOOM (BigScience Language Open-science Open-access Multilingual): the BigScience 176-billion-parameter model is currently training.
The training started on March 11, 2022 at 11:42am PST and will last 3-4 months on 416 A100 GPUs of the Jean Zay public supercomputer.
Follow the training at https://twitter.com/BigScienceLLM
And the TensorBoard, detailed model card, and intermediate checkpoints at https://hf.co/bigscience/bloom
Send questions about the training to bigscience-large-model-training [AT] googlegroups [.] com
More information on the BigScience project: start with the BigScience homepage.
Summary:
- The model:
  - 176B-parameter decoder-only architecture (GPT-like)
  - 70 layers - 112 attention heads per layer - hidden dimensionality of 14336 - 2048-token sequence length
  - ALiBi positional embeddings - GeLU activation function (a parameter-count check and an ALiBi bias sketch follow after this summary)
  - More information:
- The dataset:
- The engineering side:
  - number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
  - one copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
  - checkpoint size: the bf16 weights alone are 329GB; the full checkpoint with optimizer states is 2.3TB (see the size arithmetic after this summary)
  - training throughput: about 150 TFLOPs per GPU
  - estimated training time: 3-4 months depending on throughput and unexpected events (a back-of-the-envelope estimate follows after this summary)
  - More information:
- Environmental considerations:
  - Jean Zay, the supercomputer we are using for model training, is mostly powered by nuclear energy, which is a low-carbon energy source.
  - Significant efforts were made to make sure that the computing infrastructure is as efficient as possible — the heat generated by the hardware even gets used for heating buildings on campus!
  - More information:
  - We are currently working on making a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference.
  - More soon!
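
As a quick sanity check on the architecture numbers above (70 layers, hidden dimensionality 14336), the sketch below estimates the parameter count with the usual ~12·L·h² approximation for a GPT-style decoder plus the token embedding matrix. The vocabulary size is not listed in this summary, so the ~250k value used here is an assumption for illustration only.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder.
n_layers = 70
hidden = 14336
vocab_size = 250_000  # assumption for illustration; not given in the summary above

# Per layer: attention projections (~4*h^2) plus a 4x-wide MLP (~8*h^2),
# ignoring biases and layer norms.
per_layer = 12 * hidden ** 2
embeddings = vocab_size * hidden  # token embedding matrix

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.0f}B parameters")  # ~176B, matching the summary
```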
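The model uses ALiBi instead of learned or rotary position embeddings: each attention head adds a linearly decaying penalty to its attention scores based on the query-key distance, with a fixed per-head slope. Below is a minimal PyTorch sketch of that bias for a power-of-two head count (the published ALiBi recipe interpolates extra slopes for counts like the 112 heads listed above); it is an illustration, not the training codebase's implementation.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper, starting at 2^(-8/n).
    # This simple form assumes n_heads is a power of two; the paper interpolates
    # extra slopes for other head counts (e.g. the 112 heads listed above).
    ratio = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([ratio ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Bias added to the pre-softmax attention scores: slope * (key_pos - query_pos).
    # Under a causal mask only key_pos <= query_pos is used, so the bias is <= 0 and
    # grows more negative the farther a key sits from the query.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]                   # (seq, seq)
    return alibi_slopes(n_heads)[:, None, None] * distance   # (n_heads, seq, seq)

# Usage: scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(n_heads, seq_len)
```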
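The checkpoint sizes quoted in the engineering bullets follow directly from the parameter count: 2 bytes per parameter for the bf16 weights, and roughly 2 + 4 + 4 + 4 bytes per parameter once fp32 master weights and the two Adam optimizer moments are included. That per-parameter breakdown is an assumption based on standard mixed-precision training, not something stated above.

```python
# Rough checkpoint-size arithmetic (binary units, GiB/TiB).
params = 176e9

bf16_weights = params * 2  # 2 bytes per bf16 parameter
# Assumed mixed-precision layout: bf16 weights + fp32 master weights
# + fp32 Adam first and second moments = 2 + 4 + 4 + 4 bytes per parameter.
full_checkpoint = params * (2 + 4 + 4 + 4)

print(f"bf16 weights:    {bf16_weights / 2**30:.0f} GiB")     # ~328 GiB, quoted as 329GB
print(f"full checkpoint: {full_checkpoint / 2**40:.1f} TiB")  # ~2.2 TiB, quoted as 2.3TB
```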
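The 3-4 month estimate can be reproduced with the standard ≈6·N·D FLOPs rule, the 384 training GPUs, and the ~150 TFLOPs-per-GPU throughput from the engineering bullets. The total token budget D is not given in this summary, so the ~350B figure below is an assumption for illustration; the result lands in the right range once restarts and other downtime are added on top.

```python
# Back-of-the-envelope training-time estimate.
params = 176e9
tokens = 350e9          # assumption for illustration; the token budget is not listed above
flops_per_gpu = 150e12  # ~150 TFLOPs sustained per GPU (from the engineering bullets)
n_gpus = 384

total_flops = 6 * params * tokens  # ~6 FLOPs per parameter per token (forward + backward)
seconds = total_flops / (flops_per_gpu * n_gpus)
print(f"~{seconds / 86400:.0f} days of compute")  # ~74 days of pure compute; 3-4 months wall-clock with downtime
```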