BLOOM (BigScience Language Open-science Open-access Multilingual): the BigScience 176-billion-parameter model is currently training.
The training started on March 11, 2022 at 11:42am PST and will last 3-4 months on 416 A100 GPUs of the Jean Zay public supercomputer.
Follow the training at https://twitter.com/BigScienceLLM
And the TensorBoard, detailed model card, and intermediate checkpoints at https://hf.co/bigscience/bloom
Send questions about the training to bigscience-large-model-training [AT] googlegroups [.] com
More information on the BigScience project: start with the BigScience homepage.
Summary:
- The model:
  - 176B-parameter decoder-only architecture (GPT-like)
  - 70 layers - 112 attention heads per layer - hidden dimensionality of 14336 - 2048-token sequence length
  - ALiBi positional embeddings - GeLU activation function (a parameter-count check and an ALiBi bias sketch follow after this summary)
  - More information:
- The dataset:
- The engineering side:
  - number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
  - one copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
  - checkpoint size: the bf16 weights alone are 329GB; the full checkpoint with optimizer states is 2.3TB (see the size arithmetic after this summary)
  - training throughput: about 150 TFLOPs per GPU
  - estimated training time: 3-4 months depending on throughput and unexpected events (a back-of-the-envelope estimate follows after this summary)
  - More information:
- Environmental considerations:
  - Jean Zay, the supercomputer we are using for model training, is mostly powered by nuclear energy, which is a low-carbon energy source.
  - Significant efforts were made to make sure that the computing infrastructure is as efficient as possible — the heat generated by the hardware even gets used for heating buildings on campus!
  - More information:
  - We are currently working on making a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference.
  - More soon!
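
As a quick sanity check on the architecture numbers above (70 layers, hidden dimensionality 14336), the sketch below estimates the parameter count with the usual ~12·L·h² approximation for a GPT-style decoder plus the token embedding matrix. The vocabulary size is not listed in this summary, so the ~250k value used here is an assumption for illustration only.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder.
n_layers = 70
hidden = 14336
vocab_size = 250_000  # assumption for illustration; not given in the summary above

# Per layer: attention projections (~4*h^2) plus a 4x-wide MLP (~8*h^2),
# ignoring biases and layer norms.
per_layer = 12 * hidden ** 2
embeddings = vocab_size * hidden  # token embedding matrix

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.0f}B parameters")  # ~176B, matching the summary
```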
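The model uses ALiBi instead of learned or rotary position embeddings: each attention head adds a linearly decaying penalty to its attention scores based on the query-key distance, with a fixed per-head slope. Below is a minimal PyTorch sketch of that bias for a power-of-two head count (the published ALiBi recipe interpolates extra slopes for counts like the 112 heads listed above); it is an illustration, not the training codebase's implementation.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper, starting at 2^(-8/n).
    # This simple form assumes n_heads is a power of two; the paper interpolates
    # extra slopes for other head counts (e.g. the 112 heads listed above).
    ratio = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([ratio ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Bias added to the pre-softmax attention scores: slope * (key_pos - query_pos).
    # Under a causal mask only key_pos <= query_pos is used, so the bias is <= 0 and
    # grows more negative the farther a key sits from the query.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]                   # (seq, seq)
    return alibi_slopes(n_heads)[:, None, None] * distance   # (n_heads, seq, seq)

# Usage: scores = q @ k.transpose(-2, -1) / head_dim**0.5 + alibi_bias(n_heads, seq_len)
```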
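The checkpoint sizes quoted in the engineering bullets follow directly from the parameter count: 2 bytes per parameter for the bf16 weights, and roughly 2 + 4 + 4 + 4 bytes per parameter once fp32 master weights and the two Adam optimizer moments are included. That per-parameter breakdown is an assumption based on standard mixed-precision training, not something stated above.

```python
# Rough checkpoint-size arithmetic (binary units, GiB/TiB).
params = 176e9

bf16_weights = params * 2  # 2 bytes per bf16 parameter
# Assumed mixed-precision layout: bf16 weights + fp32 master weights
# + fp32 Adam first and second moments = 2 + 4 + 4 + 4 bytes per parameter.
full_checkpoint = params * (2 + 4 + 4 + 4)

print(f"bf16 weights:    {bf16_weights / 2**30:.0f} GiB")     # ~328 GiB, quoted as 329GB
print(f"full checkpoint: {full_checkpoint / 2**40:.1f} TiB")  # ~2.2 TiB, quoted as 2.3TB
```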
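The 3-4 month estimate can be reproduced with the standard ≈6·N·D FLOPs rule, the 384 training GPUs, and the ~150 TFLOPs-per-GPU throughput from the engineering bullets. The total token budget D is not given in this summary, so the ~350B figure below is an assumption for illustration; the result lands in the right range once restarts and other downtime are added on top.

```python
# Back-of-the-envelope training-time estimate.
params = 176e9
tokens = 350e9          # assumption for illustration; the token budget is not listed above
flops_per_gpu = 150e12  # ~150 TFLOPs sustained per GPU (from the engineering bullets)
n_gpus = 384

total_flops = 6 * params * tokens  # ~6 FLOPs per parameter per token (forward + backward)
seconds = total_flops / (flops_per_gpu * n_gpus)
print(f"~{seconds / 86400:.0f} days of compute")  # ~74 days of pure compute; 3-4 months wall-clock with downtime
```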