Continual Learning of Foundation Models:
CL-FoMo Suite of 9.6B and 410M LLMs


This is a progress update (stay tuned - next update mid-January) on our CL-FoMo (Continually-Learned Foundation Models) suite of open-source LLMs, containing four 410M models and four 9.6B models trained on Pile, SlimPajama (SP), the Pile+SP mix, and the continual sequence (Pile, SP). On most benchmarks, our 9.6B continual model is competitive with the baseline joint model trained on the mixed Pile+SP data (we call it IID Pile+SP to underscore that samples in the mix are identically distributed), and it attains performance similar to RedPajama 7B and Pythia 12B on the featured benchmarks.

See also:

Presentation given by the team at the 6th Scaling workshop: Continual Pretraining of Foundation Models.
Paper (extended version TBA in Jan): Continual Pre-Training of Large Language Models: How to (re)warm your model, in the Efficient Natural Language and Speech Processing (ENLSP-III) workshop at NeurIPS 2023.

Main points

Our focus is on developing best practices for continual pretraining of foundation models, in particular regarding the learning rate schedule and replay, so that training does not have to restart from scratch each time a new dataset becomes available.
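As a rough illustration of the replay component, the sketch below interleaves a small fraction of samples from the previous dataset (e.g., Pile) into the stream of new data (e.g., SlimPajama). The replay fraction and the plain-Python iterators are illustrative assumptions, not the exact scheme used in our training pipeline.

```python
import random

def replayed_stream(new_data, old_data, replay_fraction=0.05, seed=0):
    """Yield samples mostly from the new dataset, occasionally replaying
    samples from the old one to mitigate forgetting.

    `replay_fraction` is an assumed probability of drawing an old sample;
    the value used in actual experiments may differ.
    """
    rng = random.Random(seed)
    new_it, old_it = iter(new_data), iter(old_data)
    while True:
        try:
            if rng.random() < replay_fraction:
                yield next(old_it)   # replayed sample from the previous dataset
            else:
                yield next(new_it)   # fresh sample from the new dataset
        except StopIteration:        # stop when either stream is exhausted
            return

# Example: roughly 5% of the resulting stream comes from the "old" data.
stream = replayed_stream(new_data=range(1000), old_data=range(-1000, 0))
first_batch = [next(stream) for _ in range(16)]
```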

While this is a work in progress, we already have several consistent observations; the main ones are summarized in the sections and conclusions below.

Introduction: why continual pretraining?

Large-scale unsupervised pre-trained models, a.k.a. foundation models, are taking the AI field by storm: they beat prior state-of-the-art performance by a wide margin and achieve impressive few-shot generalization on a variety of tasks across multiple domains. These successes are possible because the performance of foundation models scales consistently with model size, dataset size, and compute. Neural scaling laws predicting such capability advances have been observed and characterized in several recent works.
While the initial trend (2020-2021) was primarily to increase model size, following the original Kaplan scaling laws, the more recent trend is towards smaller models trained on (significantly) larger datasets, with growing attention to data quality. The shift started with the discovery of better, compute-optimal scaling laws in the Chinchilla paper, Training Compute-Optimal Large Language Models. The LLaMA models released by Meta further improved the state of the art by training relatively small models on datasets even larger than Chinchilla-optimal, and also emphasized the importance of taking into account not only training efficiency but also inference efficiency. As a result, for example, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.

However, despite the much larger datasets used to train recent models such as LLaMA 2 (see below), their performance is far from saturated (see the figure below, on the right). Clearly, the LLaMAs (and other LLM species) are still quite data-hungry, and we can keep feeding them more data to improve their performance (the quality of the data matters, of course - "we are what we eat", and LLMs are no exception). However, there is a clear inefficiency in the way current models are trained: the countless (open-source) models rapidly expanding the LLM ecosystem are all pre-trained from scratch on billions of tokens, only for the process to be restarted once new data becomes available. A much cheaper and more efficient solution would be to continually pre-train these models, i.e., update pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data; moreover, learning from a sequence of datasets may be inferior to learning from a single mixed dataset with shuffled samples.

This project aims to develop best practices that allow practitioners to learn from (potentially infinite) sequences of continually arriving datasets, while keeping model performance close to (or even better than) that of the "ideal" model obtained by training on all data at once. We summarize the main points and recent results below.

Number of parameters vs tokens seen during training for various models. The number of parameters of each model is indicated in parentheses next to each model.  Source.

Diverse types of new data

Growth of the Common Crawl corpus over time. Source: https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize

Current practice


When a new dataset becomes available, it is typically merged with the existing datasets into a weighted mixture, and models are re-trained from scratch on this mixture. This approach, although common, is quite expensive. Do we really need to start over each time new data is introduced?


Is there an alternative? Can we instead build "lifelong" learning foundation models?

A viable solution might be Continual Learning! Rather than starting afresh, we can keep training on new data using existing checkpoints, continuously updating the model’s knowledge and broadening its capabilities. This approach is especially beneficial in scenarios where we accumulate new data continuously, such as generalist agents and RLHF.
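As a minimal sketch of what this looks like in practice (assuming a PyTorch + HuggingFace transformers stack and a hypothetical checkpoint path; our actual large-scale training setup is not shown here), continual pre-training simply resumes from an existing checkpoint and keeps optimizing the usual language-modeling loss on the new corpus:

```python
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path -- substitute any existing pre-trained causal LM.
CHECKPOINT = "path/to/pretrained-checkpoint"

model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
if tokenizer.pad_token is None:              # many causal-LM tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model.train()

# Illustrative hyperparameters only; in practice the learning rate is
# re-warmed and then decayed (see the schedule sketch further below).
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def continual_pretrain_step(batch_texts):
    """One causal-LM update on a batch of texts drawn from the *new* dataset."""
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True, max_length=1024)
    # For brevity, padded positions are not masked out of the loss here.
    out = model(**enc, labels=enc["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```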

So, what's the challenge? There are two main questions to address: how to adapt effectively to the new data distribution, and how to avoid (catastrophic) forgetting of what the model learned from previous data.

Experimental Setup  

Pre-training datasets: (1) Pile, (2) SlimPajama (SP), and (3) their union, Pile+SP.

Composition of SlimPajama by Category

Settings and baselines

SlimPajama - 300B tokens (lower bound?)

Pile - 300B tokens -> SlimPajama - 300B tokens (ours: the continual setting)

Pile + SlimPajama - 600B tokens, trained jointly (upper bound?)
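To make the difference between the three settings concrete, here is a toy sketch of how the corresponding training streams could be assembled (document-level shuffling and the setting names are illustrative simplifications; in practice this happens at the level of tokenized sequences inside the data loader):

```python
import random

def training_stream(pile_docs, sp_docs, setting, seed=0):
    """Return the ordered training samples for each baseline setting.

    "sp_only"   : SlimPajama alone (the lower bound above).
    "continual" : all of Pile first, then all of SlimPajama (our setting).
    "iid_union" : Pile and SlimPajama shuffled together (the joint baseline).
    """
    rng = random.Random(seed)
    if setting == "sp_only":
        return list(sp_docs)
    if setting == "continual":
        return list(pile_docs) + list(sp_docs)
    if setting == "iid_union":
        mixed = list(pile_docs) + list(sp_docs)
        rng.shuffle(mixed)               # samples become identically distributed
        return mixed
    raise ValueError(f"unknown setting: {setting}")
```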

Model suite: 410M and 9.6B parameter models

Research questions

Effect of warm-up duration on validation loss (410M)

Effect of varying the maximum learning rate
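The two knobs varied in these experiments are the warm-up duration and the maximum learning rate of the schedule applied when continuing training. A minimal sketch of such a schedule is shown below (linear re-warming followed by cosine decay; the constants in the example are placeholders, not the hyperparameters of our runs):

```python
import math

def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to `max_lr` over `warmup_steps`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps                # linear (re-)warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (max_lr - min_lr) * cosine                   # cosine decay

# Example: re-warm over 1% of a 100k-step run to a placeholder peak of 3e-4.
schedule = [lr_at(s, max_lr=3e-4, warmup_steps=1_000, total_steps=100_000)
            for s in range(100_000)]
```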

 

Continual pre-training from non-converged checkpoints






Forgetting and adaptation: 410M models

Forgetting and adaptation: 9.6B models

Continual Model (Pile, SP) matches the performance of the baseline joint model (Pile+SP)

Our 9.6B parameter model attains similar performance to RedPajama 7B and Pythia 12B on featured benchmarks

Our continually pre-trained 9.6B parameter model assumes that an existing pre-trained model is already available (e.g., one of the LLMs on HuggingFace); therefore, its total FLOPs are measured over the continual training phase only. The plots compare models trained with different compute budgets. While the models and the amounts of data used are not directly comparable, we observe that continuing to pre-train from an existing checkpoint achieves similar or stronger performance on the featured benchmarks despite using fewer FLOPs.
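As a rough back-of-the-envelope check (using the common ~6·N·D approximation for transformer training FLOPs; the numbers in our plots are measured differently and will not match exactly), the continual phase costs about half the compute of the joint Pile+SP baseline:

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

continual_phase = train_flops(9.6e9, 300e9)   # continue on SlimPajama (300B tokens)
joint_baseline  = train_flops(9.6e9, 600e9)   # retrain on Pile + SlimPajama (600B tokens)

print(f"continual phase: {continual_phase:.2e} FLOPs")   # ~1.7e+22
print(f"joint baseline : {joint_baseline:.2e} FLOPs")    # ~3.5e+22
```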

Comparison with other popular open-source models

Conclusions

Our results demonstrate that continual pre-training, i.e., updating an existing checkpoint with new data rather than re-training from scratch, greatly reduces the cost of producing competitive language models.