We would like to thank the Oak Ridge Leadership Computing Facility (OLCF) for granting us access to their Summit and Frontier supercomputers for these projects, carried out as part of an INCITE compute grant at a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. As part of this collaborative INCITE effort, several project tracks are being pursued; see the details below.

For an overview, see the talks by our team at the 6th Neural Scaling Workshop in Dec 2023:

A talk by Irina Rish on Complex systems view of large-scale AI systems (video) at the Alignment Workshop (Dec 2023).

Project 1: Pretraining LLMs From Scratch 

Together Blog: Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models

In April-May 2023, we used Summit to train and release two open-source LLMs, each trained on 1T tokens from the RedPajama dataset. These two models, RedPajama-INCITE-3B and RedPajama-INCITE-7B, were then fine-tuned externally into chat and instruction-tuned versions, respectively. They are now each downloaded thousands of times per month and have been ported to (and are currently running on) various environments, including the Apple iPhone. We are excited to report that RedPajama-INCITE-3B has been the best-performing 3-billion-parameter model since its release. After fine-tuning, the RedPajama-INCITE-3B model is even competitive with the 7B-parameter Llama model. Similarly, after fine-tuning, the RedPajama-INCITE-7B model was the best-performing 7B-parameter model at the time of release.

Overall, the Summit supercomputer was critical for training and releasing these models. The compute time on Summit allowed us to validate that it is possible to collect a high-quality pretraining dataset from public resources and to pretrain a competitive language model on it. Furthermore, this work has helped maintain an open ecosystem for large language models and contributed to AI democratization via open source, a continuing effort, with more and more open LLMs becoming available over time (recent examples include Falcon, MPT, and Llama-2).

Project 2: Continual Pretraining of LLMs

Team: Kshitij Gupta, Benjamin Therien, Adam Ibrahim, Mats Leon Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, Timothee Lesort

We trained three Pythia-like models, each with 9B parameters, on three different datasets: the Pile, SlimPajama, and a mix of both.

Presentations given by the team:

The 6th Neural Scaling Workshop: Continual Pretraining of Foundation Models

AI @ Scale: Continual Pretraining of Foundation Models

Paper: "Simple and Scalable Strategies to Continually Pre-train Large Language Models" (submitted).

Earlier version: Continual Pre-Training of Large Language Models: How to (re)warm your model, in the Efficient Natural Language and Speech Processing (ENLSP-III) workshop at NeurIPS 2023.

Blog: Continual Learning of Foundation Models: CL-FoMo Suite of 9B and 420M LLMs

Project Summary: The main focus of this project is developing strategies to continually pre-train models on new datasets, avoiding the cost of retraining from scratch each time new data arrives. We study the impact of warmup, learning-rate, and replay strategies on model performance when training on IID versus sequential datasets.
Access to the Summit supercomputer was essential for testing the impact of these strategies at scale (410M- and 9B-parameter models, 600B tokens: 300B from the Pile and 300B from SlimPajama). From this research, we know that we can increase ("rewarm") the learning rate of a pretrained model and then decay it, improving performance on new datasets while minimizing forgetting on old ones. We can therefore rewarm the learning rate of old checkpoints to continue pretraining on new datasets, enabling transfer between upstream and downstream tasks to the point of outperforming models optimized directly for the downstream tasks. This work ultimately contributes to the democratization of AI by showing a way to drastically reduce the cost of keeping foundation models relevant as new datasets are collected.
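To make the rewarm-then-decay idea concrete, here is a minimal sketch of a learning-rate schedule that linearly re-warms the LR of a pretrained checkpoint up to a chosen maximum and then decays it with a cosine schedule. The function name and all parameter values are ours for illustration; they are not taken from the project's training code.

```python
import math

def rewarm_cosine_lr(step, warmup_steps, total_steps, max_lr, min_lr):
    """Linearly re-warm the learning rate from min_lr to max_lr,
    then decay it back to min_lr with a cosine schedule."""
    if step < warmup_steps:
        # Rewarming phase: ramp the LR back up before continuing pretraining.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay phase over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a training loop, this would be evaluated once per step and assigned to the optimizer's learning rate before continuing pretraining on the new dataset.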

Project 3: Aligned Multimodal Models 

Project 3.1: Developing Robin Suite of Vision-Language Models

Team (alphabetical order): Alexis Roger, Andrew R. Williams, Daniel Kaplan, Edwin Fennell, George Adamopoulos, Kshitij Gupta, Prateek Humane, Quentin Anthony, Rishika Bhagwatkar, Sun Qi, Yuchen Lu, Irina Rish

Blog: Release of Robin v1.0 - a Suite of Multimodal Models

Project Summary: This project focuses on multimodal models, more precisely Vision-Language Models (VLMs), which take an image and text as input and output text. We train and evaluate models built on many different large language models (Pythia, GPT, Mistral, ...) and study their combination with various visual encoders, such as CLIP and SigLIP, within different architectures. We also have a keen interest in the alignment and ethics of these models.
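A common way to combine a visual encoder with an LLM is to project the encoder's patch features into the LLM's embedding space and prepend them to the text tokens. The sketch below illustrates that fusion pattern in plain Python; the function names, dimensions, and values are ours for illustration and are not taken from the Robin codebase.

```python
def matmul(a, b):
    """Minimal dense matrix multiply over nested lists (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fuse_image_and_text(image_features, token_embeddings, projection):
    """Project visual-encoder features into the LLM embedding space and
    prepend them to the text token embeddings as extra 'visual tokens'."""
    visual_tokens = matmul(image_features, projection)
    return visual_tokens + token_embeddings

# Toy dimensions: 4 image patches with 8-dim features, 5 text tokens, LLM width 16.
image_features = [[0.1] * 8 for _ in range(4)]
token_embeddings = [[0.2] * 16 for _ in range(5)]
projection = [[0.05] * 16 for _ in range(8)]
sequence = fuse_image_and_text(image_features, token_embeddings, projection)
# The fused sequence has 4 visual tokens followed by 5 text tokens, each 16-dim,
# and can then be fed to the LLM like an ordinary token sequence.
```

The projection matrix is typically the main trainable bridge between a frozen visual encoder and the language model in this family of architectures.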

Project 3.2: Aligning Vision-Language Models
Blog: TBA
Papers & presentations:

Project Summary:   TBA

Project 4: Time-Series Foundation Models 

Team: Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, Irina Rish

Project Summary: Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-shot and few-shot generalization. However, despite their success in modalities such as natural language and vision, foundation models for time series forecasting have lagged behind (pun intended :)

We introduce Lag-Llama: the first open-source foundation model for time series forecasting!

Paper: Lag-Llama: Towards Foundation Models for Time Series Forecasting, an extended version of our workshop paper in the R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models workshop at NeurIPS 2023.