Lag-Llama: Towards Time-Series Foundation Models 

TL;DR: In recent years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-shot and few-shot generalization. However, despite their success in modalities such as natural language processing and computer vision, foundation models for time series forecasting have lagged behind. We introduce Lag-Llama: the first open-source foundation model for time series forecasting!

Lag-Llama is a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches and emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state of the art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.

A LLaMA-like architecture for time series

Lag-Llama is based on the recent decoder-only transformer-based LLaMA architecture, which incorporates pre-layer normalization via RMSNorm and applies Rotary Positional Encoding (RoPE) to each attention layer’s query and key representations.
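To make these two building blocks concrete, here is a minimal sketch of RMSNorm and RoPE in plain Python. It is not the Lag-Llama implementation: the function names, the `eps` value and the `base` of 10000 are assumptions chosen for illustration, and vectors are plain lists to keep the example dependency-free.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to position."""
    dim = len(x)
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, dim, 2):
        # Each pair (i, i + 1) gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        cos, sin = math.cos(theta), math.sin(theta)
        out.append(x[i] * cos - x[i + 1] * sin)
        out.append(x[i] * sin + x[i + 1] * cos)
    return out

def attention_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))

# Illustrative vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
same_offset = (attention_score(q, k, 5, 3), attention_score(q, k, 9, 7))
```

Because the query and the key are rotated by angles proportional to their own positions, the score depends only on the offset between them, so the two entries of `same_offset` agree up to floating-point error. This is why rotating only queries and keys is enough to inject relative position information.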

At inference time, given a time series of sufficient length, we construct a feature vector that is passed to the model to obtain the distribution of the next time point. 

We generate long-term predictions via greedy autoregressive decoding, feeding previously generated samples of predicted values back into the model to generate further predictions up to our chosen prediction horizon P ≥ 1.
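The decoding loop described above can be sketched as follows. Everything model-specific is a placeholder: `toy_next_distribution` is a hypothetical stand-in for the network, and the Gaussian used for sampling is an assumption made only so the example runs on its own.

```python
import random

def toy_next_distribution(context):
    """Hypothetical stand-in for the model: returns (mean, std) of the next value."""
    recent = context[-3:]
    return sum(recent) / len(recent), 0.1

def sample_paths(history, horizon, num_paths, next_distribution, seed=0):
    """Roll the one-step distribution forward, feeding each sample back as input."""
    rng = random.Random(seed)
    paths = []
    for _ in range(num_paths):
        context = list(history)  # copy so the caller's history is not mutated
        for _ in range(horizon):
            mean, std = next_distribution(context)
            # Assumption: a Gaussian one-step distribution, for illustration only.
            context.append(rng.gauss(mean, std))
        paths.append(context[len(history):])
    return paths

history = [1.0, 2.0, 3.0]
paths = sample_paths(history, horizon=4, num_paths=5,
                     next_distribution=toy_next_distribution)
```

Each entry of `paths` is one possible trajectory of length P (here 4). Drawing several trajectories and taking empirical quantiles at each step is one common way to turn such samples into prediction intervals.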

Fig 2: The input at each timestep is the value of a univariate time series at a given timestep xⁱₜ concatenated with its covariate vector cⁱₜ. The model projects the inputs through M masked decoder layers. The features are then mapped to the parameters of a probability distribution through the distribution head.
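As a rough illustration of the last step in the figure, the sketch below maps a feature vector to the parameters of a distribution. The caption does not say which distribution is used, so a generic location and scale pair is assumed here; the weights, the `min_scale` floor and the function names are all hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z)), mapping any real to a non-negative value."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def distribution_head(features, w_loc, b_loc, w_scale, b_scale, min_scale=1e-6):
    """Project features to a location and a strictly positive scale."""
    loc = sum(w * f for w, f in zip(w_loc, features)) + b_loc
    raw_scale = sum(w * f for w, f in zip(w_scale, features)) + b_scale
    # The scale must be positive, so the raw projection goes through softplus.
    return loc, softplus(raw_scale) + min_scale

# Illustrative weights (hypothetical values, not trained parameters).
loc, scale = distribution_head(
    features=[1.0, 2.0],
    w_loc=[0.5, -0.5], b_loc=0.1,
    w_scale=[-3.0, -3.0], b_scale=0.0,
)
```

The location is left unconstrained while the scale is forced to be positive, which is the usual reason for putting a separate head between the decoder features and the distribution parameters.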

Our tokenization method uses lag features (specific points in the history) and date-time features (indicating the timestamp).  We develop stratified batch sampling and normalization strategies which prove effective during pretraining. 
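A minimal sketch of this tokenization idea, assuming an hourly series: the lag set `[1, 2, 24]`, the choice of calendar features and their scaling to [-0.5, 0.5] are illustrative assumptions, not the configuration used by Lag-Llama.

```python
from datetime import datetime, timedelta

def lag_features(series, t, lags):
    """Values of the series at t - lag for each lag; needs t >= max(lags)."""
    if t < max(lags):
        raise ValueError("not enough history for the largest lag")
    return [series[t - lag] for lag in lags]

def datetime_features(ts):
    """A few calendar features, each scaled to the range [-0.5, 0.5]."""
    return [
        ts.hour / 23.0 - 0.5,
        ts.weekday() / 6.0 - 0.5,
        (ts.day - 1) / 30.0 - 0.5,
        (ts.month - 1) / 11.0 - 0.5,
    ]

def build_token(series, timestamps, t, lags):
    """Concatenate the current value, its lagged values and its date-time features."""
    return ([series[t]]
            + lag_features(series, t, lags)
            + datetime_features(timestamps[t]))

# Toy hourly series whose value equals its index, so the lags are easy to read off.
start = datetime(2024, 1, 1)
series = [float(i) for i in range(48)]
timestamps = [start + timedelta(hours=i) for i in range(48)]
token = build_token(series, timestamps, t=30, lags=[1, 2, 24])
```

Here `token` starts with the value at step 30 followed by the values 1, 2 and 24 steps earlier, then the four calendar features. The largest lag determines how much history is needed before the first token can be built, which is why the series must be of sufficient length.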

Our pretraining corpus consists of 27 datasets across six different domains (energy, transportation, economics, nature, air quality and cloud operations), with close to 8K univariate time series and 352M tokens.

Acknowledgements: OLCF, ServiceNow, Morgan Stanley, CERC-AAI etc.

Forecasts on the unseen datasets closely match the ground truth.