Lag-Llama is based on the recent decoder-only transformer LLaMA architecture, which incorporates pre-layer normalization via RMSNorm and applies Rotary Positional Encoding (RoPE) to each attention layer's query and key representations.
At inference time, given a time series of sufficient length, we construct a feature vector that is passed to the model to obtain the distribution of the next time point.
We generate long-term predictions via autoregressive decoding, feeding previously sampled predicted values back into the model to generate further predictions up to the chosen prediction horizon P ≥ 1.
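The decoding loop described above can be sketched as follows. The `model.predict_distribution` interface is a hypothetical stand-in for however the model maps a context window to a predictive distribution; the key point is that each sampled value is appended to its own path and fed back in, yielding an empirical forecast distribution at every step of the horizon.

```python
import numpy as np

def autoregressive_forecast(model, context, horizon, num_samples=100):
    """Sketch of autoregressive probabilistic decoding.

    `model.predict_distribution(paths)` is an assumed interface that
    returns an object with a `.sample()` method giving one draw per path.
    """
    # Replicate the context so each sample path evolves independently.
    paths = np.tile(np.asarray(context, dtype=float), (num_samples, 1))
    for _ in range(horizon):
        dist = model.predict_distribution(paths)   # one distribution per path
        next_vals = np.asarray(dist.sample())      # one draw per path
        paths = np.concatenate([paths, next_vals[:, None]], axis=1)
    # Return only the generated horizon: shape (num_samples, horizon).
    return paths[:, -horizon:]
```

Quantiles of the returned array along the sample axis then give calibrated prediction intervals at each future time point.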
Fig. 2: The input at each timestep is the value of a univariate time series xⁱₜ, concatenated with its covariate vector cⁱₜ. The model passes the inputs through M masked decoder layers; the resulting features are then mapped to the parameters of a probability distribution by the distribution head.
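A minimal sketch of such a distribution head, assuming the final decoder features are projected to the parameters of a Student's t distribution (the choice of distribution family and layer sizes here are illustrative, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class StudentTHead(nn.Module):
    """Sketch: linear projection from decoder features to the
    parameters (df, loc, scale) of a Student's t distribution."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 3)  # one output per parameter

    def forward(self, h: torch.Tensor) -> torch.distributions.StudentT:
        df_raw, loc, scale_raw = self.proj(h).unbind(-1)
        # Softplus enforces positivity; the +2 offset keeps variance finite.
        df = 2.0 + nn.functional.softplus(df_raw)
        scale = nn.functional.softplus(scale_raw) + 1e-6
        return torch.distributions.StudentT(df, loc, scale)
```

Training then minimizes the negative log-likelihood `-dist.log_prob(target)` of the observed next value under the predicted distribution.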
Our tokenization method uses lag features (values at specific points in the series' history) and date-time features (derived from the timestamp). We develop stratified batch sampling and normalization strategies that prove effective during pretraining.
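The tokenization can be sketched as follows. The specific lag set and the particular date-time covariates below are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np
from datetime import date

def make_lag_token(series, t, lags=(1, 2, 3, 7, 14), timestamp=None):
    """Sketch of lag-feature tokenization: the token for time t is the
    vector of past values at the given lag offsets, optionally
    concatenated with date-time covariates. Lag set and covariates
    here are illustrative."""
    lag_vals = np.array([series[t - l] for l in lags], dtype=float)
    if timestamp is not None:
        # Example date-time features: day-of-week and month, scaled to [0, 1].
        dt_feats = np.array([timestamp.weekday() / 6.0,
                             (timestamp.month - 1) / 11.0])
        return np.concatenate([lag_vals, dt_feats])
    return lag_vals
```

In practice the lag indices would be chosen per sampling frequency (e.g. daily, weekly, and yearly offsets for daily data), so that one fixed tokenizer covers heterogeneous pretraining corpora.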
Our pretraining corpus consists of 27 datasets spanning six domains: energy, transportation, economics, nature, air quality, and cloud operations, with close to 8K univariate time series and 352M tokens.
Acknowledgements: OLCF, ServiceNow, Morgan Stanley, CERC-AAI etc.
Forecasts on the unseen datasets closely match the ground truth.