Tiny Time Mixers (TTMs): Powerful Zero/Few-Shot Forecasting Models by IBM
A new lightweight open-source foundation model - tutorial included
If you follow the latest research on LLMs, you will notice two main approaches:
First, researchers focus on building the largest models possible. Pretraining on next-word prediction is crucial for enhancing performance (and it's where the millions of dollars are spent!)
Second, researchers use techniques like quantization to create smaller and faster models — while maintaining strong general performance.
However, interesting things happen when smaller models outperform much larger ones in some tasks. For example, Llama 3-8B outperformed the larger Llama 2-70B on the MMLU task!
Tiny Time Mixers (TTM)[1], introduced by IBM, follows the second approach. It’s a lightweight model that outperforms larger SOTA models — including MOIRAI and TimesFM. Plus, it’s open-source!
This article discusses:
The architecture and functionality of TTM.
The innovative features that make TTM exceptional.
Benchmark results comparing TTM with other models.
Let's get started!
Find the full TTM project tutorial here: Get Project #4
Enter Tiny Time Mixer (TTM)
TTM is a lightweight, MLP-based foundation TS model (≤1M parameters) that excels in zero-shot forecasting, even outperforming larger SOTA models.
Key characteristics of TTM:
Non-Transformer Architecture: TTM is extremely fast because there’s no Attention mechanism — it only uses fully-connected NN layers.
TSMixer Foundation: TTM leverages TSMixer[2] (IBM’s breakthrough time-series model) in its architecture.
Rich Inputs: Capable of multivariate forecasting, TTM accepts extra channels, exogenous variables, and known future inputs, enhancing its forecasting versatility.
Fast and Powerful: TTM-quick variant was pretrained on 244M samples of the Monash dataset, using 6 A100 GPUs in less than 8 hours.
Superior Zero-Shot Forecasting: TTM is pretrained and can readily be used for zero-shot forecasting, surpassing larger SOTA models on unseen data, including the recent foundation models such as MOIRAI and TimesFM!
Important Notes:
Note 1: There's a similar model by Google, also called TSMixer — published independently around the same time! Interestingly, Google's TSMixer is also an MLP-based model and achieves strong performance. In this article, we refer only to IBM's TSMixer[2].
Note 2: IBM's TSMixer (on which TTM is based) applies softmax after a linear projection to calculate importance weights — which are then multiplied with the hidden vectors to upscale or downscale each feature. The authors call this operation Gated Attention — but it is not traditional multi-head attention with queries, keys, values, and multiple heads. Therefore, neither TSMixer nor TTM (which builds on it) is characterized as a Transformer-based model.
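To make Note 2 concrete, here is a minimal PyTorch sketch of the gating operation described above. Module name, shapes, and sizes are illustrative assumptions — this is not IBM's actual implementation:

import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Gating as described in Note 2: a linear projection followed by softmax
    yields importance weights that rescale the hidden vectors (illustrative sketch)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, n_patches, hidden_size]
        weights = torch.softmax(self.proj(x), dim=-1)  # importance weights in [0, 1]
        return x * weights                             # upscale/downscale each feature

x = torch.randn(32, 8, 64)              # toy batch: 32 series, 8 patches, hidden size 64
print(GatedAttention(64)(x).shape)      # torch.Size([32, 8, 64])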
TTM Innovations
TTM introduces several groundbreaking features:
Multi-level modeling: TTM is first pretrained in a channel-independent way (univariate sequences) and uses cross-channel mixing during fine-tuning to learn multivariate dependencies.
Adaptive patching: Instead of using a single patch length, TTM learns various patch lengths across different layers. Since each time series performs optimally at a specific patch length, adaptive patches help the model generalize better across diverse data.
Resolution Prefix Tuning: Different frequencies (e.g. weekly, daily data) are challenging for foundation time-series models. TTM uses an extra embedding layer to encode time-series frequency — enabling the model to condition its predictions accurately based on the signal’s frequency.
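As a rough illustration of the Resolution Prefix Tuning idea, the sketch below prepends a learned frequency embedding to the patch embeddings. The module name, resolution ids, and shapes are assumptions for illustration only — not the actual TTM code:

import torch
import torch.nn as nn

class ResolutionPrefix(nn.Module):
    """Prepend a learned 'resolution' token so the model knows the sampling
    frequency of the input. Illustrative sketch, not the real TTM module."""
    def __init__(self, num_resolutions: int, hf: int):
        super().__init__()
        self.embed = nn.Embedding(num_resolutions, hf)   # one vector per frequency

    def forward(self, x: torch.Tensor, resolution_id: torch.Tensor) -> torch.Tensor:
        # x: [c, n, hf] patch embeddings; resolution_id: scalar id (e.g. 0=hourly, 1=daily)
        prefix = self.embed(resolution_id).expand(x.size(0), 1, -1)  # [c, 1, hf]
        return torch.cat([prefix, x], dim=1)             # n patches become n + 1

x = torch.randn(7, 8, 64)                                # 7 channels, 8 patches, hf=64
out = ResolutionPrefix(num_resolutions=10, hf=64)(x, torch.tensor(1))
print(out.shape)                                         # torch.Size([7, 9, 64])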
Tiny Time Mixers - Architecture
TSMixer, the precursor to TTM, is a solid model, but it cannot be used as a foundation model or handle external variables.
TTM uses TSMixer as a building block — and by introducing new features, the authors created a non-Transformer model that generalizes on unseen data.
The architecture of TTM is shown in Figure 1. We'll describe both phases, pretraining (left) and fine-tuning (right):
Notation: sl = context length, fl = forecasting length, c = number of channels (input features), c' = number of forecasting channels.
Note: If we have 5 input time-series, but 2 of them are known in the future (e.g. time_of_week, holiday), then we have c = 5 and c' = 5 - 2 = 3, because the two known series don't need to be estimated.
Pretraining
During pretraining, the model is trained with univariate time-series only.
First, we normalize each individual time-series. The final outputs are reverse-normalized at the end (a standard practice).
Patching, a widely successful technique in time-series modeling, is also used here. Each univariate sequence is split into n patches of size pl.
The TTM backbone module applies Adaptive Patching and projects the patches from size pl to hf. The TTM backbone is the heart of TTM, and we'll explain it in detail later.
The TTM decoder has the same architecture as the TTM backbone — but it’s much smaller, with 80% fewer parameters.
The Forecast linear head contains 1 fully-connected layer and produces the final forecasts (which are then reverse-normalized).
MSE loss is calculated over the forecast horizon fl — a schematic sketch of these steps is shown below.
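Here is a schematic sketch of the pretraining steps just described (per-series normalization, patching, backbone + decoder, linear head, reverse normalization, MSE). The backbone, decoder, and head arguments are placeholders standing in for the modules described in the text — this is not the actual TTM code:

import torch.nn.functional as F

def pretrain_step(context, target, backbone, decoder, head, pl=64):
    """context: [batch, sl] univariate windows; target: [batch, fl] future values."""
    # 1. Normalize each series individually; keep the statistics for the end.
    mean = context.mean(dim=1, keepdim=True)
    std = context.std(dim=1, keepdim=True) + 1e-5
    x = (context - mean) / std

    # 2. Patching: split the context of length sl into n = sl // pl patches of size pl.
    patches = x.unfold(dimension=1, size=pl, step=pl)       # [batch, n, pl]

    # 3. TTM backbone (adaptive patching) followed by the smaller TTM decoder.
    hidden = decoder(backbone(patches))                      # [batch, n, hf]

    # 4. A single fully-connected forecast head maps hidden states to fl forecasts.
    forecast = head(hidden.flatten(start_dim=1))             # [batch, fl]

    # 5. Reverse-normalize and compute the MSE loss over the horizon fl.
    forecast = forecast * std + mean
    return F.mse_loss(forecast, target)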
Fine-Tuning
Here, the TTM backbone remains frozen and we only update the weights of the TTM decoder and Forecast linear head.
We can perform few-shot forecasting (training on only 5% of the training data) or full-shot forecasting (training on the whole dataset).
We can use multivariate datasets in the fine-tuning phase. In that case, channel-mixing gets enabled in the TTM decoder.
Optionally, we can also activate the exogenous mixing block (shown in Figure 1) to model future known variables.
We just described the Multi-level modeling method of TTM: the model learns temporal dynamics during pretraining in a channel-independent way, and cross-channel correlations between time series during fine-tuning.
TTM Backbone
The core component of TTM is the TTM Backbone — which enables Resolution Prefix Tuning and Adaptive Patching.
Let's zoom into this component to understand its functionality (displayed in Figure 2):
The embedding layer projects the patches from size pl to create the input embeddings of size hf.
Also, the Resolution Prefix Tuning module creates an embedding of size hf that represents the time-frequency/granularity — which is then concatenated with the input embedding (notice the n=n+1 operation in Figure 2).
The TTM backbone consists of levels — each containing TTM blocks. Figure 2 shows a TTM backbone with 3 levels, each with 2 TTM blocks.
The TTM block contains 3 sub-modules: the patch partition module, a vanilla TSMixer block, and a patch merging block:
The patch partition module increases the number of patches by a factor of K and decreases the patch length by the same factor. For example, in the first level, the input of size [c, n, hf] becomes [c, 4*n, hf//4].
The TSMixer block is applied to the transformed input.
Finally, the patch merging block reshapes the [c, 4*n, hf//4] input back to [c,n, hf].
Thus, given L levels (for i = 1 to L) we have K = 2^(L-i). In Figure 2 we have 3 TTM backbone levels, so:
K_level1 = 2^(3-1) = 4,
K_level2 = 2^(3-2) = 2,
K_level3 = 2^(3-3) = 1
Therefore, by using a different K at each level, mixing is applied over patches of different lengths. This is the Adaptive Patching process, which helps the model generalize to unseen data (a minimal sketch follows below).
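A minimal sketch of one TTM block with the partition/mix/merge reshapes described above. The mixer argument is an identity placeholder standing in for a vanilla TSMixer block — this only demonstrates the shape manipulation, not the actual implementation:

import torch
import torch.nn as nn

def ttm_block(x: torch.Tensor, mixer: nn.Module, K: int) -> torch.Tensor:
    """One TTM block: patch partition -> TSMixer -> patch merging. x: [c, n, hf]."""
    c, n, hf = x.shape
    # Patch partition: K times more (shorter) patches, [c, n, hf] -> [c, K*n, hf//K]
    x = x.reshape(c, n * K, hf // K)
    # Vanilla TSMixer block operates on the re-partitioned patches.
    x = mixer(x)
    # Patch merging: reshape back to the original [c, n, hf].
    return x.reshape(c, n, hf)

# Toy usage: level i of L=3 levels uses K = 2^(L-i); identity mixer for illustration.
x = torch.randn(7, 8, 64)
for K in [4, 2, 1]:
    x = ttm_block(x, mixer=nn.Identity(), K=K)
print(x.shape)   # torch.Size([7, 8, 64])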
Exogenous Mixer
If we have future known variables, we can activate the Exogenous Mixer. This module is shown in Figure 3, and its position in the TTM architecture is shown in Figure 1:
The Exogenous Mixer block is straightforward: when the future values of some time-series are known (y3 and y4 in Figure 3, green color), they are used to guide the predictions of the target variables (y1 and y2 in Figure 3, purple color).
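One way to picture this step is sketched below: the known future values are swapped into the corresponding channels of the initial forecasts, and a channel-mixing layer then refines all channels together. The function, the simple linear mixer, and the shapes are assumptions for illustration — not the paper's exact formulation:

import torch
import torch.nn as nn

def exogenous_mix(forecasts, future_exog, exog_idx, mixer: nn.Module):
    """forecasts: [c, fl] initial forecasts for all channels,
    future_exog: [c_exog, fl] known future values, exog_idx: their channel indices."""
    guided = forecasts.clone()
    guided[exog_idx] = future_exog             # swap in the known future values
    # Mix information across the channel dimension at every forecast step.
    return mixer(guided.transpose(0, 1)).transpose(0, 1)   # back to [c, fl]

# Toy usage: 5 channels, horizon 96; channels 3 and 4 are known in the future.
mixer = nn.Linear(5, 5)                        # stands in for a channel-mixing MLP
out = exogenous_mix(torch.randn(5, 96), torch.randn(2, 96), exog_idx=[3, 4], mixer=mixer)
print(out.shape)                               # torch.Size([5, 96])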
TTM Training Details and Datasets
The authors created 5 TTM versions with different parameter sizes, context lengths sl, and forecasting lengths fl. These are:
TTM-Base (TTM_B): 1M parameters, sl = 512, pl = 64
TTM-Enhanced (TTM_E): 4M parameters, sl = 1024, pl = 128
TTM-Advanced (TTM_A): 5M parameters, sl = 1536 and pl = 128
Quick-TTM (TTMQ): There are 2 variants here, with sl/fl = (512, 96) and (1024, 96)
The first 3 variants were pretrained using a subset of ∼1B samples from Monash and Libcity, using 6 A100 GPUs and taking 24-30 hours.
The quick version was pretrained on 244M samples of the Monash dataset, using 6 A100 GPUs in less than 8 hours.
There’s a configuration to use a shorter forecasting length for any of the above pretrained models - we show that in the project tutorial here: Project #4
Also, the authors used another dataset to evaluate the efficacy of the exogenous mixer block and investigate how much the performance improves by adding known future variables.
You can find more about these datasets in the original paper. Below are the training hyperparameters for the (512,96) variant:
pl(patch_length) = 64
number of backbone levels = 6
number of TTM blocks per level = 2
batch_size = 3K
epochs = 20
For more details about the training and fine-tuning configurations, please refer to the original paper.
Evaluation Benchmarks
TTM-quick vs SOTA models
Next, the authors evaluated both versions of TTM-quick (Zero-shot and 5% Few-shot) against other SOTA models, including foundation models. The evaluation metric was MSE. The results are shown in Table 1:
The results are quite impressive:
On average, Few-shot TTM surpassed all the other models.
Even Zero-shot TTM was able to surpass some models! Remember, Zero-shot TTM produces forecasts without being trained on these data.
TTM also surpassed GPT4TS, a new foundation time-series model introduced last year.
Apart from the winning TTM, the next highest-ranking models are GPT4TS, PatchTST, and TSMixer—all of which utilize patching. Recent research in time-series forecasting has shown that patching is a highly beneficial technique.
TTM-Advanced vs SOTA models
Next, the authors compare the enhanced TTM versions with the rest of the SOTA models. The results are shown in Table 2:
We notice that the TTM models, especially the larger TTM-A, achieve the best overall scores.
Surprisingly, the much larger GPT4TS (a recently introduced foundation model) achieves neither first nor second place on any of the datasets.
TTM vs foundation models
The following evaluation holds significant value. The authors evaluate TTM individually against foundation models, notably MOMENT, TimesFM, MOIRAI and Chronos.
The comparison results are displayed in Table 3 (MOIRAI, TimesFM) and Table 4 (Chronos, Lag-Llama). Notice that we only evaluate zero-shot performance (performance on unseen data):
We notice the following:
In both comparisons, TTM is the clear winner.
All TTM variants are extremely efficient, both in terms of performance and model size.
Effectiveness of Exogenous Variables
Real-world datasets often include exogenous variables — hence it makes sense to leverage them in a forecasting application.
The authors of TTM also investigated how TTM improves by using such variables (if applicable). Specifically, they compared Zero-Shot TTM, plain TTMQ, and TTM with channel-mixing (TTMQ-CM) that uses exogenous variables.
They also evaluated TSMixer with its channel-mixing variants and the other foundation models. The results are shown in Table 5:
We observe the following:
TTMQ-CM ranks first — meaning that exogenous variables indeed help the model.
The TSMixer variants that use channel-mixing properties come in second place.
Lastly, the other foundation models, even when fine-tuned, do not reach the performance of TTM.
Additional Remarks
It is evident that TTM surpasses the other models, both in performance and inference speed. Figure 5 below shows a high-level comparison of TTM with the other foundation models:
TTM Explainability
One weakness of neural-network-based models — and foundation models in particular — is the lack of explainability and interpretability.
TTM's design natively supports extracting meaningful insights about feature importance.
At the beginning of the article, we talked about Gated Attention, the mechanism TTM uses to weight the importance of its features. We can leverage this mechanism to compute an attention weight for each feature across channels — hence capturing feature importance.
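For instance, if we collected the gate (softmax) activations of a trained model — say via PyTorch forward hooks — averaging them over batches and time steps yields a per-channel importance score. The tensor below is random stand-in data and the feature names come from the bike-sharing dataset; this is only a sketch of the aggregation, not TTM's built-in API:

import torch

# Stand-in for gate activations collected from the gated-attention layers,
# shaped [batch, time, channels]; in practice these would come from forward hooks.
gates = torch.rand(256, 96, 7).softmax(dim=-1)

# Average over batches and time steps -> one importance score per input channel.
channel_importance = gates.mean(dim=(0, 1))
features = ["weathersit", "season", "holiday", "temp", "hum", "windspeed", "workingday"]
for name, score in zip(features, channel_importance):
    print(f"{name:12s} {score.item():.3f}")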
Figure 6 shows the channel attention map for the UCI’s bike sharing dataset (one of the datasets with exogenous variables where TTM was evaluated):
As expected, the attention map reveals that the model focuses on weather-related variables such as weathersit (a categorical variable with values Clear, Few clouds, Partly cloudy, Cloudy), season, and holiday to estimate bike-rental counts.
This capability, especially in a zero-shot foundation TS model, can be really helpful in many use cases.
Tiny Time Mixers In Practice
Find the full project tutorial on the AI Projects Folder here: Get Project #4
We can download the weights of the 512-96 and 1024-96 model versions from Hugging Face (TinyTimeMixerForPrediction is provided by IBM's tsfm_public library):
from tsfm_public.models.tinytimemixer import TinyTimeMixerForPrediction
zeroshot_model = TinyTimeMixerForPrediction.from_pretrained("ibm/TTM", revision='main')
finetune_forecast_model = TinyTimeMixerForPrediction.from_pretrained("ibm/TTM", revision='main', head_dropout=0.0, dropout=0.0, loss="mse")
Hence, we can use the familiar Trainer module from the transformers library to finetune TTM:
from transformers import Trainer

finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback, tracking_callback],
    optimizers=(optimizer, scheduler),
)
# Fine tune
finetune_forecast_trainer.train()
And then get predictions for our dataset:
predictions_test = finetune_forecast_trainer.predict(test_dataset)
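For reference, the fine-tuning snippet above relies on a few objects (training arguments, callbacks, optimizer, scheduler) that are created earlier in the tutorial. A minimal sketch of what that setup could look like — all values are illustrative assumptions, and tracking_callback is a logging helper defined in the tutorial itself:

from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from transformers import TrainingArguments, EarlyStoppingCallback

# Illustrative values only -- see the project tutorial for the actual configuration.
finetune_forecast_args = TrainingArguments(
    output_dir="./ttm_finetuned",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    evaluation_strategy="epoch",      # evaluate (and save) once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
)
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=10)

optimizer = AdamW(finetune_forecast_model.parameters(), lr=1e-3)
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       epochs=50, steps_per_epoch=len(train_dataset) // 64)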
Closing Remarks
Tiny Time Mixer (TTM) is a novel model that follows a different approach — and paves the way for smaller but still powerful models.
Specifically, TTM doesn't use Attention (or any Transformer-based machinery) and proves that we can still build a formidable TS foundation model.
The first time-series models with meta-learning capabilities that used only MLPs were N-BEATS and N-HiTS.
Recently, we have observed the same trend with NLP models: Mamba (state-space), xLSTM (RNN-based), and Hyena (CNN-based) are language models that are not Transformers, yet achieve impressive results on numerous benchmarks.
Let's see how this approach unfolds for time-series models as well — after all, foundation-model research is still new for time series!
Thank you for reading!
Horizon AI Forecast is a reader-supported newsletter, featuring occasional bonus posts for supporters. Your support through a paid subscription would be greatly appreciated.
References
[1] Ekambaram et al., Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series (April 2024)
[2] Ekambaram et al., TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting (June 2023)