11 Comments
Konrad Banachewicz:

Love the analysis, got me thinking though: is there any way we can check whether the Kaggle data was used in training? For a regular LLM, one way would be to try to trick the model into generating specific content (like the opening paragraph of a book we suspect was used), but I can't think of a reasonable analogy for time series.

Nikos Kafritsas:

Thank you, Konrad! The authors clearly list in the paper the datasets they used. This one was not among them, and one of the authors saw my benchmark (he would have pointed it out if there had been any leakage).

I tried a trick in the past with the first generation of TSFMs, and it often worked: select a dataset and compute the ratio MSE(dataset_predictions) / MSE(dataset+noise_predictions). If the ratio drops significantly below 1, that's an indication of data leakage. I repeated this test recently with MOIRAI-2, and it seems it no longer works. I suspect this is because TSFMs use extensive data augmentation, sampling, subsetting, etc., so the original time series may never be used directly during pretraining.
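
For reference, a minimal sketch of that ratio test, assuming a generic `model.predict(context, horizon)` forecasting interface rather than any specific library's API:

```python
import numpy as np

def leakage_ratio(model, series, context_len, horizon, noise_scale=0.1, seed=0):
    """Memorization heuristic: compare forecast error on the original
    context vs. a noise-perturbed copy. A ratio far below 1 suggests
    the model may have seen this exact series during pretraining."""
    rng = np.random.default_rng(seed)
    context = series[:context_len]
    target = series[context_len:context_len + horizon]

    def mse(ctx):
        pred = np.asarray(model.predict(ctx, horizon))  # assumed interface
        return float(np.mean((pred - target) ** 2))

    noisy = context + rng.normal(0.0, noise_scale * np.std(context), size=context.shape)
    return mse(context) / mse(noisy)
```

A ratio well below 1 on the suspect dataset, but close to 1 on datasets the model definitely never saw, would be the leakage signal.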

This brings me to my next point. The Chronos variant trained only on synthetic data performed just slightly worse than the released version. Last week TabPFN-2.5 also showed decent scores despite being trained on synthetic data as well. Of course, these are benchmark and not real-world results, but is synthetic data the key to structured models?

Konrad Banachewicz:

1. The ratio is a nice one - hadn't thought of that.

2. Could very well be, although I always felt like synthetic data was just kicking the can down the road - at some point you're going to run out of it, and the tea does not get sweeter from stirring alone.

Nikos Kafritsas:

Certainly, that's true. But it's still amazing how much synthetic-only data improves an attention-based model on "structured" modalities like tabular data and time series. There might be a new paradigm here.

Graeme:

Hi Nikos, this is an excellent article in your current run on LLM-based models. In a previous post you mentioned that the number of observations dictates the utility of LLM-based forecasters (I can't remember the suggested length; 1,200 observations?). I wondered whether, given Chronos-2's cross-learning capability with panels of related data, it could work better with shorter series than other LLMs?

Nikos Kafritsas:

Thank you, Graeme! That observation of mine is a rule of thumb based on my experience so far. In general, increasing the context length to around 1,000 observations also improves performance. This is especially true for data with clear, multiple, or hidden seasonalities.

However, this isn’t always the case. If you have intermittent or irregular data where part of the signal is actually noise, increasing the context length might not help (I mention this in my previous article). You can also check my MOIRAI-2 tutorial, where I test it on the BOOM dataset (observability data). You’ll see that if you increase the context length, the model can actually perform worse:

https://aihorizonforecast.substack.com/p/using-moirai-2-to-outperform-statistical
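
If you want to check this on your own data, here is a minimal sketch of a context-length sweep, again assuming a generic `model.predict(context, horizon)` interface rather than any specific library's API:

```python
import numpy as np

def context_length_sweep(model, series, horizon, lengths=(256, 512, 1024, 2048)):
    """Hold out the last `horizon` points and measure MAE for several
    context lengths, to check whether a longer context actually helps."""
    target = series[-horizon:]
    scores = {}
    for length in lengths:
        context = series[-(horizon + length):-horizon]  # last `length` points before the holdout
        pred = np.asarray(model.predict(context, horizon))  # assumed interface
        scores[length] = float(np.mean(np.abs(pred - target)))
    return scores
```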

Shorter series are not a problem either. Cross-learning compensates by learning dependencies across multiple time series: if there is useful shared signal, cross-learning will capture it. The Chronos Benchmark II was built specifically to test this behaviour.

One more note: Chronos-2 is not an LLM (although it borrows some architectural elements from LLMs). LLMs learn the conditional probability of the next word/token given the input context, selecting the most probable token from a fixed vocabulary and minimizing a cross-entropy loss.
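
Schematically, that objective is the standard autoregressive cross-entropy over a fixed vocabulary V:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right), \qquad x_t \in V
$$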

I mention this because there are native LLM-based foundation models, like Time-LLM and LLMTime, that use an LLM as a backbone. These have not been successful, and they are generally not worth spending time testing ;)

Graeme:

Thank you for the thorough response, Nikos. Now that you've mentioned it, I'm not sure why I thought cross-learning would help with shorter series. I had largely neglected foundation models; this article has drawn me to them, as exogenous variables and cross-learning are critical in a current project, and I wasn't certain before that they were an option with foundation models, or enough of one to make them useful. Glad to see how much progress is happening. It's great to have someone like yourself providing material on this. Much appreciated.

Nikos Kafritsas:

Thank you, Graeme, for your kind words! I put together a mini-example to show how cross-learning can help with short time series. You can find it in Project 27 in the AI Projects folder (I'll soon publish a companion article with more details):

https://aihorizonforecast.substack.com/p/ai-projects

In short, cross-learning (essentially multivariate forecasting) is especially useful for cold-start forecasting. For example, imagine you have a dataset with 50 time series, but one of them has very little history, making accurate forecasting difficult. With cross-learning, Chronos-2 leverages information from the other time series in the dataset to improve predictions for the short-history series.
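
Here is a minimal sketch of that cold-start setup with a long-format panel; the final `pipeline.predict` call is hypothetical, so check the library's docs for the actual API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
LONG, SHORT = 500, 30  # observations per series; series_0 is the cold-start case

# Build a long-format panel of 50 related series sharing a daily pattern.
frames = []
for i in range(50):
    n = SHORT if i == 0 else LONG
    t = np.arange(n)
    frames.append(pd.DataFrame({
        "item_id": f"series_{i}",
        "timestamp": pd.date_range("2024-01-01", periods=n, freq="h"),
        "target": np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, n),
    }))
panel = pd.concat(frames, ignore_index=True)

# A cross-learning model forecasts the panel jointly, so series_0 can
# borrow the shared daily pattern from its 49 neighbours, e.g.:
# forecast = pipeline.predict(panel, prediction_length=24)  # hypothetical call
```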

Meenakshi NavamaniAvadaiappan:

Thanks for the good article 😊

Nikos Kafritsas:

Glad you liked it!

Ioannis Lazaridis:

I'm always glad to see such keen interest in research and in getting results right. The discussions here are very interesting and substantive.

I wish I knew the field myself; I would devote myself wholeheartedly to research and improvement.

Bravo, Nikolaos.
