Description
I would like to compare the forecasting performance of TimeGPT with other time series forecasting methods (e.g., TimeLLM). However, these methods adopt a different evaluation strategy than TimeGPT does.
For example, with a test set length of 1000, an input length of 96, and a prediction horizon of 96:
The typical evaluation procedure for other models is as follows: first, feed rows 0–95 to predict 96–191; then feed rows 1–96 to predict 97–192; and so on, until the entire test set is covered.
Under this setup, metrics such as MAE are computed over all predictions. In other words, the output shape of such models is:
(test set length − input length − prediction length + 1, prediction length)
and evaluation metrics are then calculated on this output.
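The scheme described above can be sketched with NumPy, using a naive "repeat the last value" forecaster as a stand-in for any model (all names here are illustrative, not a real API):

```python
import numpy as np

T, L, H = 1000, 96, 96          # test length, input length, prediction horizon
series = np.sin(np.arange(T) / 10.0)  # toy stand-in for the real test set

n_windows = T - L - H + 1       # number of sliding windows (stride 1)
preds = np.empty((n_windows, H))
trues = np.empty((n_windows, H))
for i in range(n_windows):
    window = series[i : i + L]               # rows i .. i+L-1 as input
    preds[i] = window[-1]                    # naive forecast for the next H steps
    trues[i] = series[i + L : i + L + H]     # rows i+L .. i+L+H-1 as ground truth

mae = np.abs(preds - trues).mean()           # MAE over all predictions
print(preds.shape)                           # (809, 96) for these values
```

With T=1000, L=96, H=96 this yields 809 windows, matching the output shape given above.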
Therefore, to ensure a fair comparison, TimeGPT’s forecasting process needs to be adjusted so that it generates predictions under a rolling window evaluation scheme, aligning both the output shape and the metric computation with the above methods.
How can this be achieved with TimeGPT?
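For concreteness, the desired loop might look like the sketch below. The `forecast` function is a stub standing in for a per-window TimeGPT call (e.g., something along the lines of `NixtlaClient.forecast(df, h=H)`); it is replaced here with a naive forecaster so the sketch is self-contained, and all sizes and column names are assumptions:

```python
import numpy as np
import pandas as pd

def forecast(df: pd.DataFrame, h: int) -> np.ndarray:
    # Stub: repeat the last observed value. A real run would call TimeGPT here.
    return np.full(h, df["y"].iloc[-1])

T, L, H = 200, 24, 12  # smaller toy sizes for illustration
ds = pd.date_range("2024-01-01", periods=T, freq="h")
test = pd.DataFrame({"unique_id": "series_1", "ds": ds,
                     "y": np.sin(np.arange(T) / 5.0)})

rows = []
for i in range(T - L - H + 1):
    window = test.iloc[i : i + L]        # input of length L, sliding by 1
    rows.append(forecast(window, h=H))   # one H-step forecast per window

preds = np.vstack(rows)                  # shape: (T - L - H + 1, H)
print(preds.shape)
```

Calling the hosted API once per window this way can be slow and costly, so guidance on a built-in rolling/cross-validation mode would be preferable to a manual loop.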
Use case
No response