
Commit e8b8315

0.0.1. Hopefully.
* where it started from
* initial work 1
* snowstorm
* holidays
* a sound like thunder
* you never know
* Witcher S1E1 I read quite a few of the books, new show looks surprisingly historically accurate in appearance... by movie standards.
* Witcher S1E3
* Christmas Eve
* Christmas
* Early Morning Fog
* Verily I say onto thee
* skiing
* but I am the Chosen One
* sorta works a bit
* ensemble of doom
* ensemble v 1
* nvidia graphics update, better save quick!
* single iteration functional
* take the keys and see what happens!
* cold winds blow
* frenetic sprint to a shifting shambles
* finally, regressing a bit
* ittle bitty changes
* really should have pulled first...
* It works. In an awkward sorta way
* 0.0.1, hopefully
1 parent b26fe79 commit e8b8315

30 files changed, +5971 -2 lines changed

MANIFEST.in

Lines changed: 2 additions & 0 deletions
include autots/datasets/data/*.csv
include README.md LICENSE

README.md

Lines changed: 148 additions & 2 deletions
# AutoTS
Unstable prototype: version 0.0.1

### Project CATS (Catlin Automated Time Series)
(or maybe eventually: Clustered Automated Time Series)
#### Model Selection for Multiple Time Series

Simple package for comparing and predicting with open-source time series implementations.
For other time series needs, check out the package list here: https://github.com/MaxBenChrist/awesome_time_series_in_python

`pip install autots`
#### Requirements:
Python >= 3.5 (for typing), >= 3.6 for GluonTS
pandas
sklearn >= 0.20.0 (ColumnTransformer)
statsmodels
holidays

`pip install autots['additional models']`
#### Additional requirements (optional models):
fbprophet
fredapi (example datasets)
## Basic Use
Input data is expected to come in a 'long' format with three columns: Date (ideally already in pd.DateTime format), Value, and Series ID. The column name for each of these is passed to .fit(). For a single time series, series_id can be = None. (A minimal hand-built example of this long format is sketched right after the code block below.)

```
from autots.datasets import load_toy_daily
df_long = load_toy_daily()

from autots import AutoTS
model = AutoTS(forecast_length = 14, frequency = 'infer',
               prediction_interval = 0.9, ensemble = True, weighted = False,
               max_generations = 5, num_validations = 2, validation_method = 'even')
model = model.fit(df_long, date_col = 'date', value_col = 'value', id_col = 'series_id')

# Print the name of the best model
print(model.best_model['Model'].iloc[0])

prediction = model.predict()
# point forecasts dataframe
forecasts_df = prediction.forecast
# accuracy of all tried model results (not including cross validation)
model_results = model.main_results.model_results
```
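For illustration, a minimal hand-built version of the long format described above might look like the following sketch. The column values are invented; only the column names matter, and they match those passed to `.fit()`:

```
import pandas as pd

# One row per (date, series) observation
df_long = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-01', '2019-12-02', '2019-12-01', '2019-12-02']),
    'value': [10.0, 12.5, 200.0, 198.0],
    'series_id': ['series_a', 'series_a', 'series_b', 'series_b'],
})

# model.fit(df_long, date_col = 'date', value_col = 'value', id_col = 'series_id')
```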
## Underlying Process
AutoTS works in the following way at present (a purely illustrative sketch of this loop follows the list):
* It begins by taking long data and converting it to a wide dataframe with a DateTimeIndex
* An initial train/test split is generated, where the test set is the most recent data, of length forecast_length
* A random template of models is generated and tested on the initial train/test split
* Models consist of a pre-transformation step (fill-NA options, outlier removal options, etc.), an algorithm (ie ETS), and model parameters (trend, damped, etc.)
* The top models (selected by a combination of SMAPE, MAE, and RMSE) are recombined with random mutations for n_generations
* A handful of the best models from this process go to cross validation, where they are re-assessed on new train/test splits
* The best model in validation is selected as best_model and used in the .predict() method to generate forecasts
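The search loop above can be pictured with a small, self-contained toy. Everything here (the config keys, the helper names, the random scoring) is invented for illustration and is not the AutoTS internals:

```
import random

# Toy "template": each candidate is a (pre-transformation, algorithm, parameters) dict
def random_candidate():
    return {
        'fillna': random.choice(['ffill', 'mean', 'zero']),
        'model': random.choice(['ETS', 'LastValueNaive', 'GLM']),
        'trend': random.choice([True, False]),
    }

def mutate(candidate):
    child = dict(candidate)
    key = random.choice(list(child))
    child[key] = random_candidate()[key]  # randomly perturb one setting
    return child

def score(candidate):
    # Stand-in for "fit on train, forecast, score against test with SMAPE/MAE/RMSE"
    return random.random()

population = [random_candidate() for _ in range(20)]
for generation in range(5):          # analogous to max_generations
    ranked = sorted(population, key=score)
    survivors = ranked[:5]           # the top models
    population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

best = min(population, key=score)    # would then go on to cross validation
print(best)
```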
## Caveats and Advice

#### Short Training History
How much data is 'too little' depends on the seasonality and volatility of the data.
But less than half a year of daily data or less than two years of monthly data are both going to be tight.
Minimal training data most greatly impacts the ability to do proper cross validation. Set num_validations = 0 in such cases.
Since ensembles are based on the test dataset, it would also be wise to set ensemble = False if num_validations = 0.
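A minimal sketch of those two settings together, keeping the other arguments from the Basic Use example above:

```
from autots import AutoTS

# Very short history: skip cross validation and skip ensembling
model = AutoTS(forecast_length = 14, frequency = 'infer',
               num_validations = 0, ensemble = False)
```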
#### Too Much Training Data
Too much data is already handled to some extent by 'context_slicer' in the transformations, which tests using less training data.
That said, large datasets will be slower and more memory intensive. For high frequency data (say, hourly) it can often be advisable to roll that up to a higher level (daily, weekly, etc.).
Rollup can be accomplished by specifying frequency = your rollup frequency and then setting agg_func = 'sum', 'mean', or another appropriate statistic.
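For example, rolling hourly data up to daily might look like the sketch below. Exactly where agg_func is accepted is not shown in this README, so treat its placement in the constructor as an assumption:

```
from autots import AutoTS

# Assumption: agg_func is accepted alongside frequency to control the rollup statistic
model = AutoTS(forecast_length = 14, frequency = 'D', agg_func = 'mean')
```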
#### Lots of NaN in data
Various NaN filling techniques are tested in the transformation. Rolling up data to a lower frequency may also help deal with NaNs.
#### More than one preord regressor
'Preord' regressor stands for 'preordained' regressor, to make it clear this is data that will be known with high certainty about the future.
Such data about the future is rare; one example might be the number of stores that will be (planned to be) open on each given day in the future when forecasting sales.
Since many algorithms do not handle more than one regressor, only one is handled here. If you would like to use more than one,
manually select the best variable or use dimensionality reduction to reduce the features to one dimension (a sketch of the latter is given at the end of this section).
However, the model can handle quite a lot of parallel time series. Additional regressors can be passed through as additional time series to forecast.
The regression models here can utilize the information they provide to help improve forecast quality.
To prevent the forecast accuracy of these additional series from being weighted too heavily, input series weights that lower or remove their forecast accuracy from consideration.
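One way to collapse several future-known regressors into the single regressor the package accepts is dimensionality reduction. The sketch below uses scikit-learn PCA; the column names and values are invented for illustration:

```
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical future-known regressors, indexed by date
future_regressors = pd.DataFrame({
    'stores_open': [120, 121, 121, 119],
    'planned_promotions': [3, 0, 1, 2],
}, index=pd.date_range('2020-01-01', periods=4, freq='D'))

# Reduce to a single column, which can then be supplied as the one regressor
pca = PCA(n_components=1)
single_regressor = pd.Series(
    pca.fit_transform(future_regressors).ravel(),
    index=future_regressors.index,
)
```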
#### Categorical Data
Categorical data is handled, but it is handled poorly. For example, optimization metrics do not currently include any categorical accuracy metrics.
For categorical data that has a meaningful order (ie 'low', 'medium', 'high') it is best for the user to encode that data before passing it in,
thus properly capturing the relative sequence (ie 'low' = 1, 'medium' = 2, 'high' = 3).
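A small sketch of that ordinal encoding, done before building the long dataframe; the mapping follows the example given above:

```
import pandas as pd

ordering = {'low': 1, 'medium': 2, 'high': 3}

# Hypothetical ordered categorical values for one series
values = pd.Series(['low', 'high', 'medium', 'low'])
encoded = values.map(ordering)  # -> 1, 3, 2, 1
```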
#### Custom Metrics
Implementing new metrics is rather difficult. However, the internal 'Score' that compares models can easily be adjusted by passing through custom metric weights.
Higher weighting increases the importance of that metric.
`metric_weighting = {'smape_weighting' : 9, 'mae_weighting' : 1, 'rmse_weighting' : 5, 'containment_weighting' : 1, 'runtime_weighting' : 0.5}`
sMAPE is generally the most versatile across multiple series, but doesn't handle forecasts with lots of zeroes well.
Containment measures the percent of test data that falls between the upper and lower forecasts.
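This README shows the weights dict but not where it is supplied. Assuming it is accepted by the AutoTS constructor, usage might look like the following sketch:

```
from autots import AutoTS

metric_weighting = {'smape_weighting' : 9, 'mae_weighting' : 1, 'rmse_weighting' : 5,
                    'containment_weighting' : 1, 'runtime_weighting' : 0.5}

# Assumption: metric_weighting is a constructor argument (not confirmed here)
model = AutoTS(forecast_length = 14, metric_weighting = metric_weighting)
```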
## To-Do
* Smaller
    * Recombine best two of each model, if two or more present
    * Duplicates still seem to be occurring in the genetic template runs
    * Inf appearing in MAE and RMSE (possibly all NaN in test)
    * NA tolerance for test in simple_train_test_split
    * Relative/Absolute imports and reduce package reloading
    * User regressor to sklearn model regression_type
    * Import/export template
    * ARIMA + Detrend fails
* Things needing testing:
    * Confirm per_series weighting works properly
    * Passing in start dates - (test)
    * Different frequencies
    * Various verbose inputs
    * Test holidays on non-daily data
    * Handle categorical forecasts where forecast leaves known values
* Speed improvements, profiling, parallelization, and distributed options for general greater speed
* Generate list of functional frequencies, and improve usability on rarer frequencies
* Warning/handling if lots of NaN in most recent (test) part of data
* Figures: add option to output figures of train/test + forecast, other performance figures
* Input and output saved templates as .csv and .json
* 'Check Package' to check if optional model packages are installed
* Pre-clustering on many time series
* If all inputs are int, convert floats back to int
* Trim whitespace on string inputs
* Hierarchical correction (bottom-up to start with)
* Improved verbosity controls and options. Replace most 'print' with logging.
* Export as simpler code (as TPOT does)
* AIC metric, other accuracy metrics
* Analyze and return inaccuracy patterns (most inaccurate periods out, days of week, most inaccurate series)
* Use saved results to resume a search partway through
* Generally improved probabilistic forecasting
* Option to drop series which haven't had a value in last N days
* Option to change which metric is being used for model selection
* Use quantile of training data to provide upper/lower forecast for Last Value Naive (so upper forecast might be 95th percentile largest number)
* More thorough use of setting random seed
* For monthly data, account for number of days in month
* Option to run generations until generations no longer see improvement of at least X % over n generations
#### New Ensembles:
* best 3 (unique algorithms, not just variations)
* forecast distance 30/30/30
* best per series ensemble
* best point with best probabilistic containment

#### New models:
* Seasonal Naive
* Last Value + Drift Naive
* Simple Decomposition forecasting
* GluonTS Models
* Simulations
* Sklearn + TSFresh
* Sklearn + polynomial features
* Sktime
* Ta-lib
* tslearn
* pydlm
* Isotonic regression
* TPOT if it adds multioutput functionality

autots/__init__.py

Lines changed: 16 additions & 0 deletions
"""
Automated Time Series Model Selection for Python

https://github.com/winedarksea/AutoTS
"""
from autots.datasets import load_toy_daily
from autots.evaluator.auto_ts import AutoTS

__version__ = '0.0.1'

__all__ = ['load_toy_daily', 'AutoTS']

# import logging
# logger = logging.getLogger(__name__)
# logger.addHandler(logging.StreamHandler())
# logger.setLevel(logging.INFO)

autots/datasets/__init__.py

Lines changed: 6 additions & 0 deletions
"""
Tools for Importing Sample Data
"""
from autots.datasets._base import load_toy_daily

__all__ = ['load_toy_daily']

autots/datasets/_base.py

Lines changed: 16 additions & 0 deletions
from os.path import dirname, join
import numpy as np
import pandas as pd

def load_toy_daily():
    """
    4 series of sample daily data from late 2019
    Testing some basic missing and categorical features.
    """
    module_path = dirname(__file__)
    data_file_name = join(module_path, 'data', 'toy_daily.csv')

    df_long = pd.read_csv(data_file_name)
    df_long['date'] = pd.to_datetime(df_long['date'], infer_datetime_format = True)

    return df_long
