| layout | global |
|---|---|
| title | Time series analysis on surrogate data |
| categories | |
| navigation | |
| weight | 100 |
| show | true |
| skip-chapter-toc | true |

https://github.com/bellettif/sparkGeoTS
In a shell, from the usb/spark/ directory, enter:
usb/$ ./bin/spark-shell --master local[4] --jars ../timeseries/sparkgeots.jar --driver-memory 2G
Then copy and paste the following into the Spark shell:
import breeze.linalg._
import breeze.stats.distributions.Gaussian
import breeze.numerics.sqrt
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import main.scala.overlapping._
import containers._
import timeSeries._
implicit def signedDistMillis = (t1: TSInstant, t2: TSInstant) => (t2.timestamp.getMillis - t1.timestamp.getMillis).toDouble
In this tutorial we practice on artificially generated (surrogate) data and show that we can successfully identify the autoregressive model that produced it.
Let's generate some data with an autoregressive model of order 3, AR(3).
In such a model, X_t = A_1 * X_{t-1} + A_2 * X_{t-2} + A_3 * X_{t-3} + e_t, where e_t is i.i.d. noise.
We have 3 spatial dimensions (3 sensors or data feeds).
Let's generate 10000 samples (you can also try a million for instance).
Let's specify that there is one millisecond between each sample.
Let's have an overlap of 100 ms between partitions. The data in our partitions will overlap as follows:

    -----------------------------------
                             ---------------------------------

This overlap is necessary to estimate our models without shuffling data between nodes. With this setup, we can calibrate models of any order lower than 100 ms / 1 ms = 100.
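To make the overlap concrete, here is a minimal plain-Scala sketch of how sample indices could be split into overlapping partitions. It does not use Spark or sparkGeoTS: `overlappingRanges` is a hypothetical helper, and padding each partition backward only (so that neighbours share exactly `overlap` samples) is an assumption, not necessarily what the library does internally.

```scala
// Hypothetical helper, not part of sparkGeoTS: split sample indices
// [0, nSamples) into nPartitions contiguous blocks, then extend each block
// backward by `overlap` samples (clipped at the start of the series).
def overlappingRanges(nSamples: Int, nPartitions: Int, overlap: Int): Seq[(Int, Int)] = {
  val blockSize = math.ceil(nSamples.toDouble / nPartitions).toInt
  (0 until nPartitions).map { i =>
    val start = math.max(0, i * blockSize - overlap)   // inclusive
    val end   = math.min(nSamples, (i + 1) * blockSize) // exclusive
    (start, end)
  }
}

// Values from the text: 10000 samples at 1 ms spacing, 8 partitions, 100 ms overlap.
val ranges = overlappingRanges(10000, 8, 100)
ranges.foreach { case (s, e) => println(s"[$s, $e)") }
```

With this layout, every sample in the non-padded part of a partition has its 100 predecessors in the same partition, which is why lagged quantities can be computed locally.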
We choose to have 8 partitions.
We gather all of this information into the implicit val config, which is used in all the calls we make later on.
(1) We generate the coefficients of the model randomly and try to enforce causality.
We have the same amount of noise everywhere.
Let's generate the surrogate data.
We then put it in the overlapping data structure.
We can inspect the parameters and plot some data.
Note: If at this point the plot shows NaN values, the model we have randomly generated is numerically unstable. This can happen. If it does, go back to (1) and regenerate the coefficients of the model and the data.
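The generation step and the stability check can be sketched in plain Scala, without Spark or sparkGeoTS. All names below (`matVec`, `simulateAR`, `hasNonFinite`) are hypothetical. The sketch also departs from the tutorial in one respect: it scales the random coefficients so that the row sums of absolute values, added over the three lags, stay below 1. That is a sufficient (and conservative) condition for stability, used here only so the example cannot diverge.

```scala
import scala.util.Random

// Matrix-vector product for a dense matrix stored as an array of rows.
def matVec(a: Array[Array[Double]], x: Array[Double]): Array[Double] =
  a.map(row => row.zip(x).map { case (aij, xj) => aij * xj }.sum)

// Simulate X_t = A_1 X_{t-1} + ... + A_p X_{t-p} + noise with i.i.d. Gaussian noise.
def simulateAR(coeffs: Seq[Array[Array[Double]]], noiseStd: Double,
               nSamples: Int, rng: Random): Array[Array[Double]] = {
  val d = coeffs.head.length
  val p = coeffs.length
  val x = Array.fill(nSamples, d)(0.0)
  for (t <- 0 until nSamples) {
    val noise = Array.fill(d)(noiseStd * rng.nextGaussian())
    // Contributions of the available past samples (fewer than p at the start).
    val past = (1 to math.min(p, t)).map(k => matVec(coeffs(k - 1), x(t - k)))
    x(t) = past.foldLeft(noise)((acc, v) => acc.zip(v).map { case (a, b) => a + b })
  }
  x
}

// True if any value is NaN or infinite, the symptom of an unstable model.
def hasNonFinite(x: Array[Array[Double]]): Boolean =
  x.exists(_.exists(v => java.lang.Double.isNaN(v) || java.lang.Double.isInfinite(v)))

val rng = new Random(42)
val d = 3 // spatial dimensions (sensors)
val p = 3 // model order
// Assumption of this sketch: entries bounded by 0.9 / (p * d) guarantee stability.
val scale = 0.9 / (p * d)
val coeffs = Seq.fill(p)(Array.fill(d, d)(scale * (2.0 * rng.nextDouble() - 1.0)))
val data = simulateAR(coeffs, 1.0, 10000, rng)
println(s"non-finite values present: ${hasNonFinite(data)}")
```

Removing the scaling and drawing larger coefficients is an easy way to reproduce the unstable case described in the note.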
Let's take a look at the auto-correlation structure of the data we have generated. If that correlation drops to 0 after lag q, the data were most likely generated by an MA(q) model.
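As an illustration of the cutoff rule above, the following plain-Scala sketch computes the sample auto-correlation of a univariate series and applies it to a simulated MA(1) series. The function `autocorrelation` and the coefficient 0.8 are choices made for this example only; they are not the sparkGeoTS API.

```scala
import scala.util.Random

// Sample auto-correlation at lags 0..maxLag (biased estimator, normalised by n).
def autocorrelation(x: Array[Double], maxLag: Int): Array[Double] = {
  val n = x.length
  val mean = x.sum / n
  val centered = x.map(_ - mean)
  val c0 = centered.map(v => v * v).sum / n
  (0 to maxLag).map { k =>
    val ck = (k until n).map(t => centered(t) * centered(t - k)).sum / n
    ck / c0
  }.toArray
}

// MA(1) example: x_t = e_t + 0.8 * e_{t-1}, with i.i.d. standard Gaussian e_t.
val rng = new Random(7)
val n = 20000
val e = Array.fill(n + 1)(rng.nextGaussian())
val ma1 = Array.tabulate(n)(t => e(t + 1) + 0.8 * e(t))
val acf = autocorrelation(ma1, 5)
acf.zipWithIndex.foreach { case (r, k) => println(f"lag $k%d: $r%.3f") }
```

For this MA(1) series the theoretical auto-correlation is 0.8 / (1 + 0.8^2) at lag 1 and 0 at all higher lags, so the printed values should be close to zero from lag 2 onward.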
Let's take a look at the partial auto-correlation structure of the data we have generated. If that correlation drops to 0 after lag p, the data were most likely generated by an AR(p) model.
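One standard way to obtain partial auto-correlations from auto-correlations is the Durbin-Levinson recursion. The sketch below is plain Scala with a hypothetical function name, and it is not claimed to be how sparkGeoTS computes them. It takes the auto-correlations at lags 0..m as input and is checked on the theoretical auto-correlations of an AR(1) process, where rho_k = phi^k.

```scala
// Durbin-Levinson recursion: partial auto-correlations at lags 0..m
// from the auto-correlations rho(0..m), with rho(0) == 1.
def partialAutocorrelation(rho: Array[Double]): Array[Double] = {
  val maxLag = rho.length - 1
  val pacf = Array.fill(maxLag + 1)(0.0)
  pacf(0) = 1.0
  var prev = Array.empty[Double] // coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
  for (k <- 1 to maxLag) {
    val num = rho(k) - (1 until k).map(j => prev(j - 1) * rho(k - j)).sum
    val den = 1.0 - (1 until k).map(j => prev(j - 1) * rho(j)).sum
    val phiKK = num / den
    // phi_{k,j} = phi_{k-1,j} - phi_{k,k} * phi_{k-1,k-j} for j < k, and phi_{k,k} last.
    prev = Array.tabulate(k) { idx =>
      if (idx == k - 1) phiKK else prev(idx) - phiKK * prev(k - 2 - idx)
    }
    pacf(k) = phiKK
  }
  pacf
}

// AR(1) with coefficient 0.6: theoretical auto-correlations are 0.6^k.
val phi = 0.6
val rho = Array.tabulate(5)(k => math.pow(phi, k))
val pacf = partialAutocorrelation(rho)
pacf.zipWithIndex.foreach { case (v, k) => println(f"lag $k%d: $v%.3f") }
```

The partial auto-correlation equals the AR coefficient at lag 1 and vanishes afterwards, which is the AR(p) signature described above with p = 1.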
- Let's fit a univariate AR model on each spatial dimension of the data.
- Now we compute the prediction residuals (in sample).
- We also compute the variance-covariance matrix of the residuals.
- Let's inspect the result:
- Let's fit a multivariate AR model, taking the information of all sensors jointly into account.
- We compute the prediction residuals.
- Let's take a look at the variance-covariance matrix of the residuals. It should be closer to a diagonal matrix than in the univariate analysis.
- Let's compare the errors between univariate and multivariate models. We compute the mean squared errors, which sum all squared errors along the sensing dimensions. The average error we make when predicting is therefore obtained by dividing by the number of sensors and taking the square root of the result.
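The residual statistics used in the steps above can be sketched in plain Scala as follows. The function names and the small residual array are hypothetical and exist only to make the arithmetic visible; in the tutorial the residuals come from the fitted models.

```scala
// Variance-covariance matrix of residuals stored as nSamples rows of d values
// (normalised by n).
def covariance(res: Array[Array[Double]]): Array[Array[Double]] = {
  val n = res.length
  val d = res.head.length
  val mean = Array.tabulate(d)(j => res.map(_(j)).sum / n)
  Array.tabulate(d, d) { (i, j) =>
    res.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / n
  }
}

// Mean over samples of the squared errors summed along the sensing dimensions.
def summedMSE(res: Array[Array[Double]]): Double =
  res.map(r => r.map(e => e * e).sum).sum / res.length

// Average prediction error per sensor: divide by the number of sensors, take the square root.
def perSensorRMSE(res: Array[Array[Double]]): Double =
  math.sqrt(summedMSE(res) / res.head.length)

// Made-up residuals for 4 samples and 3 sensors (illustration only).
val residuals = Array(
  Array( 1.0,  0.0, -1.0),
  Array(-1.0,  2.0,  1.0),
  Array( 1.0, -2.0, -1.0),
  Array(-1.0,  0.0,  1.0))

val cov = covariance(residuals)
cov.foreach(row => println(row.map(v => f"$v%6.2f").mkString(" ")))
println(s"summed MSE = ${summedMSE(residuals)}, per-sensor RMSE = ${perSensorRMSE(residuals)}")
```

When the residuals have zero mean, as in this example, the summed mean squared error equals the trace of the variance-covariance matrix, so the two diagnostics are consistent with each other.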