This readme file is an outcome of the CENG502 (Spring 2023) project for reproducing a paper without an implementation. See the CENG502 (Spring 2023) Project List for a complete list of all paper reproduction projects.
The paper was published at CIKM 2022. Its goal is to achieve better accuracy in multivariate time-series anomaly detection. Our goal was to train the model suggested in the paper and see the results for ourselves.
Anomaly detection in multivariate time series is a hot topic in deep learning. This paper proposes a unique idea for data normalization: rather than using a static mean and variance, as is done in similar papers, it suggests a dynamic normalization method for the test set, which increases the model's overall performance. Combining this with spatial correlation and temporal correlation modules gives top-notch performance on time-series anomaly detection.
The paper is divided into three parts: data normalization, temporal correlation, and spatial correlation.
Traditionally, the test, validation, and training data are all normalized before being fed into the network. This paper also adopts this technique, normalizing to mean 0 and variance 1, but only for the training and validation sets. The procedure for the test set is entirely different: there, the mean and variance are updated from the previous statistics. This allows us to avoid the phenomenon known as concept drift. Concept drift refers to the statistical properties of a target variable, or the relationships between features and the target variable, changing over time. It can cause huge distortions in an anomaly-detection platform: we might classify anomalies as normal data, or vice versa.
Figure 1: Illustration of the concept drifting phenomenon.
To avoid this, the running statistics are updated over time. We reconstruct the update here, assuming an exponentially weighted form with rate $\alpha$ (0.1 in our experiments):

$$\mu_t = (1-\alpha)\,\mu_{t-1} + \alpha\,x_t, \qquad \sigma_t^2 = (1-\alpha)\,\sigma_{t-1}^2 + \alpha\,(x_t - \mu_t)^2$$

where $x_t$ is the newest test point and $\mu_{t-1}$, $\sigma_{t-1}^2$ are the previous running mean and variance.
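A minimal Python sketch of this test-time update (the function name and the `eps` guard are our additions):

```python
import numpy as np

def streaming_normalize(x_t, mu_prev, var_prev, alpha=0.1, eps=1e-8):
    """Update the running statistics with the newest test point and normalize it."""
    mu_t = (1 - alpha) * mu_prev + alpha * x_t                   # exponentially weighted mean
    var_t = (1 - alpha) * var_prev + alpha * (x_t - mu_t) ** 2   # exponentially weighted variance
    x_norm = (x_t - mu_t) / np.sqrt(var_t + eps)                 # normalize with the updated stats
    return x_norm, mu_t, var_t
```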
This module uses the correlation between historical and future points. The paper suggests that simple MLP blocks, rather than complex architectures such as Transformers, give both better computation speed and better accuracy, so the temporal module uses only MLP blocks and no other structure.
This module uses a window of size $T$ ($T = 100$ in our experiments).
Figure 2: Illustration of the input temporal data.
The three input blocks are combined by the MLP block, which we reconstruct here in the standard two-layer form:

$$\mathbf{Y} = \mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{X} + \mathbf{b}_1) + \mathbf{b}_2$$

where $\mathbf{X}$ is the windowed input, $\sigma$ is the activation function, and $\mathbf{W}_i$, $\mathbf{b}_i$ are learnable weights and biases.
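A minimal PyTorch sketch of the temporal module; the hidden width and activation are our assumptions, since the paper only specifies plain MLP blocks:

```python
import torch
import torch.nn as nn

class TemporalMLP(nn.Module):
    """Temporal correlation module: a two-layer MLP applied along the time axis
    of each series independently."""
    def __init__(self, window_size=100, d_hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window_size, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, window_size),   # keep the per-series length unchanged
        )

    def forward(self, x):        # x: (batch, n_series, window_size)
        return self.net(x)
```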
Unlike the temporal correlation module, which works on each time series separately and therefore does not exploit the relations between the series, the spatial module extracts the dependencies between them. For this, the paper suggests a multi-head self-attention mechanism that imitates graph neural networks, a popular solution for this problem. Each row of the temporal correlation module's output is used as a vertex, and the message passing is done via the standard attention formula:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are linear projections of the vertex features and $d_k$ is the key dimension.
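A minimal PyTorch sketch of this spatial module using `nn.MultiheadAttention`; the head count is our assumption:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial correlation module: multi-head self-attention across series,
    treating each series (row) as a vertex."""
    def __init__(self, d_model=100, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):            # x: (batch, n_series, d_model)
        out, _ = self.attn(x, x, x)  # self-attention = message passing between vertices
        return out
```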
The paper also uses a final output reshaping module, a simple MLP used to change the dimension of the output.
The paper suggests using the Frobenius norm for the loss. The Frobenius norm is essentially the L2 norm, i.e., the length of a vector, so the Frobenius loss is the length of the difference between prediction and target. The paper suggests applying a threshold to this loss: if the threshold is exceeded, the model flags the data as an anomaly.
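A minimal sketch of this decision rule (the threshold value 1.85 is what we found empirically; see the notes below):

```python
import torch

def is_anomaly(pred, target, threshold=1.85):
    """Flag a window as anomalous when the Frobenius norm of the
    prediction error exceeds the threshold."""
    score = torch.linalg.norm(pred - target)   # Frobenius norm of the error
    return score.item() > threshold
```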
- The threshold for the Frobenius norm is not specified in the paper, so we tried many thresholds; the best performance was around 1.85.
- We followed every constant given in the paper for anomaly detection: the window length is 100, the number of future elements to predict is 1, the α used in normalization is 0.1, and so on.
- We moved the windows with a stride of 5 rather than the traditional 1 for testing.
- We could only test on the SMAP dataset.
- We did not use CUDA for training. This is not mentioned in the paper, but judging from the timing statistics, it seems the authors did not use it either; therefore, to get the most comparable results, we also did not use it.
In accordance with Section 2.1.2 of the paper, the authors divide the dataset into windows. Notably, Figure 4 in the article illustrates that these windows do not overlap. However, the equation provided for calculating the number of targets, which we reconstruct as

$$N = \left\lfloor \frac{L - T - \tau}{s} \right\rfloor + 1$$

(with data length $L$, window size $T$, horizon $\tau$, and stride $s$), allows overlapping windows as well. To handle both cases we implemented a function `windowed_Set()`, which accepts the input parameters `original_data`, `window_size`, `shifting`, and `horizon`. Using `windowed_Set`, one can manipulate the data in the following ways:
- `window_size`: the desired length of the window.
- `shifting`: the number of steps the window jumps, akin to the stride used in convolution.
- `horizon`: the number of predicted points within the upcoming windows.
An illustration of this function's behavior can be seen in Figure 3 below, followed by a minimal sketch of the function.
Figure 3: Illustration for windowed_Set function.
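A minimal NumPy sketch of `windowed_Set`, under the parameter semantics listed above:

```python
import numpy as np

def windowed_Set(original_data, window_size=100, shifting=5, horizon=1):
    """Slice a (length, n_series) array into windows and their future targets.

    `shifting` acts like a stride; `horizon` is how many future points
    form each target."""
    xs, ys = [], []
    last_start = len(original_data) - window_size - horizon
    for start in range(0, last_start + 1, shifting):
        end = start + window_size
        xs.append(original_data[start:end])          # input window
        ys.append(original_data[end:end + horizon])  # points to predict
    return np.stack(xs), np.stack(ys)
```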
In each iteration, the paper suggests taking a window of data (window length 100 in this case) and shifting it by 1 during training; because of performance issues, we shifted the window by 5 instead. We then fed the data into the temporal correlation module built from MLPs. Since we used PyTorch this was an easy process, and we implemented exactly what the paper suggests. Following temporal correlation, we passed its output to the spatial correlation module, which is also fully implementable in PyTorch, so there are no changes from the setup given in the paper. After this, we simply used an output module, a basic MLP, which was likewise implemented without problems. All of the setup requirements for the neural networks are given in the following figure.
Figure 4: The chart for parameters in the neural network
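Putting the pieces together, a minimal sketch of the full pipeline, reusing the `TemporalMLP` and `SpatialAttention` sketches above (layer sizes are our assumptions):

```python
import torch.nn as nn

class Marina(nn.Module):
    """Temporal MLP -> spatial attention -> output MLP, predicting the next
    `horizon` points of every series from a window of length `window_size`."""
    def __init__(self, window_size=100, horizon=1, n_heads=4):
        super().__init__()
        self.temporal = TemporalMLP(window_size=window_size)
        self.spatial = SpatialAttention(d_model=window_size, n_heads=n_heads)
        self.output = nn.Linear(window_size, horizon)   # output reshaping module

    def forward(self, x):             # x: (batch, n_series, window_size)
        h = self.temporal(x)
        h = self.spatial(h)
        return self.output(h)         # (batch, n_series, horizon)
```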
For the tests we used the Frobenius norm (L2 norm) as suggested, without any changes.
Directory structure:

```
├── SMAP MSL/
│   ├── data/data/
│   │   ├── 2018-05-19_15.00.10/
│   │   │   ├── test/
│   │   │   └── train/
│   └── labeled_anomalies.csv
├── Marina.ipynb
├── README.md
└── marina.pdf
```

- One should use `Marina.ipynb` for training and testing the data.
In this section, we apply the model to the forecasting task.
We created the model and trained it for 30 epochs as mentioned in the paper. We then used Optuna to optimize hyperparameters such as the number of layers and the batch size. The loss curves can be seen in the figure below.
Figure 5: Training and validation loss for the ETTh1 dataset
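As noted above, we tuned the number of layers and the batch size with Optuna. A minimal sketch, where `build_and_train` stands for a hypothetical helper that trains the model for 30 epochs and returns the validation loss:

```python
import optuna

def objective(trial):
    # Search over the hyperparameters we tuned in practice.
    n_layers = trial.suggest_int("n_layers", 1, 4)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
    return build_and_train(n_layers=n_layers, batch_size=batch_size)  # hypothetical helper

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```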
Then we calculated the MSE and MAE loss values for this dataset with the model. The results can be seen in the table below.
Table 1: Metric scores for different models and ours on the ETTh1 dataset
The following chart shows the start and end indices of the anomaly sequences in SMAP's first five entries.
Figure 6: The anomaly sequences of the SMAP dataset
The following chart shows the anomaly sequences found by our model.
Figure 7: The anomaly sequences found by our model
As can be seen, our model's accuracy ranges from 0.2 to 0.6, with a mean accuracy of about 0.4 overall.
[1] Xie, J., Cui, Y., Huang, F., Liu, C., & Zheng, K. (2022). MARINA: An MLP-Attention Model for Multivariate Time-Series Analysis. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22).
Furkan Bahçeli, [email protected]
Batuhan Tosyalı, [email protected]