
Inconsistencies in the Husformer Paper and Code #10

@Bluemax666


Hello,

I have studied the Husformer paper and the associated code, and I've noticed some inconsistencies. If you're reading this, you might be attempting to reproduce the paper's results.

I also tried, and my conclusion is that I was not able to observe any significant score difference between Husformer and simpler neural network architectures, such as an MLP or LSTM, on the CogLoad, MOCAS, and WESAD datasets. I cannot post the specific code I used for my reproduction attempt, because the original Husformer code does not appear to carry an open-source license (e.g., MIT) that would permit modification and redistribution of derivative works.

However, here are the points I'd like to highlight regarding the paper and the available code:

In the paper (https://arxiv.org/pdf/2209.15182):

  • The formula for accuracy in multi-class classification appears incorrect. Section IV.D states that the accuracy for the i-th class is accuracy_i = (TP_i + TN_i) / length_i, where length_i is "the number of total samples in the i-th class." A simple example illustrates the issue: imagine a classifier for 3 classes (A, B, C), with 5 samples per class. If the classifier predicts everything correctly, then TP_A = 5 and TN_A = 10 (the 10 samples of classes B and C, all correctly not assigned to A), so the formula gives accuracy_A = (TP_A + TN_A) / length_A = (5 + 10) / 5 = 3. However, an accuracy value should lie between 0 and 1; the denominator would presumably need to be the total number of samples (15 here) rather than only those of class i. A quick numerical check appears after this list.

  • The paper presents the framework as "end-to-end," but in Section III.F, Algorithm 1, the computations for the initial layers (temporal convolutions and positional encoding, steps 6-7) sit outside the main training loop (steps 8-18). Why is this the case if the model is truly end-to-end, where all parameters would be learned within the iterative training process? (A sketch of the usual end-to-end arrangement follows this list.)

  • At the end of Section IV.A, it's mentioned that the input data used in the experiments are 1-second segments. This raises several questions:
    -- The paper states multiple times that transformers are used to model "long-term interactions" between modalities. How can a 1-second window be considered "long-term" in this context?
    -- For the CogLoad dataset (Table IV), given the sampling rates (mostly 1 Hz), a 1-second segment might correspond to an input vector with very few values: potentially 6 values if using HR, IBI, GSR, SKT, and 2 ACC channels, all at 1 Hz (see the quick size calculation after this list). How can there be such significant accuracy differences (Table VI, Section IV) between the methods tested on CogLoad with such a sparse input per segment?
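
To make the accuracy-formula issue concrete, here is a quick numerical check in plain Python (my own illustration, independent of the Husformer code) of the perfect 3-class classifier described in the first point; the paper's per-class formula yields 3.0 instead of 1.0:

```python
# Check of the per-class accuracy formula from Section IV.D:
# accuracy_i = (TP_i + TN_i) / length_i, with length_i = samples in class i.

labels = ["A"] * 5 + ["B"] * 5 + ["C"] * 5
preds = labels[:]  # a perfect classifier: every prediction is correct

for cls in ("A", "B", "C"):
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    tn = sum(1 for y, p in zip(labels, preds) if y != cls and p != cls)
    length = sum(1 for y in labels if y == cls)  # samples in class `cls` only
    print(cls, (tp + tn) / length)  # prints 3.0 -- outside the [0, 1] range
```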
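
Regarding the end-to-end point, below is a minimal PyTorch sketch (mine, not the authors' code; module names and sizes are arbitrary) of the usual arrangement, where the temporal convolution and positional encoding are submodules of the model and are therefore updated on every optimizer step, rather than being computed once outside the training loop:

```python
import torch
import torch.nn as nn

class TinyHusformerLike(nn.Module):
    """Illustrative only: conv front-end + encoder trained jointly."""
    def __init__(self, in_channels=6, d_model=32, num_classes=3, max_len=64):
        super().__init__()
        # The temporal convolution is a submodule, so it is updated every step.
        self.conv = nn.Conv1d(in_channels, d_model, kernel_size=3, padding=1)
        # Learned positional encoding (a fixed sinusoidal one would also live here).
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                         # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)          # (batch, time, d_model)
        h = h + self.pos[:, : h.size(1)]
        h = self.encoder(h).mean(dim=1)           # pool over time
        return self.head(h)                       # logits over classes

model = TinyHusformerLike()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 6, 16), torch.randint(0, 3, (8,))
for _ in range(2):  # conv and positional parameters update on each iteration
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```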
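
And the quick input-size calculation referenced in the CogLoad point; the channel list and the 1 Hz rates are my reading of Table IV, so treat them as assumptions:

```python
# Assumed CogLoad channels and sampling rates (my reading of Table IV).
rates_hz = {"HR": 1, "IBI": 1, "GSR": 1, "SKT": 1, "ACC_x": 1, "ACC_y": 1}
window_s = 1  # 1-second segments per Section IV.A
values_per_segment = sum(rate * window_s for rate in rates_hz.values())
print(values_per_segment)  # 6 values for the entire multimodal input
```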

In the code:

  • Section III.F of the paper (Equation 9) states that the network's output goes through a softmax function, and the paper clearly frames the tasks as classification problems. Therefore, the final output of the model should be a probability distribution over the classes.
    In the provided code (e.g., main-5.py), why is the neural network set to have an output dimension of 1? Furthermore, in eval_metrics.py, the metrics appear to be designed to receive scalar values, and no softmax operation seems to be applied before computing these metrics for classification evaluation. This contradicts the paper's description (a sketch of the expected classification head follows this list).

  • In data/cogload.pkl, the modality names are EEG, GSR, BVP, and POW when you load the data, but if you look at the values, they seem to correspond to the modalities IBI, GSR, HR, and ACC (the inspection snippet after this list shows one way to check this).

  • There are other problems with the code, but they mostly stem from the fact that it was taken from https://github.com/yaohungt/Multimodal-Transformer and not adapted much.
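
To illustrate the first point about the output layer, here is a minimal PyTorch sketch (my own; num_classes and the feature size are placeholders, not values from the repository) of the classification head that Equation 9 describes, contrasted with an output dimension of 1:

```python
import torch
import torch.nn as nn

num_classes = 3                      # hypothetical; e.g. 3 workload levels
features = torch.randn(8, 40)        # stand-in for the fused Husformer features

# What Equation 9 describes: one logit per class, then softmax.
head = nn.Linear(40, num_classes)
probs = torch.softmax(head(features), dim=-1)   # (8, 3), each row sums to 1
pred_class = probs.argmax(dim=-1)               # class decisions for the metrics

# What the repository appears to do: output_dim = 1 yields a single scalar
# per sample, which cannot represent a distribution over more than 2 classes.
scalar_head = nn.Linear(40, 1)
scalar_out = scalar_head(features)              # (8, 1), no softmax applied
```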
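
And a small snippet showing one way to inspect data/cogload.pkl and compare the stored modality names against the value ranges; I'm assuming the pickle deserializes to a dict-like object, so adjust if the actual structure differs:

```python
import pickle

with open("data/cogload.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    for name, values in data.items():          # e.g. EEG, GSR, BVP, POW
        print(name, getattr(values, "shape", None))
        # Eyeball a few raw numbers: IBI is typically ~0.5-1.2 s and HR
        # ~50-120 bpm, so the value ranges hint at which physiological
        # signal is actually stored under each name.
```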
