Multi-label topic classification for scientific articles using transformer models. Implemented and compared 7+ architectures (BERT, RoBERTa, SciBERT, XLNet, T5) with ensemble methods, achieving an 85.8% micro F1-score on 6-topic multi-label classification.
Team Name: FSociety
Creators:
- Private Leaderboard Rank: 4
- Public Leaderboard Rank: 5
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling provides a way to assign identifying labels to research articles, which facilitates the recommendation and search process.
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
Note that a research article can have more than one topic (a small multi-hot encoding sketch follows the topic list). The research article abstracts and titles are sourced from the following 6 topics:
- Computer Science
- Physics
- Mathematics
- Statistics
- Quantitative Biology
- Quantitative Finance
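
Since an article can carry several of these six topics at once, the targets are naturally represented as multi-hot vectors. A minimal illustration; the helper name and label order are assumptions, not taken from the competition data:

```python
import numpy as np

TOPICS = ["Computer Science", "Physics", "Mathematics",
          "Statistics", "Quantitative Biology", "Quantitative Finance"]

def encode_topics(article_topics):
    """Multi-hot encode the set of topics assigned to one article."""
    return np.array([1.0 if t in article_topics else 0.0 for t in TOPICS],
                    dtype=np.float32)

# An article tagged with both Computer Science and Statistics:
print(encode_topics({"Computer Science", "Statistics"}))   # [1. 0. 0. 1. 0. 0.]
```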
**BERT**

A language representation model; BERT stands for Bidirectional Encoder Representations from Transformers and consists of a multi-layer bidirectional Transformer encoder stack.

Architectures used:
- Pooled outputs + Classification layer (see the sketch below)
- Sequence outputs + Spatial dropout + Mean & max pooling + Classification layer
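
A minimal sketch of the first variant (pooled outputs + classification layer) for the 6-topic multi-label setup. The dropout rate, max length, and the BCE loss are standard choices assumed here, not details taken from the repo.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertPooledClassifier(nn.Module):
    """bert-base-uncased pooled [CLS] output followed by a linear layer
    that emits one logit per topic (multi-label setup)."""
    def __init__(self, model_name="bert-base-uncased", num_labels=6, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(self.dropout(out.pooler_output))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["A paper title. The abstract text ..."], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
logits = BertPooledClassifier()(batch["input_ids"], batch["attention_mask"])
loss_fn = nn.BCEWithLogitsLoss()   # one sigmoid per topic for multi-hot targets
```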
**RoBERTa**

RoBERTa: A Robustly Optimized BERT Pretraining Approach. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

Architectures used:
- Pooled outputs (roberta-base) + Classification layer
- Pooled outputs (roberta-large) + Classification layer
- Dual input + Single head + Concatenation + Classification layer (see the sketch below)
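
One reading of the dual-input variant is that the title and the abstract are encoded separately by the same roberta-base encoder (a single head, i.e. shared weights) and the two pooled vectors are concatenated before the classification layer. A sketch under that assumption; the exact pooling and head details in the repo may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualInputRoberta(nn.Module):
    """Title and abstract go through the same roberta-base encoder (shared weights);
    the two pooled outputs are concatenated before the classification layer."""
    def __init__(self, model_name="roberta-base", num_labels=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def encode(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).pooler_output

    def forward(self, title, abstract):
        pooled = torch.cat([self.encode(**title), self.encode(**abstract)], dim=-1)
        return self.classifier(pooled)

tok = AutoTokenizer.from_pretrained("roberta-base")
title = tok(["A paper title"], padding=True, truncation=True, return_tensors="pt")
abstract = tok(["The abstract text ..."], padding=True, truncation=True,
               return_tensors="pt")
logits = DualInputRoberta()(title, abstract)
```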
**ALBERT**

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ALBERT shares parameters across its repeating layers, which results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, since the forward pass still iterates through the same number of (repeating) layers.

Architectures used:
- Pooled outputs (albert-base-v2) + Classification layer
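
A quick way to see the footprint difference is to compare parameter counts of the pretrained checkpoints (roughly 110M for bert-base vs. about 12M for albert-base-v2); this snippet only illustrates that claim and is not part of the training pipeline.

```python
from transformers import AutoModel

def n_params(model_name):
    """Total parameter count of a pretrained checkpoint."""
    return sum(p.numel() for p in AutoModel.from_pretrained(model_name).parameters())

# Cross-layer parameter sharing keeps ALBERT small in memory, even though a
# forward pass still runs through the same number of (shared) layers.
print(f"bert-base-uncased: {n_params('bert-base-uncased') / 1e6:.0f}M parameters")
print(f"albert-base-v2:    {n_params('albert-base-v2') / 1e6:.0f}M parameters")
```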
**Longformer**

Transformer-based models cannot process long sequences efficiently because their self-attention operation scales quadratically with the sequence length. To address this limitation, the Longformer introduces an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Architectures used:
- Pooled outputs (allenai/longformer-base-4096) + Classification layer
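
A minimal sketch of feeding a long title + abstract through allenai/longformer-base-4096 and taking its pooled output; putting global attention only on the first token is a common default assumed here, not a detail confirmed by the repo.

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Sliding-window attention keeps the cost linear in sequence length, so the
# whole title + abstract fits in one 4096-token window.
enc = tokenizer("A paper title. " + "A long abstract sentence. " * 200,
                truncation=True, max_length=4096, return_tensors="pt")
global_attention_mask = torch.zeros_like(enc["input_ids"])
global_attention_mask[:, 0] = 1   # global attention only on the first (<s>) token
out = model(**enc, global_attention_mask=global_attention_mask)
pooled = out.pooler_output        # fed to the classification layer in this variant
```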
**T5**

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g. "translate English to German:" for translation.

Architectures used:
- Complete text-to-text Transformer (encoder stack + decoder stack)
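
In a text-to-text framing, the multi-label targets can be rendered as text and generated by the decoder. The task prefix and the comma-separated target format below are illustrative assumptions, not the repo's exact formulation.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input: task prefix + title + abstract.  Target: the topic names as plain text.
source = ("classify topics: TITLE: Deep learning for galaxy morphology. "
          "ABSTRACT: We apply convolutional networks to survey images ...")
target = "Computer Science, Physics"

inputs = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy

# At inference, the generated text is parsed back into the six topic flags.
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```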
**XLNet**

XLNet is an extension of the Transformer-XL model, pre-trained with an autoregressive method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

Architectures used:
- Pooled outputs + Classification layer
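
XLNet's base model does not expose a CLS-style pooler, so this sketch pools by masked mean over the sequence output; the repo's "pooled outputs" for XLNet may be implemented differently (e.g. a last-token summary).

```python
import torch
import torch.nn as nn
from transformers import XLNetModel, XLNetTokenizerFast

class XLNetClassifier(nn.Module):
    """Masked mean pooling over XLNet's sequence output, then a per-topic linear layer."""
    def __init__(self, model_name="xlnet-base-cased", num_labels=6):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # ignore padding tokens
        return self.classifier(pooled)

tok = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
batch = tok(["A paper title. The abstract text ..."],
            padding=True, truncation=True, return_tensors="pt")
logits = XLNetClassifier()(batch["input_ids"], batch["attention_mask"])
```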
**SciBERT**

SciBERT is a BERT model trained on scientific text: papers from the Semantic Scholar corpus (semanticscholar.org), comprising 1.14M papers and 3.1B tokens.

Architectures used:
- Pooled outputs + Classification layer
- Sequence outputs + Spatial dropout + BiLSTM + Classification layer (first sketch below)
- Siamese-like architecture: Dual inputs (single head) + Pooled outputs + Avg pooling + Concatenation + Classification layer
- Siamese-like architecture: Dual inputs (single head) + Sequence outputs + Bi-GRU + Classification layer
- Dual inputs (dual head) + Sequence outputs + Avg pooling + Concatenation + Classification layer
- SciBERT embeddings + XGBoost (second sketch below)
- SciBERT embeddings + LightGBM
- SciBERT + XLNet
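
A sketch of the sequence-outputs + spatial dropout + BiLSTM variant; the LSTM hidden size and dropout rate are illustrative placeholders, not values from the repo.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SciBertBiLstmClassifier(nn.Module):
    """SciBERT sequence outputs + spatial dropout + BiLSTM + per-topic classification layer."""
    def __init__(self, model_name="allenai/scibert_scivocab_uncased",
                 num_labels=6, lstm_hidden=128, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.spatial_dropout = nn.Dropout2d(dropout)   # drops whole embedding channels
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        seq = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Spatial dropout expects a channels dimension: (batch, hidden, seq_len, 1).
        seq = self.spatial_dropout(seq.permute(0, 2, 1).unsqueeze(-1))
        seq = seq.squeeze(-1).permute(0, 2, 1)
        _, (h_n, _) = self.bilstm(seq)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)   # final forward + backward states
        return self.classifier(pooled)

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
batch = tok(["A paper title. The abstract text ..."], padding=True,
            truncation=True, max_length=512, return_tensors="pt")
logits = SciBertBiLstmClassifier()(batch["input_ids"], batch["attention_mask"])
```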
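
For the SciBERT embeddings + XGBoost variant, one common setup is to use frozen SciBERT [CLS] vectors as features and train one binary tree model per topic. A sketch under that assumption; the two placeholder rows stand in for the real training data, and the tree hyperparameters are illustrative.

```python
import numpy as np
import torch
from sklearn.multiclass import OneVsRestClassifier
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    """Frozen SciBERT [CLS] embeddings used as features for the tree model."""
    feats = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        feats.append(encoder(**enc).last_hidden_state[:, 0, :].numpy())
    return np.vstack(feats)

# Placeholder rows standing in for the real training set (texts + 6-column multi-hot labels).
X_train = ["A paper title. The abstract text ...", "Another title. Another abstract ..."]
Y_train = np.array([[1, 0, 0, 1, 0, 0],
                    [0, 1, 0, 0, 0, 0]])

# One binary XGBoost model per topic via one-vs-rest.
clf = OneVsRestClassifier(XGBClassifier(n_estimators=300, max_depth=6))
clf.fit(embed(X_train), Y_train)
probs = clf.predict_proba(embed(["A new title. A new abstract ..."]))   # shape (1, 6)
```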
| Model | Public LB f1-micro | Private LB f1-micro |
| --- | --- | --- |
| bert-base-uncased | 0.828077 | 0.827281 |
| albert-base-v2 | 0.824307 | 0.824409 |
| longformer-base-4096 | 0.833407 | 0.834856 |
| roberta-base | 0.810430 | 0.807715 |
| siamese-roberta-base | 0.831624 | 0.832628 |
| roberta-large | 0.823807 | 0.829286 |
| xlnet | 0.835541 | 0.837154 |
| t5-base | 0.824055 | 0.823033 |
| scibert | 0.845831 | 0.849557 |
| scibert + lgbm | 0.841710 | 0.845912 |
| scibert + xgboost | 0.844890 | 0.849602 |
| siamese-scibert + gru | 0.845310 | 0.853427 |
| scibert + bilstm | 0.846365 | 0.849017 |
| scibert-fft | 0.845831 | 0.849557 |
| avg-blend | 0.853915 | 0.857981 |
| weighted-avg-blend | 0.854491 | 0.858294 |
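
The last two rows are simple probability blends of the individual models. A sketch of the weighted-average blend; the weights and the 0.5 cut-off below are illustrative placeholders, not the values used for the submission.

```python
import numpy as np

def weighted_blend(prob_list, weights, threshold=0.5):
    """Weighted average of per-model sigmoid probabilities, then a 0/1 cut per topic."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_list)                       # (n_models, n_samples, 6)
    blended = np.tensordot(weights / weights.sum(), stacked, axes=1)
    return blended, (blended >= threshold).astype(int)

# Placeholder probabilities standing in for the real per-model test predictions.
rng = np.random.default_rng(0)
model_probs = [rng.random((4, 6)) for _ in range(3)]    # 3 models, 4 test rows, 6 topics
probs, labels = weighted_blend(model_probs, weights=[0.5, 0.3, 0.2])
```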