Multi-label topic classification for scientific articles using transformer models. Implemented and compared 7+ architectures (BERT, RoBERTa, SciBERT, XLNet, T5) with ensemble methods, achieving an 85.8% micro F1-score on 6-topic multi-label classification.
Team Name: FSociety
Creators:
- Private Leaderboard Rank: 4
- Public Leaderboard Rank: 5
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling provides a way to assign identifying labels to research articles, which facilitates the recommendation and search process.
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
Note that a research article can have more than one topic (a small multi-hot encoding sketch follows the topic list). The research article abstracts and titles are sourced from the following 6 topics:
- Computer Science
- Physics
- Mathematics
- Statistics
- Quantitative Biology
- Quantitative Finance
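
Since an article can carry several of these six topics at once, the targets are naturally represented as multi-hot vectors. A minimal illustration; the helper name and label order are assumptions, not taken from the competition data:

```python
import numpy as np

TOPICS = ["Computer Science", "Physics", "Mathematics",
          "Statistics", "Quantitative Biology", "Quantitative Finance"]

def encode_topics(article_topics):
    """Multi-hot encode the set of topics assigned to one article."""
    return np.array([1.0 if t in article_topics else 0.0 for t in TOPICS],
                    dtype=np.float32)

# An article tagged with both Computer Science and Statistics:
print(encode_topics({"Computer Science", "Statistics"}))   # [1. 0. 0. 1. 0. 0.]
```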
**BERT**

A language representation model; BERT stands for Bidirectional Encoder Representations from Transformers and consists of a multi-layer bidirectional Transformer encoder stack.

Architectures used:
- Pooled outputs + Classification layer (see the sketch below)
- Sequence outputs + Spatial dropout + Mean & max pooling + Classification layer
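
A minimal sketch of the first variant (pooled outputs + classification layer) for the 6-topic multi-label setup. The dropout rate, max length, and the BCE loss are standard choices assumed here, not details taken from the repo.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertPooledClassifier(nn.Module):
    """bert-base-uncased pooled [CLS] output followed by a linear layer
    that emits one logit per topic (multi-label setup)."""
    def __init__(self, model_name="bert-base-uncased", num_labels=6, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(self.dropout(out.pooler_output))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["A paper title. The abstract text ..."], padding=True,
                  truncation=True, max_length=512, return_tensors="pt")
logits = BertPooledClassifier()(batch["input_ids"], batch["attention_mask"])
loss_fn = nn.BCEWithLogitsLoss()   # one sigmoid per topic for multi-hot targets
```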
**RoBERTa**

RoBERTa: A Robustly Optimized BERT Pretraining Approach. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

Architectures used:
- Pooled outputs (roberta-base) + Classification layer
- Pooled outputs (roberta-large) + Classification layer
- Dual input + Single head + Concatenation + Classification layer (see the sketch below)
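
One reading of the dual-input variant is that the title and the abstract are encoded separately by the same roberta-base encoder (a single head, i.e. shared weights) and the two pooled vectors are concatenated before the classification layer. A sketch under that assumption; the exact pooling and head details in the repo may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualInputRoberta(nn.Module):
    """Title and abstract go through the same roberta-base encoder (shared weights);
    the two pooled outputs are concatenated before the classification layer."""
    def __init__(self, model_name="roberta-base", num_labels=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def encode(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).pooler_output

    def forward(self, title, abstract):
        pooled = torch.cat([self.encode(**title), self.encode(**abstract)], dim=-1)
        return self.classifier(pooled)

tok = AutoTokenizer.from_pretrained("roberta-base")
title = tok(["A paper title"], padding=True, truncation=True, return_tensors="pt")
abstract = tok(["The abstract text ..."], padding=True, truncation=True,
               return_tensors="pt")
logits = DualInputRoberta()(title, abstract)
```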
**ALBERT**

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ALBERT shares parameters across its repeating layers, which results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, since the forward pass still iterates through the same number of (repeating) layers.

Architectures used:
- Pooled outputs (albert-base-v2) + Classification layer
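
A quick way to see the footprint difference is to compare parameter counts of the pretrained checkpoints (roughly 110M for bert-base vs. about 12M for albert-base-v2); this snippet only illustrates that claim and is not part of the training pipeline.

```python
from transformers import AutoModel

def n_params(model_name):
    """Total parameter count of a pretrained checkpoint."""
    return sum(p.numel() for p in AutoModel.from_pretrained(model_name).parameters())

# Cross-layer parameter sharing keeps ALBERT small in memory, even though a
# forward pass still runs through the same number of (shared) layers.
print(f"bert-base-uncased: {n_params('bert-base-uncased') / 1e6:.0f}M parameters")
print(f"albert-base-v2:    {n_params('albert-base-v2') / 1e6:.0f}M parameters")
```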
**Longformer**

Transformer-based models cannot process long sequences efficiently because their self-attention operation scales quadratically with the sequence length. To address this limitation, the Longformer introduces an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Architectures used:
- Pooled outputs (allenai/longformer-base-4096) + Classification layer
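
A minimal sketch of feeding a long title + abstract through allenai/longformer-base-4096 and taking its pooled output; putting global attention only on the first token is a common default assumed here, not a detail confirmed by the repo.

```python
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Sliding-window attention keeps the cost linear in sequence length, so the
# whole title + abstract fits in one 4096-token window.
enc = tokenizer("A paper title. " + "A long abstract sentence. " * 200,
                truncation=True, max_length=4096, return_tensors="pt")
global_attention_mask = torch.zeros_like(enc["input_ids"])
global_attention_mask[:, 0] = 1   # global attention only on the first (<s>) token
out = model(**enc, global_attention_mask=global_attention_mask)
pooled = out.pooler_output        # fed to the classification layer in this variant
```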
**T5**

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g. "translate English to German:" for translation.

Architectures used:
- Complete text-to-text Transformer (encoder stack + decoder stack)
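
In a text-to-text framing, the multi-label targets can be rendered as text and generated by the decoder. The task prefix and the comma-separated target format below are illustrative assumptions, not the repo's exact formulation.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input: task prefix + title + abstract.  Target: the topic names as plain text.
source = ("classify topics: TITLE: Deep learning for galaxy morphology. "
          "ABSTRACT: We apply convolutional networks to survey images ...")
target = "Computer Science, Physics"

inputs = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy

# At inference, the generated text is parsed back into the six topic flags.
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```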
**XLNet**

XLNet is an extension of the Transformer-XL model, pre-trained with an autoregressive method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

Architectures used:
- Pooled outputs + Classification layer
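
XLNet's base model does not expose a CLS-style pooler, so this sketch pools by masked mean over the sequence output; the repo's "pooled outputs" for XLNet may be implemented differently (e.g. a last-token summary).

```python
import torch
import torch.nn as nn
from transformers import XLNetModel, XLNetTokenizerFast

class XLNetClassifier(nn.Module):
    """Masked mean pooling over XLNet's sequence output, then a per-topic linear layer."""
    def __init__(self, model_name="xlnet-base-cased", num_labels=6):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # ignore padding tokens
        return self.classifier(pooled)

tok = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
batch = tok(["A paper title. The abstract text ..."],
            padding=True, truncation=True, return_tensors="pt")
logits = XLNetClassifier()(batch["input_ids"], batch["attention_mask"])
```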
**SciBERT**

SciBERT is a BERT model trained on scientific text: papers from the Semantic Scholar corpus (semanticscholar.org), comprising 1.14M papers and 3.1B tokens.

Architectures used:
- Pooled outputs + Classification layer
- Sequence outputs + Spatial dropout + BiLSTM + Classification layer (first sketch below)
- Siamese-like architecture: Dual inputs (single head) + Pooled outputs + Avg pooling + Concatenation + Classification layer
- Siamese-like architecture: Dual inputs (single head) + Sequence outputs + Bi-GRU + Classification layer
- Dual inputs (dual head) + Sequence outputs + Avg pooling + Concatenation + Classification layer
- SciBERT embeddings + XGBoost (second sketch below)
- SciBERT embeddings + LightGBM
- SciBERT + XLNet
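
A sketch of the sequence-outputs + spatial dropout + BiLSTM variant; the LSTM hidden size and dropout rate are illustrative placeholders, not values from the repo.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SciBertBiLstmClassifier(nn.Module):
    """SciBERT sequence outputs + spatial dropout + BiLSTM + per-topic classification layer."""
    def __init__(self, model_name="allenai/scibert_scivocab_uncased",
                 num_labels=6, lstm_hidden=128, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.spatial_dropout = nn.Dropout2d(dropout)   # drops whole embedding channels
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        seq = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Spatial dropout expects a channels dimension: (batch, hidden, seq_len, 1).
        seq = self.spatial_dropout(seq.permute(0, 2, 1).unsqueeze(-1))
        seq = seq.squeeze(-1).permute(0, 2, 1)
        _, (h_n, _) = self.bilstm(seq)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)   # final forward + backward states
        return self.classifier(pooled)

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
batch = tok(["A paper title. The abstract text ..."], padding=True,
            truncation=True, max_length=512, return_tensors="pt")
logits = SciBertBiLstmClassifier()(batch["input_ids"], batch["attention_mask"])
```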
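
For the SciBERT embeddings + XGBoost variant, one common setup is to use frozen SciBERT [CLS] vectors as features and train one binary tree model per topic. A sketch under that assumption; the two placeholder rows stand in for the real training data, and the tree hyperparameters are illustrative.

```python
import numpy as np
import torch
from sklearn.multiclass import OneVsRestClassifier
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

@torch.no_grad()
def embed(texts, batch_size=16):
    """Frozen SciBERT [CLS] embeddings used as features for the tree model."""
    feats = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        feats.append(encoder(**enc).last_hidden_state[:, 0, :].numpy())
    return np.vstack(feats)

# Placeholder rows standing in for the real training set (texts + 6-column multi-hot labels).
X_train = ["A paper title. The abstract text ...", "Another title. Another abstract ..."]
Y_train = np.array([[1, 0, 0, 1, 0, 0],
                    [0, 1, 0, 0, 0, 0]])

# One binary XGBoost model per topic via one-vs-rest.
clf = OneVsRestClassifier(XGBClassifier(n_estimators=300, max_depth=6))
clf.fit(embed(X_train), Y_train)
probs = clf.predict_proba(embed(["A new title. A new abstract ..."]))   # shape (1, 6)
```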
| Model | Public LB f1-micro | Private LB f1-micro |
| --- | --- | --- |
| bert-base-uncased | 0.828077 | 0.827281 |
| albert-base-v2 | 0.824307 | 0.824409 |
| longformer-base-4096 | 0.833407 | 0.834856 |
| roberta-base | 0.810430 | 0.807715 |
| siamese-roberta-base | 0.831624 | 0.832628 |
| roberta-large | 0.823807 | 0.829286 |
| xlnet | 0.835541 | 0.837154 |
| t5-base | 0.824055 | 0.823033 |
| scibert | 0.845831 | 0.849557 |
| scibert + lgbm | 0.841710 | 0.845912 |
| scibert + xgboost | 0.844890 | 0.849602 |
| siamese-scibert + gru | 0.845310 | 0.853427 |
| scibert + bilstm | 0.846365 | 0.849017 |
| scibert-fft | 0.845831 | 0.849557 |
| avg-blend | 0.853915 | 0.857981 |
| weighted-avg-blend | 0.854491 | 0.858294 |
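
The last two rows are simple probability blends of the individual models. A sketch of the weighted-average blend; the weights and the 0.5 cut-off below are illustrative placeholders, not the values used for the submission.

```python
import numpy as np

def weighted_blend(prob_list, weights, threshold=0.5):
    """Weighted average of per-model sigmoid probabilities, then a 0/1 cut per topic."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_list)                       # (n_models, n_samples, 6)
    blended = np.tensordot(weights / weights.sum(), stacked, axes=1)
    return blended, (blended >= threshold).astype(int)

# Placeholder probabilities standing in for the real per-model test predictions.
rng = np.random.default_rng(0)
model_probs = [rng.random((4, 6)) for _ in range(3)]    # 3 models, 4 test rows, 6 topics
probs, labels = weighted_blend(model_probs, weights=[0.5, 0.3, 0.2])
```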