AV-Janatahack-Independence-Day-2020-ML-Hackathon


Multi-label topic classification for scientific articles using transformer models. Implemented and compared 7+ architectures (BERT, RoBERTa, SciBERT, XLNet, T5) with ensemble methods, achieving an 85.8% micro F1-score on 6-topic multi-label classification.

Team Name : FSociety
Creators :

Rank :
  • Private Leaderboard Rank: 4
  • Public Leaderboard Rank: 5

This repository contains the code implemented during the hackathon.

Problem Statement :

Topic Modeling for Research Articles

Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling provides a way to attach identifying labels to research articles, which facilitates the recommendation and search process.
Given the abstract and title for a set of research articles, predict the topics for each article in the test set. Note that a research article can have more than one topic, so this is a multi-label classification problem (see the sketch after the list below). The research article abstracts and titles are sourced from the following 6 topics:

  • Computer Science
  • Physics
  • Mathematics
  • Statistics
  • Quantitative Biology
  • Quantitative Finance
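Because an article can carry several of these topics at once, the targets are multi-hot vectors trained with an independent sigmoid per topic. The following is a minimal sketch of that setup; the tensors and the 0.5 threshold are illustrative assumptions, not the exact code from this repository:

```python
import torch
import torch.nn as nn

# The 6 topic labels; each article gets a 0/1 indicator per topic.
TOPICS = ["Computer Science", "Physics", "Mathematics",
          "Statistics", "Quantitative Biology", "Quantitative Finance"]

# Multi-label targets are multi-hot vectors, e.g. an article tagged
# both Computer Science and Statistics:
target = torch.tensor([[1., 0., 0., 1., 0., 0.]])

# A 6-way classification head trained with BCE-with-logits: each topic
# gets its own sigmoid, so several topics can be active at once.
logits = torch.randn(1, len(TOPICS))            # stand-in for model outputs
loss = nn.BCEWithLogitsLoss()(logits, target)

# At inference time, topics are assigned by thresholding the sigmoids
# (0.5 here; the threshold can be tuned for micro-F1).
preds = (torch.sigmoid(logits) > 0.5).int()
```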

Data :

Models :

  • BERT

    BERT (Bidirectional Encoder Representations from Transformers) is a language representation model built as a multi-layer bidirectional Transformer encoder stack.

  • RoBERTa

    RoBERTa: A Robustly Optimized BERT Pretraining Approach. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

  • ALBERT

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ALBERT shares (repeats) layers, which gives it a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it still has to iterate through the same number of (repeated) layers.

  • Longformer

    Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, the Longformer was introduced with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

  • T5

    T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, with each task converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a task-specific prefix to the input, e.g. "translate English to German: ..." for translation.

  • XLNet

    XLNet is an extension of the Transformer-XL model, pre-trained with an autoregressive method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence's factorization order.

  • SciBERT

    SciBERT is a BERT model pretrained on scientific text: papers from the semanticscholar.org corpus, totalling 1.14M papers and 3.1B tokens. A minimal fine-tuning sketch follows this list.
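Since SciBERT gave the strongest single-model scores here, the sketch below shows one way to load it for this task with the Hugging Face transformers library. The checkpoint name is the public allenai/scibert_scivocab_uncased release; the title/abstract sentence-pair input and the max length are assumptions, not necessarily the repository's exact settings:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # public SciBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=6,                                 # the 6 topics
    problem_type="multi_label_classification",    # sigmoid + BCE loss head
)

# Title and abstract passed as a sentence pair (illustrative input format).
enc = tokenizer(
    "A note on graph colourings",                   # dummy title
    "We study chromatic numbers of sparse graphs.",  # dummy abstract
    truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    probs = torch.sigmoid(model(**enc).logits)    # one probability per topic
```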

Performance of implemented models :

| Model | Public LB f1-micro | Private LB f1-micro |
| --- | --- | --- |
| bert-base-uncased | 0.828077 | 0.827281 |
| albert-base-v2 | 0.824307 | 0.824409 |
| longformer-base-4096 | 0.833407 | 0.834856 |
| roberta-base | 0.810430 | 0.807715 |
| siamese-roberta-base | 0.831624 | 0.832628 |
| roberta-large | 0.823807 | 0.829286 |
| xlnet | 0.835541 | 0.837154 |
| t5-base | 0.824055 | 0.823033 |
| scibert | 0.845831 | 0.849557 |
| scibert + lgbm | 0.841710 | 0.845912 |
| scibert + xgboost | 0.844890 | 0.849602 |
| siamese-scibert + gru | 0.845310 | 0.853427 |
| scibert + bilstm | 0.846365 | 0.849017 |
| scibert-fft | 0.845831 | 0.849557 |
| avg-blend | 0.853915 | 0.857981 |
| weighted-avg-blend | 0.854491 | 0.858294 |
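The two blend rows at the bottom average the per-topic probabilities of the individual models, with the weighted variant giving stronger models more say. A minimal sketch of that kind of blending follows; the model set and weights are placeholders, not the ones used for the leaderboard submissions:

```python
import numpy as np

# Per-model predicted probabilities, shape (n_articles, 6); placeholder
# arrays stand in for the probabilities produced by each fine-tuned model.
probs = {
    "scibert":    np.random.rand(5, 6),
    "xlnet":      np.random.rand(5, 6),
    "longformer": np.random.rand(5, 6),
}

# avg-blend: simple mean of the model probabilities.
avg_blend = np.mean(list(probs.values()), axis=0)

# weighted-avg-blend: weights roughly tracking single-model LB scores
# (illustrative values, not the actual submission weights).
weights = {"scibert": 0.5, "xlnet": 0.3, "longformer": 0.2}
weighted_blend = sum(w * probs[name] for name, w in weights.items())

# Final multi-label predictions by thresholding the blended probabilities.
final_labels = (weighted_blend > 0.5).astype(int)
```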
