Turkish News Classification Project

This project provides two different models for classifying Turkish news articles into various categories such as sports, politics, economy, and more. It includes a classic machine learning baseline and a modern deep learning approach using a Transformer-based model.

Models

This repository contains two distinct models, each in its own directory.

1. `v1_baseline`: TF-IDF + Logistic Regression

This model serves as a simple and fast baseline for the text classification task.

Architecture: It uses a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert text into numerical features and a Logistic Regression classifier to perform the classification.
Technology: Built with scikit-learn, pandas, and nltk.
Performance: This model provides a solid starting point. It performs very well for distinct categories like "spor" (sports) but struggles with more nuanced or overlapping categories.
Details: You can find the training script, saved model, and dependencies in the v1_baseline/ directory.
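
The architecture described above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the contents of `train_baseline.py`: the example sentences, the `ekonomi` label, and the default hyperparameters are all assumptions made for the demo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples invented for illustration; this is NOT the project's dataset.
texts = [
    "Galatasaray derbide üç gol attı",
    "Fenerbahçe yeni teknik direktörünü açıkladı",
    "Milli takım maçı penaltılarla kazandı",
    "Merkez Bankası faiz kararını açıkladı",
    "Dolar kuru haftaya yükselişle başladı",
    "Enflasyon rakamları beklentilerin üzerinde geldi",
]
labels = ["spor", "spor", "spor", "ekonomi", "ekonomi", "ekonomi"]

# TF-IDF turns each text into a sparse weighted term vector;
# Logistic Regression then learns one linear decision function over those terms.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

query = ["Takım son dakikada gol attı"]
prediction = model.predict(query)[0]           # predicted category name
probabilities = model.predict_proba(query)[0]  # one probability per class

print(prediction, dict(zip(model.classes_, probabilities.round(3))))
```

Wrapping both steps in one pipeline keeps the vectorizer and classifier in sync. The project itself saves them as two separate files, so its training script may be organized differently.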

2. `v2_bert`: Fine-Tuned BERT Model

This is a more advanced model that leverages a pre-trained BERT model for higher accuracy and better semantic understanding.

Architecture: It uses the dbmdz/bert-base-turkish-cased model, a BERT model pre-trained on a large corpus of Turkish text. The model is then fine-tuned on the specific news classification dataset.
Technology: Built with Hugging Face Transformers, PyTorch, and datasets.
Performance: This model is expected to significantly outperform the baseline, as it can understand the context and semantics of the news text more effectively.
Details: The v2_bert/ directory contains everything needed to train the model, run predictions, and manage dependencies.
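
As a rough sketch of how the pre-trained checkpoint could be prepared for fine-tuning, the snippet below loads `dbmdz/bert-base-turkish-cased` with a classification head sized to the label set. The helper names (`build_label_maps`, `load_for_fine_tuning`) and the label names in the usage note are hypothetical; the actual logic lives in `train_bert_finetune.py` and may differ.

```python
def build_label_maps(labels):
    """Map sorted, de-duplicated label names to integer ids and back."""
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_for_fine_tuning(labels, checkpoint="dbmdz/bert-base-turkish-cased"):
    """Load the tokenizer and a BERT model with a fresh classification head.

    Hypothetical helper: downloads the checkpoint from the Hugging Face Hub
    on first use, so it needs network access.
    """
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),  # size of the new, randomly initialized head
        id2label=id2label,         # stored in the config so predictions show names
        label2id=label2id,
    )
    return tokenizer, model
```

For example, `tokenizer, model = load_for_fine_tuning(["spor", "ekonomi", "siyaset"])` would return a model with a three-way output layer. Storing `id2label` in the model config means later predictions report category names rather than generic `LABEL_0`-style ids.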

Project Structure

.
├── v1_baseline/
│   ├── train_baseline.py   # Script to train the TF-IDF + Logistic Regression model
│   ├── model/              # Saved baseline model
│   └── requirements.txt    # Dependencies for the baseline model
│
└── v2_bert/
    ├── train_bert_finetune.py  # Script to fine-tune the BERT model
    ├── predict_news.py         # Script to classify news with the BERT model
    ├── model/                  # Saved fine-tuned BERT model
    └── requirements.txt        # Dependencies for the BERT model

How to Use

First, clone the repository to your local machine. The commands in each subsection below assume you start from the repository root.

Running the Baseline Model (v1)

Navigate to the baseline directory:
```
cd v1_baseline
```
Install dependencies:
```
pip install -r requirements.txt
```
Run the training script:
```
python train_baseline.py
```
This trains the model and saves `baseline_model.pkl` and `vectorizer.pkl` to the `v1_baseline/model/` directory.
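
Once the two files exist, they could be loaded for prediction roughly as follows. This sketch assumes the files hold a fitted scikit-learn vectorizer and classifier and that `joblib.load` can read them. The function name `predict_category` is hypothetical and not part of the repository.

```python
from pathlib import Path

import joblib  # installed as a scikit-learn dependency


def predict_category(text, model_dir="model"):
    """Classify one news text with the saved baseline artifacts.

    Assumes model_dir contains vectorizer.pkl and baseline_model.pkl,
    as written by the training step described above.
    """
    model_dir = Path(model_dir)
    vectorizer = joblib.load(model_dir / "vectorizer.pkl")
    classifier = joblib.load(model_dir / "baseline_model.pkl")

    # The text must pass through the same vectorizer that was fitted in training.
    features = vectorizer.transform([text])
    return classifier.predict(features)[0]
```

Run from inside `v1_baseline/`, a call such as `predict_category("Takım son dakikada gol attı")` would return the predicted category name. Pickle-based files can execute arbitrary code when loaded, so only load model files you trust.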

Running the BERT Model (v2)

Navigate to the BERT directory:
```
cd v2_bert
```
Install dependencies:
```
pip install -r requirements.txt
```
Run the training script:
```
python train_bert_finetune.py
```
This will fine-tune the BERT model and save the best version to the v2_bert/model/fine_tuned_bert/ directory.
Make predictions: To classify new text, run the interactive prediction script:
```
python predict_news.py
```
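
For non-interactive use, the fine-tuned model could also be called from Python. The sketch below is an assumption about how that might look, not the contents of `predict_news.py`: it presumes the `model/fine_tuned_bert/` directory holds both the model and its tokenizer in Hugging Face format, and the helper names are hypothetical.

```python
def load_classifier(model_dir="model/fine_tuned_bert"):
    """Build a text-classification pipeline from the saved model directory."""
    # Imported lazily; requires transformers and PyTorch to be installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


def classify(text, classifier):
    """Return (label, score) for one text.

    `classifier` is any callable with the pipeline's output shape:
    a list containing one {"label": ..., "score": ...} dict per input.
    """
    # truncation=True cuts articles longer than the model's maximum input length.
    result = classifier(text, truncation=True)[0]
    return result["label"], result["score"]
```

Typical usage from inside `v2_bert/` would be `clf = load_classifier()` followed by `label, score = classify("Merkez Bankası faiz kararını açıkladı", clf)`. If the saved config does not include label names, the pipeline reports generic ids such as `LABEL_0` instead.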
