This project focuses on extracting named entities from Telegram messages, tokenizing the data, and preparing it for further analysis. It utilizes various Python scripts, notebooks, and testing frameworks to facilitate efficient data processing.
```
├── .vscode/                    # Visual Studio Code settings
│   └── settings.json
├── .github/                    # GitHub workflows
│   └── workflows/
│       └── unittests.yml
├── .gitignore                  # Files ignored by version control
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── src/                        # Source code directory
│   └── __init__.py
├── notebooks/                  # Jupyter notebooks for analysis
│   ├── tokenization.ipynb      # Notebook for tokenizing text data
│   ├── PreprocessingDataLabelingStart.ipynb  # Notebook for preprocessing and labeling
│   ├── __init__.py
│   └── README.md               # Documentation for notebooks
├── tests/                      # Testing directory
│   ├── __init__.py
│   └── tokenize.py             # Test scripts for tokenization functionality
├── scripts/                    # Custom scripts
│   ├── __init__.py
│   └── telegram_scrapper.py    # Script for scraping Telegram data
├── .env                        # Environment variables
├── final_telegram_tokens.csv   # Output CSV file with final tokens
├── labeled_data_conll.txt      # Labeled data in CoNLL format
├── scraping_session.session    # Session data for scraping
└── telegram_data.csv           # Raw dataset containing Telegram messages
```
This project extracts named entities such as locations, prices, and products from messages obtained from a Telegram channel. The workflow includes data scraping, preprocessing, tokenization, and labeling using both pre-trained models and custom rules.
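Labeling in CoNLL format means one token and its entity tag per line, with a blank line separating sentences. As a rough, hedged sketch of what the labeled output looks like (the helper `to_conll` and the exact tag names such as `B-LOC` and `B-PRICE` are illustrative, not taken from this repository):

```python
def to_conll(sentences):
    """sentences: list of sentences; each sentence is a list of
    (token, tag) pairs. Returns CoNLL-style text: one "token<TAB>tag"
    line per token, with a blank line after each sentence."""
    lines = []
    for sent in sentences:
        lines.extend(f"{token}\t{tag}" for token, tag in sent)
        lines.append("")  # blank line = sentence boundary
    return "\n".join(lines)

# Illustrative example: a location and a price entity
sample = [[("Addis", "B-LOC"), ("Ababa", "I-LOC"),
           ("500", "B-PRICE"), ("ብር", "I-PRICE")]]
print(to_conll(sample))
```

Each entity spans one `B-` (begin) tag followed by zero or more `I-` (inside) tags; tokens outside any entity are tagged `O`.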
- Python 3.x: Ensure you have a compatible version of Python installed.
- Virtual Environment (optional): It’s recommended to create a virtual environment for managing dependencies.
Clone the repository:

```shell
git clone https://github.com/Atnabon/EthioMart.git
```

Install the required dependencies:

```shell
pip install -r requirements.txt
```
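The presence of `scraping_session.session` suggests a Telethon-style client, which needs Telegram API credentials supplied via the `.env` file. The variable names below are assumptions for illustration; check `telegram_scrapper.py` for the names it actually reads:

```shell
# .env — example contents (variable names are assumptions)
TELEGRAM_API_ID=12345678
TELEGRAM_API_HASH=your_api_hash_here
PHONE_NUMBER=+2519XXXXXXXX
```

Keep `.env` out of version control; it holds secrets.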
- Scraping: Run the `telegram_scrapper.py` script to scrape data if required.
- Data Preparation: Place the raw data in `telegram_data.csv`.
- Tokenization: Run the `tokenization.ipynb` notebook to tokenize the text data and generate tokens.
- Data Preprocessing: Open `PreprocessingDataLabelingStart.ipynb` to preprocess the raw data and set up labeling.
- Testing: Use the tests in the `tests/` directory to verify the functionality of your code.
- Output: Final results are saved in `final_telegram_tokens.csv` and `labeled_data_conll.txt`.
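The tokenization step can be sketched with a simple regex tokenizer; this is a minimal illustration, not the notebook's actual implementation, and a real pipeline may use a language-specific tokenizer for Amharic instead:

```python
import re

# \w matches Unicode word characters in Python 3, which includes the
# Ethiopic (Ge'ez) block used by Amharic, so words and numbers become
# tokens while punctuation is dropped.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Split a message into word/number tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("ዋጋ 500 ብር - Addis Ababa"))
# → ['ዋጋ', '500', 'ብር', 'Addis', 'Ababa']
```

Applying this per message over `telegram_data.csv` would yield the token stream that feeds the labeling step.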
Contributions are welcome! Please fork the repository and create a pull request for any enhancements or bug fixes.
This project is open-source and available for modification or extension.