Atnabon/EthioMart

Telegram Data Processing Project

This project extracts named entities from Telegram messages, tokenizes the text, and prepares the labeled data for further analysis. It is organized as Python scripts, Jupyter notebooks, and unit tests.

Project Structure

├── .vscode/                       # Visual Studio Code settings
│   └── settings.json
├── .github/                       # GitHub workflows
│   └── workflows/
│       └── unittests.yml
├── .gitignore                     # Ignored files in version control
├── requirements.txt               # List of Python dependencies
├── README.md                      # Project documentation
├── src/                           # Source code directory
│   └── __init__.py
├── notebooks/                     # Jupyter notebooks for analysis
│   ├── tokenization.ipynb         # Notebook for tokenizing text data
│   ├── PreprocessingDataLabelingStart.ipynb  # Notebook for preprocessing data and labeling
│   ├── __init__.py
│   └── README.md                  # Documentation for notebooks
├── tests/                         # Testing directory
│   ├── __init__.py
│   └── tokenize.py                # Test scripts for tokenization functionality
├── scripts/                       # Custom scripts
│   ├── __init__.py
│   └── telegram_scrapper.py       # Script for scraping Telegram data
├── .env                           # Environment variables
├── final_telegram_tokens.csv      # Output CSV file with final tokens
├── labeled_data_conll.txt         # Labeled data in CoNLL format
├── scraping_session.session       # Session data for scraping
└── telegram_data.csv              # Raw dataset containing Telegram messages

Overview

This project extracts named entities such as locations, prices, and products from messages obtained from a Telegram channel. The workflow includes data scraping, preprocessing, tokenization, and labeling using both pre-trained models and custom rules.
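As an illustration of what a custom labeling rule might look like, here is a minimal sketch that tags a number followed by a currency word as a price. The `label_prices` function, the `B-PRICE`/`I-PRICE` tag names, and the currency word list are assumptions made for this example; the rules actually used in the notebooks may differ.

```python
import re

# Hypothetical currency words; not taken from the project's notebooks.
CURRENCY_WORDS = {"birr", "ብር", "etb"}

# A number such as 1500, 1,500 or 1500.50
NUMBER = re.compile(r"^\d[\d,]*(\.\d+)?$")


def label_prices(tokens):
    """Tag '<number> <currency word>' pairs as B-PRICE / I-PRICE, all else as O."""
    labels = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        if NUMBER.match(tokens[i]) and tokens[i + 1].lower() in CURRENCY_WORDS:
            labels[i] = "B-PRICE"
            labels[i + 1] = "I-PRICE"
    return labels


if __name__ == "__main__":
    tokens = ["Price", "1,500", "birr", "Bole", "branch"]
    for token, label in zip(tokens, label_prices(tokens)):
        print(token, label)
```

A rule like this only covers one surface pattern; locations and products would typically need a pre-trained model or additional rules.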

Getting Started

Prerequisites

  • Python 3.x: Ensure you have a compatible version of Python installed.
  • Virtual Environment (optional): It’s recommended to create a virtual environment for managing dependencies.

Installation

  1. Clone the repository:

    git clone https://github.com/Atnabon/EthioMart.git
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    

How to Use

  1. Scraping: If you need fresh data, run scripts/telegram_scrapper.py to scrape messages from Telegram.
  2. Data Preparation: Make sure the raw messages are in telegram_data.csv in the repository root.
  3. Tokenization: Run the notebooks/tokenization.ipynb notebook to tokenize the text data and generate tokens.
  4. Data Preprocessing: Run notebooks/PreprocessingDataLabelingStart.ipynb to preprocess the raw data and set up labeling.
  5. Testing: Run the tests in the tests/ directory to verify the tokenization code.
  6. Output: Final results will be saved in final_telegram_tokens.csv and labeled_data_conll.txt.
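The tokenization and output steps above can be sketched roughly as follows. This is a simplified stand-in, not the code in the notebooks: the `tokenize` and `to_conll` helpers and the regular expression are assumptions chosen for illustration, and the example labels are made up. The output follows the usual CoNLL layout of one `token label` pair per line with a blank line between messages.

```python
import re

# Words (Unicode-aware, so Amharic script is kept together) or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a message into word and punctuation tokens."""
    return TOKEN.findall(text)


def to_conll(sentences):
    """Render labeled messages as CoNLL text.

    `sentences` is a list of messages, each a list of (token, label) pairs.
    """
    blocks = [
        "\n".join(f"{token} {label}" for token, label in sentence)
        for sentence in sentences
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    tokens = tokenize("Price: 1500 birr!")
    # Every token is labeled O here; real labels would come from the labeling step.
    print(to_conll([[(token, "O") for token in tokens]]), end="")
```

Writing the returned string to a file such as labeled_data_conll.txt would be the last step; how the project itself writes that file is not shown here.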

Contribution

Contributions are welcome! Please fork the repository and create a pull request for any enhancements or bug fixes.

License

This project is open-source and available for modification or extension.
