This project focuses on extracting named entities from Telegram messages, tokenizing the data, and preparing it for further analysis. It utilizes various Python scripts, notebooks, and testing frameworks to facilitate efficient data processing.
```
├── .vscode/                    # Visual Studio Code settings
│   └── settings.json
├── .github/                    # GitHub workflows
│   └── workflows/
│       └── unittests.yml
├── .gitignore                  # Files ignored by version control
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── src/                        # Source code directory
│   └── __init__.py
├── notebooks/                  # Jupyter notebooks for analysis
│   ├── tokenization.ipynb      # Notebook for tokenizing text data
│   ├── PreprocessingDataLabelingStart.ipynb  # Notebook for preprocessing and labeling
│   ├── __init__.py
│   └── README.md               # Documentation for notebooks
├── tests/                      # Testing directory
│   ├── __init__.py
│   └── tokenize.py             # Test scripts for tokenization functionality
├── scripts/                    # Custom scripts
│   ├── __init__.py
│   └── telegram_scrapper.py    # Script for scraping Telegram data
├── .env                        # Environment variables
├── final_telegram_tokens.csv   # Output CSV file with final tokens
├── labeled_data_conll.txt      # Labeled data in CoNLL format
├── scraping_session.session    # Session data for scraping
└── telegram_data.csv           # Raw dataset containing Telegram messages
```
This project extracts named entities such as locations, prices, and products from messages obtained from a Telegram channel. The workflow includes data scraping, preprocessing, tokenization, and labeling using both pre-trained models and custom rules.
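Labeling in CoNLL format means one token and its entity tag per line, with a blank line separating sentences. As a rough, hedged sketch of what the labeled output looks like (the helper `to_conll` and the exact tag names such as `B-LOC` and `B-PRICE` are illustrative, not taken from this repository):

```python
def to_conll(sentences):
    """sentences: list of sentences; each sentence is a list of
    (token, tag) pairs. Returns CoNLL-style text: one "token<TAB>tag"
    line per token, with a blank line after each sentence."""
    lines = []
    for sent in sentences:
        lines.extend(f"{token}\t{tag}" for token, tag in sent)
        lines.append("")  # blank line = sentence boundary
    return "\n".join(lines)

# Illustrative example: a location and a price entity
sample = [[("Addis", "B-LOC"), ("Ababa", "I-LOC"),
           ("500", "B-PRICE"), ("ብር", "I-PRICE")]]
print(to_conll(sample))
```

Each entity spans one `B-` (begin) tag followed by zero or more `I-` (inside) tags; tokens outside any entity are tagged `O`.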
- Python 3.x: Ensure you have a compatible version of Python installed.
- Virtual Environment (optional): It’s recommended to create a virtual environment for managing dependencies.
Clone the repository:

```shell
git clone https://github.com/Atnabon/EthioMart.git
```

Install the required dependencies:

```shell
pip install -r requirements.txt
```
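The presence of `scraping_session.session` suggests a Telethon-style client, which needs Telegram API credentials supplied via the `.env` file. The variable names below are assumptions for illustration; check `telegram_scrapper.py` for the names it actually reads:

```shell
# .env — example contents (variable names are assumptions)
TELEGRAM_API_ID=12345678
TELEGRAM_API_HASH=your_api_hash_here
PHONE_NUMBER=+2519XXXXXXXX
```

Keep `.env` out of version control; it holds secrets.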
- Scraping: Run the `telegram_scrapper.py` script to scrape data if required.
- Data Preparation: Place the raw data in `telegram_data.csv`.
- Tokenization: Run the `tokenization.ipynb` notebook to tokenize the text data and generate tokens.
- Data Preprocessing: Open `PreprocessingDataLabelingStart.ipynb` to preprocess the raw data and set up labeling.
- Testing: Use the tests in the `tests/` directory to verify the functionality of your code.
- Output: Final results are saved in `final_telegram_tokens.csv` and `labeled_data_conll.txt`.
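The tokenization step can be sketched with a simple regex tokenizer; this is a minimal illustration, not the notebook's actual implementation, and a real pipeline may use a language-specific tokenizer for Amharic instead:

```python
import re

# \w matches Unicode word characters in Python 3, which includes the
# Ethiopic (Ge'ez) block used by Amharic, so words and numbers become
# tokens while punctuation is dropped.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Split a message into word/number tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("ዋጋ 500 ብር - Addis Ababa"))
# → ['ዋጋ', '500', 'ብር', 'Addis', 'Ababa']
```

Applying this per message over `telegram_data.csv` would yield the token stream that feeds the labeling step.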
Contributions are welcome! Please fork the repository and create a pull request for any enhancements or bug fixes.
This project is open-source and available for modification or extension.