This Python-based script is designed to scrape Telegram channel URLs based on user-provided keywords, using Selenium and BeautifulSoup. It allows for the extraction of Telegram channel IDs, the option to save results, and a method for combining and deduplicating extracted channel IDs from previous results.
- Keyword-based Search: Input keywords to find relevant Telegram channels.
- Dynamic Content Handling: Uses Selenium to interact with dynamic web content.
- Channel ID Extraction: Extracts unique Telegram channel IDs from URLs.
- Customizable Scraping: Control the number of pages to scrape and the sorting method (e.g., by date).
- Output Management: Saves extracted channels and IDs to text files with timestamped filenames.
- ID Deduplication: Combine and deduplicate channel IDs from previously saved results.
time
: Manages delays during the scraping process.os
: Handles file and directory operations.urllib.parse
: URL encoding for search keywords.json
: For loading keywords from a JSON file.selenium
: Web automation for dynamic content loading.webdriver
WebDriverWait
expected_conditions
By
webdriver_manager
: Automatic management of ChromeDriver.bs4 (BeautifulSoup)
: HTML parsing to extract URLs.datetime
: Adds timestamps to output filenames.re
: Regular expressions to extract Telegram channel IDs.
Install the dependencies using pip:
pip install selenium webdriver-manager beautifulsoup4
git clone https://github.com/MRAbbasi1/TelegramScraper.git
cd TelegramScraper
Ensure you have Python 3.7+ installed on your system. Follow these steps to set up the virtual environment and install dependencies:
-
Remove any existing virtual environment (optional):
rm -rf .venv
-
Create a new virtual environment:
python3 -m venv .venv
-
Activate the virtual environment:
- On Linux/macOS:
source .venv/bin/activate
- On Windows:
.venv\Scripts\activate
- On Linux/macOS:
-
Install the required dependencies:
pip install --upgrade pip pip install -r requirements.txt
The script provides two main functionalities:
- Searching for Telegram channels based on keywords.
- Extracting and deduplicating channel IDs from existing output files.
-
Run the main script:
- Using Python:
python run.py
- Alternatively (if
python3
is required):python3 run.py
- Using Python:
-
Activate the virtual environment before running the script (if not already active):
- On Linux/macOS:
source .venv/bin/activate
- On Windows:
.venv\Scripts\activate
- On Linux/macOS:
-
Ensure all necessary permissions are granted for accessing files or directories.
-
For troubleshooting, review the logs or error messages generated by the script.
-
Keep dependencies up-to-date using:
pip install --upgrade -r requirements.txt
-
If you encounter issues with Python versions or dependencies, consider using a tool like pyenv to manage multiple Python versions.
-
User Input Prompts
The script will prompt you to choose between two actions:
- Search for Keywords: Enter keywords to search Telegram channels.
- Extract and Deduplicate IDs: Combine and deduplicate previously saved channel IDs.
-
Output Files
The script saves the results in theOutput/
directory, with filenames containing the keyword and timestamp:- Example:
Output/music_2024-12-25_12-00-00.txt
- Optionally, unique channel IDs can be saved in separate files.
- Example:
-
Search for Channels:
Do you want to search for keywords or only extract IDs from the Output folder? (search/extract): search Enter the path to the keywords JSON file: keywords.json Do you want to extract and save unique channel IDs for each keyword? (yes/no): yes Enter the number of pages to scrape (e.g., 5): 3 Enter sorting option ('date' or leave empty for relevance): date
-
Extract and Deduplicate Channel IDs:
Do you want to search for keywords or only extract IDs from the Output folder? (search/extract): extract All unique channel IDs saved to Output/final_channel_ids_2024-12-25_12-00-00.txt
-
setup_driver()
: Sets up a headless Selenium WebDriver using WebDriver Manager. -
load_keywords_from_json(json_path)
: Loads keywords from a JSON file. Example format:{ "keywords": ["music", "technology", "news"] }
-
extract_channel_urls(driver, page_source)
: Extracts Telegram channel URLs from the page source using BeautifulSoup. -
extract_channel_ids(urls)
: Extracts unique channel IDs using regular expressions and returns them in sorted order. -
combine_all_channel_ids(output_dir)
: Combines and deduplicates all channel IDs from.txt
files in theOutput/
directory. -
scrape_telegram_channels(keywords, output_dir, max_tabs, sort_by, save_ids_decision)
: The main function that:- Constructs search URLs based on user-provided keywords.
- Loads pages using Selenium and waits for dynamic content.
- Extracts URLs and optionally saves channel IDs.
TelegramScraper/
├── run.py # Main script for scraping and extracting
├── Output/ # Directory for saving output files
│ ├── music_2024-12-25_12-00-00.txt # Example result file
│ ├── music_ids_2024-12-25_12-00-00.txt # Example ID file (if enabled)
└── README.md # Documentation (this file)
- Ensure Chrome is installed on your system, as the script uses
chromedriver
for Selenium. - If you encounter issues with
chromedriver
, update it usingwebdriver-manager
. - The website you scrape from must be accessible. If it changes structure, updates to the script may be required.
This script is for educational and personal use only. Web scraping may violate the terms of service of some websites. Use responsibly and ensure compliance with all applicable laws.
Feel free to contribute by submitting issues or pull requests. Improvements and bug fixes are always welcome!
This project is licensed under the MIT License.
- Load Keywords from JSON: You can now load keywords from a JSON file, making it easier to manage multiple search terms.
- Dynamic Scraping: The script dynamically scrapes content using Selenium, which is essential for websites that load content dynamically (e.g., via JavaScript).
- Channel ID Extraction: Unique channel IDs are extracted and saved in separate files if needed.
- ID Deduplication: A function has been added to combine and deduplicate channel IDs across multiple result files.