A Scrapy-based web scraper for collecting Kurdish text data from websites. The tool recursively crawls specified domains, extracts article content using Trafilatura, and filters results by language using Facebook's FastText language identification model.
- Recursive crawling - Crawls entire websites following internal links
- Language detection - Filters content by Kurdish language variants:
  - `kmr_Latn` - Kurmanji (Northern Kurdish, Latin script)
  - `ckb_Arab` - Sorani (Central Kurdish, Arabic script)
  - `diq_Latn` - Zazaki (Latin script)
- Content extraction - Extracts clean article text, title, and metadata using Trafilatura
- Smart filtering - Skips media files, non-HTML content, and short texts
- Anti-bot protection - Rotates user agents via ScrapeOps
- Duplicate handling - Built-in URL deduplication
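The language and length filter can be sketched as a small predicate around the FastText prediction. The function name and the injected `predict` callable are illustrative, not the project's actual code; `predict` stands in for a thin wrapper around the FastText model that returns a `(lang_code, confidence)` pair:

```python
def should_keep(text, predict, allowed_langs, min_words=100):
    """Return True if the text is long enough and in an allowed language.

    `predict` is any callable mapping text -> (lang_code, confidence),
    e.g. a wrapper around a FastText language-identification model.
    """
    if len(text.split()) < min_words:
        return False
    lang, score = predict(text)
    return lang in allowed_langs

# Example with a stub classifier standing in for the FastText model:
stub = lambda text: ("kmr_Latn", 0.98)
print(should_keep("ev " * 150, stub, {"kmr_Latn", "ckb_Arab"}))  # True
print(should_keep("kurte nivîs", stub, {"kmr_Latn"}))            # False
```

Injecting the classifier keeps the filtering logic testable without loading the model file.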
- Python 3.10
- Pipenv
- ScrapeOps API key (optional, free tier available)
- Clone the repository:

```shell
git clone git@github.com:cikay/kurdish_scrapy.git
cd kurdish_scrapy
```

- Create and activate a virtual environment:

```shell
pipenv --python 3.10
pipenv shell
```

- Install dependencies:

```shell
pipenv install
```

- Create a `.env` file with your configuration:

```
ALLOWED_LANGS="kmr_Latn,ckb_Arab,diq_Latn"
TEXT_MIN_WORD_COUNT=100
# Optional
# SCRAPEOPS_API_KEY="your_api_key_here"
```

| Variable | Description | Default |
|---|---|---|
| `SCRAPEOPS_API_KEY` | API key for ScrapeOps user agent rotation | Optional |
| `ALLOWED_LANGS` | Comma-separated language codes to collect | `kmr_Latn,ckb_Arab,diq_Latn` |
| `TEXT_MIN_WORD_COUNT` | Minimum word count for collected texts | `100` |
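Loading these variables can be sketched as below. The variable names match the table above; the helper function itself is illustrative and not part of the project:

```python
import os

def load_settings(env=os.environ):
    """Parse the scraper's configuration from environment variables,
    falling back to the defaults listed in the table above."""
    return {
        "allowed_langs": [
            code.strip()
            for code in env.get("ALLOWED_LANGS", "kmr_Latn,ckb_Arab,diq_Latn").split(",")
            if code.strip()
        ],
        "min_word_count": int(env.get("TEXT_MIN_WORD_COUNT", "100")),
        "scrapeops_api_key": env.get("SCRAPEOPS_API_KEY"),  # None if unset
    }

# Usage: load_settings() reads the real environment; a dict can be
# passed in for testing, e.g. load_settings({"ALLOWED_LANGS": "kmr_Latn"}).
```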
Note: `SCRAPEOPS_API_KEY` is currently optional and scraping may still work without it. If this changes in the future and requests start failing, either:
- obtain a valid ScrapeOps API key, or
- remove `kurdish_scrapy.middlewares.ScrapeOpsFakeUserAgentMiddleware` from `DOWNLOADER_MIDDLEWARES` in `kurdish_scrapy/settings.py`.
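Disabling the middleware amounts to removing (or commenting out) its entry in the `DOWNLOADER_MIDDLEWARES` setting. The fragment below is a sketch of what that looks like; the priority value `400` is an assumption, so check the actual value in `kurdish_scrapy/settings.py`:

```python
# kurdish_scrapy/settings.py (sketch; the priority 400 is an assumption)
DOWNLOADER_MIDDLEWARES = {
    # Comment out or delete this line to stop using ScrapeOps:
    # "kurdish_scrapy.middlewares.ScrapeOpsFakeUserAgentMiddleware": 400,
}
```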
Edit `kurdish_domains.json` and list the domains you want to crawl:

```json
[
  "https://www.nuhev.com/",
  "https://ajansawelat.com/"
]
```

Then run:

```shell
python main.py --output output.csv
```

For production/server runs, write logs to a file so crashes are preserved:

```shell
python main.py --output output.csv --log-file logs/crawler.log --log-level INFO
```

`main.py` reads `kurdish_domains.json` and passes those domains to `run_crawler.py`.
For each domain, the runner tries SitemapSpider first (using robots.txt and common sitemap paths). If no sitemap is found, it falls back to RecursiveSpider.
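The sitemap discovery step can be sketched as building candidate URLs from the domain's robots.txt plus a few conventional paths. The helper and the exact fallback paths below are illustrative, not necessarily the ones the runner probes:

```python
from urllib.parse import urljoin

# Conventional sitemap locations to try (illustrative list).
COMMON_SITEMAP_PATHS = ["sitemap.xml", "sitemap_index.xml", "sitemap/sitemap.xml"]

def candidate_sitemap_urls(domain, robots_txt=""):
    """Collect sitemap URLs declared in robots.txt, then common fallbacks."""
    candidates = []
    for line in robots_txt.splitlines():
        # robots.txt declares sitemaps as "Sitemap: <url>" lines.
        if line.lower().startswith("sitemap:"):
            candidates.append(line.split(":", 1)[1].strip())
    candidates += [urljoin(domain, path) for path in COMMON_SITEMAP_PATHS]
    return candidates
```

If none of the candidates yields a valid sitemap, the runner falls back to `RecursiveSpider`, which follows internal links instead.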
Supported output formats: .csv, .json, .jsonl
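Selecting the Scrapy feed format from the output file's extension can be sketched as below. The helper is illustrative; the format names `csv`, `json`, and `jsonlines` are Scrapy's built-in feed exporter names:

```python
from pathlib import Path

# Map output file extensions to Scrapy feed exporter formats.
FORMATS = {".csv": "csv", ".json": "json", ".jsonl": "jsonlines"}

def feed_settings(output_path):
    """Build a Scrapy FEEDS dict keyed by the output path."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported output format: {suffix}")
    return {output_path: {"format": FORMATS[suffix]}}
```

Such a dict can be passed as the `FEEDS` setting when starting the crawler process.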
```shell
python rows_count.py --file-name output.csv
```

This displays:
- Total row count
- Unique titles, URLs, and texts count
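The statistics above can be sketched with the standard csv module. The column names follow the output-fields table further below; the helper itself is illustrative, not the actual `rows_count.py` implementation:

```python
import csv
import io

def file_stats(fileobj):
    """Count total rows and unique titles/URLs/texts in a CSV feed."""
    rows = list(csv.DictReader(fileobj))
    return {
        "total": len(rows),
        "unique_titles": len({r["title"] for r in rows}),
        "unique_urls": len({r["url"] for r in rows}),
        "unique_texts": len({r["text"] for r in rows}),
    }

# Example on an in-memory CSV with one duplicated URL:
sample = "title,url,text\na,u1,t1\nb,u1,t2\n"
print(file_stats(io.StringIO(sample)))
# {'total': 2, 'unique_titles': 2, 'unique_urls': 1, 'unique_texts': 2}
```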
Use `bencmark.py` to benchmark one domain by running both spiders sequentially (`SitemapSpider` first, then `RecursiveSpider`) and writing timing logs.

Arguments:
- `--domain`: Required start URL/domain to crawl (use full URL, e.g. `https://www.nuhev.com`)
- `--sitemap`: Output file for sitemap crawl (`.csv`, `.json`, or `.jsonl`)
- `--recursive`: Output file for recursive crawl (`.csv`, `.json`, or `.jsonl`)
- `--benchmark-log` (optional): Log file path for timing details (default: `benchmark.log`)

Example with default log path:

```shell
python bencmark.py --domain https://www.nuhev.com --sitemap sitemap_output.csv --recursive recursive_output.csv
```

The spider outputs the following fields:
| Field | Description |
|---|---|
| `text` | Extracted article content |
| `title` | Article title |
| `url` | Source URL |
| `publisher` | Website/publisher name |
| `word_count` | Word count (calculated by whitespace splitting) |
| `lang` | Detected language code |
| `lang_score` | Language detection confidence score |
| `source_type` | Content type (default: `news`) |
```
├── kurdish_scrapy/
│   ├── spiders/
│   │   ├── sitemap.py        # Sitemap-based spider
│   │   ├── recursive.py      # Recursive fallback spider
│   │   └── base.py           # Shared spider base class
│   ├── items.py              # Data item schema
│   ├── middlewares.py        # User agent rotation & URL filtering
│   ├── pipelines.py          # Language & length filtering
│   ├── settings.py           # Scrapy configuration
│   └── lang_model.py         # FastText language model loader
├── extractor/
│   ├── text_extractor.py     # Trafilatura-based content extraction
│   ├── url_extractor.py      # URL parsing and filtering
│   └── protocol.py           # Extractor protocol interface
├── run_crawler.py            # Spider selection + feed setup
├── main.py                   # CLI entrypoint
├── kurdish_domains.json      # Crawl target domains
├── bencmark.py               # Sitemap vs recursive benchmark runner
├── rows_count.py             # Utility for data statistics
├── Pipfile                   # Dependencies
└── .env                      # Environment variables (create this)
```