A Scrapy-based web scraper for collecting Kurdish text data from websites. The tool recursively crawls specified domains, extracts article content using Trafilatura, and filters results by language using Facebook's FastText language identification model.
- Recursive crawling - Crawls entire websites following internal links
- Language detection - Filters content by Kurdish language variants:
  - `kmr_Latn` - Kurmanji (Northern Kurdish, Latin script)
  - `ckb_Arab` - Sorani (Central Kurdish, Arabic script)
  - `diq_Latn` - Zazaki (Latin script)
- Content extraction - Extracts clean article text, title, and metadata using Trafilatura
- Smart filtering - Skips media files, non-HTML content, and short texts
- Anti-bot protection - Rotates user agents via ScrapeOps
- Duplicate handling - Built-in URL deduplication
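The language and length filter can be sketched as a small predicate around the FastText prediction. The function name and the injected `predict` callable are illustrative, not the project's actual code; `predict` stands in for a thin wrapper around the FastText model that returns a `(lang_code, confidence)` pair:

```python
def should_keep(text, predict, allowed_langs, min_words=100):
    """Return True if the text is long enough and in an allowed language.

    `predict` is any callable mapping text -> (lang_code, confidence),
    e.g. a wrapper around a FastText language-identification model.
    """
    if len(text.split()) < min_words:
        return False
    lang, score = predict(text)
    return lang in allowed_langs

# Example with a stub classifier standing in for the FastText model:
stub = lambda text: ("kmr_Latn", 0.98)
print(should_keep("ev " * 150, stub, {"kmr_Latn", "ckb_Arab"}))  # True
print(should_keep("kurte nivîs", stub, {"kmr_Latn"}))            # False
```

Injecting the classifier keeps the filtering logic testable without loading the model file.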
- Python 3.10
- Pipenv
- ScrapeOps API key (optional, free tier available)
- Clone the repository:

```shell
git clone git@github.com:cikay/kurdish_scrapy.git
cd kurdish_scrapy
```

- Create and activate a virtual environment:

```shell
pipenv --python 3.10
pipenv shell
```

- Install dependencies:

```shell
pipenv install
```

- Create a `.env` file with your configuration:

```
ALLOWED_LANGS="kmr_Latn,ckb_Arab,diq_Latn"
TEXT_MIN_WORD_COUNT=100
# Optional
# SCRAPEOPS_API_KEY="your_api_key_here"
```

| Variable | Description | Default |
|---|---|---|
| `SCRAPEOPS_API_KEY` | API key for ScrapeOps user agent rotation | Optional |
| `ALLOWED_LANGS` | Comma-separated language codes to collect | `kmr_Latn,ckb_Arab,diq_Latn` |
| `TEXT_MIN_WORD_COUNT` | Minimum word count for collected texts | `100` |
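Loading these variables can be sketched as below. The variable names match the table above; the helper function itself is illustrative and not part of the project:

```python
import os

def load_settings(env=os.environ):
    """Parse the scraper's configuration from environment variables,
    falling back to the defaults listed in the table above."""
    return {
        "allowed_langs": [
            code.strip()
            for code in env.get("ALLOWED_LANGS", "kmr_Latn,ckb_Arab,diq_Latn").split(",")
            if code.strip()
        ],
        "min_word_count": int(env.get("TEXT_MIN_WORD_COUNT", "100")),
        "scrapeops_api_key": env.get("SCRAPEOPS_API_KEY"),  # None if unset
    }

# Usage: load_settings() reads the real environment; a dict can be
# passed in for testing, e.g. load_settings({"ALLOWED_LANGS": "kmr_Latn"}).
```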
Note: `SCRAPEOPS_API_KEY` is currently optional and scraping may still work without it. If this changes in the future and requests start failing, either:
- obtain a valid ScrapeOps API key, or
- remove `kurdish_scrapy.middlewares.ScrapeOpsFakeUserAgentMiddleware` from `DOWNLOADER_MIDDLEWARES` in `kurdish_scrapy/settings.py`.
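Disabling the middleware amounts to removing (or commenting out) its entry in the `DOWNLOADER_MIDDLEWARES` setting. The fragment below is a sketch of what that looks like; the priority value `400` is an assumption, so check the actual value in `kurdish_scrapy/settings.py`:

```python
# kurdish_scrapy/settings.py (sketch; the priority 400 is an assumption)
DOWNLOADER_MIDDLEWARES = {
    # Comment out or delete this line to stop using ScrapeOps:
    # "kurdish_scrapy.middlewares.ScrapeOpsFakeUserAgentMiddleware": 400,
}
```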
Edit `kurdish_domains.json` and list the domains you want to crawl:

```json
[
  "https://www.nuhev.com/",
  "https://ajansawelat.com/"
]
```

Then run:

```shell
python main.py --output output.csv
```

For production/server runs, write logs to a file so crashes are preserved:

```shell
python main.py --output output.csv --log-file logs/crawler.log --log-level INFO
```

`main.py` reads `kurdish_domains.json` and passes those domains to `run_crawler.py`.
For each domain, the runner tries SitemapSpider first (using robots.txt and common sitemap paths). If no sitemap is found, it falls back to RecursiveSpider.
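The sitemap discovery step can be sketched as building candidate URLs from the domain's robots.txt plus a few conventional paths. The helper and the exact fallback paths below are illustrative, not necessarily the ones the runner probes:

```python
from urllib.parse import urljoin

# Conventional sitemap locations to try (illustrative list).
COMMON_SITEMAP_PATHS = ["sitemap.xml", "sitemap_index.xml", "sitemap/sitemap.xml"]

def candidate_sitemap_urls(domain, robots_txt=""):
    """Collect sitemap URLs declared in robots.txt, then common fallbacks."""
    candidates = []
    for line in robots_txt.splitlines():
        # robots.txt declares sitemaps as "Sitemap: <url>" lines.
        if line.lower().startswith("sitemap:"):
            candidates.append(line.split(":", 1)[1].strip())
    candidates += [urljoin(domain, path) for path in COMMON_SITEMAP_PATHS]
    return candidates
```

If none of the candidates yields a valid sitemap, the runner falls back to `RecursiveSpider`, which follows internal links instead.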
Supported output formats: .csv, .json, .jsonl
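Selecting the Scrapy feed format from the output file's extension can be sketched as below. The helper is illustrative; the format names `csv`, `json`, and `jsonlines` are Scrapy's built-in feed exporter names:

```python
from pathlib import Path

# Map output file extensions to Scrapy feed exporter formats.
FORMATS = {".csv": "csv", ".json": "json", ".jsonl": "jsonlines"}

def feed_settings(output_path):
    """Build a Scrapy FEEDS dict keyed by the output path."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported output format: {suffix}")
    return {output_path: {"format": FORMATS[suffix]}}
```

Such a dict can be passed as the `FEEDS` setting when starting the crawler process.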
```shell
python rows_count.py --file-name output.csv
```

This displays:
- Total row count
- Unique titles, URLs, and texts count
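The statistics above can be sketched with the standard csv module. The column names follow the output-fields table further below; the helper itself is illustrative, not the actual `rows_count.py` implementation:

```python
import csv
import io

def file_stats(fileobj):
    """Count total rows and unique titles/URLs/texts in a CSV feed."""
    rows = list(csv.DictReader(fileobj))
    return {
        "total": len(rows),
        "unique_titles": len({r["title"] for r in rows}),
        "unique_urls": len({r["url"] for r in rows}),
        "unique_texts": len({r["text"] for r in rows}),
    }

# Example on an in-memory CSV with one duplicated URL:
sample = "title,url,text\na,u1,t1\nb,u1,t2\n"
print(file_stats(io.StringIO(sample)))
# {'total': 2, 'unique_titles': 2, 'unique_urls': 1, 'unique_texts': 2}
```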
Use `bencmark.py` to benchmark one domain by running both spiders sequentially (`SitemapSpider` first, then `RecursiveSpider`) and writing timing logs.

Arguments:
- `--domain`: Required start URL/domain to crawl (use full URL, e.g. `https://www.nuhev.com`)
- `--sitemap`: Output file for sitemap crawl (`.csv`, `.json`, or `.jsonl`)
- `--recursive`: Output file for recursive crawl (`.csv`, `.json`, or `.jsonl`)
- `--benchmark-log` (optional): Log file path for timing details (default: `benchmark.log`)

Example with default log path:

```shell
python bencmark.py --domain https://www.nuhev.com --sitemap sitemap_output.csv --recursive recursive_output.csv
```

The spider outputs the following fields:
| Field | Description |
|---|---|
| `text` | Extracted article content |
| `title` | Article title |
| `url` | Source URL |
| `publisher` | Website/publisher name |
| `word_count` | Word count (calculated by whitespace splitting) |
| `lang` | Detected language code |
| `lang_score` | Language detection confidence score |
| `source_type` | Content type (default: `news`) |
```
├── kurdish_scrapy/
│   ├── spiders/
│   │   ├── sitemap.py        # Sitemap-based spider
│   │   ├── recursive.py      # Recursive fallback spider
│   │   └── base.py           # Shared spider base class
│   ├── items.py              # Data item schema
│   ├── middlewares.py        # User agent rotation & URL filtering
│   ├── pipelines.py          # Language & length filtering
│   ├── settings.py           # Scrapy configuration
│   └── lang_model.py         # FastText language model loader
├── extractor/
│   ├── text_extractor.py     # Trafilatura-based content extraction
│   ├── url_extractor.py      # URL parsing and filtering
│   └── protocol.py           # Extractor protocol interface
├── run_crawler.py            # Spider selection + feed setup
├── main.py                   # CLI entrypoint
├── kurdish_domains.json      # Crawl target domains
├── bencmark.py               # Sitemap vs recursive benchmark runner
├── rows_count.py             # Utility for data statistics
├── Pipfile                   # Dependencies
└── .env                      # Environment variables (create this)
```