Kurmanji Scraping (legacy)

This repository is an older Scrapy project that crawls a handful of Kurdish (Kurmanji) news sites, with one spider per outlet.

If you’re starting a new crawl today, prefer the newer project:

Recommended: kurdish_scrapy
Why it’s better:
- Crawls arbitrary domains from a simple JSON list (not one spider per outlet)
- Uses sitemap-first crawling with a recursive fallback
- Extracts clean article text via Trafilatura
- Filters results by Kurdish variants using FastText language ID
- Supports kmr_Latn (Kurmanji), ckb_Arab (Sorani), diq_Latn (Zazaki)
- Optional ScrapeOps user-agent rotation

Dataset

An example dataset produced from this repo is published here: https://huggingface.co/datasets/muzaffercky/kurdish-kurmanji-news

Running this repo (if you still need it)

Prerequisites

- Python 3.12 (see Pipfile)
- Pipenv
- Optional: ScrapeOps API key (user-agent rotation)

Setup

Create a .env file for the optional ScrapeOps API key (keep it out of version control; don’t commit secrets):

SCRAPEOPS_API_KEY="your_api_key_here"
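
For reference, a minimal sketch of how a key like this could be read at runtime. This is not necessarily how this repo loads it: the helper names `load_env_file` and `get_scrapeops_key` are hypothetical, and the sketch uses only the standard library rather than assuming any particular dotenv package.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values  # a missing file just means no optional settings
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in SCRAPEOPS_API_KEY="..."
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_scrapeops_key(path=".env"):
    """Return the key from the process environment or the .env file, else None."""
    return os.environ.get("SCRAPEOPS_API_KEY") or load_env_file(path).get("SCRAPEOPS_API_KEY")
```

Because the key is optional, a `None` result can simply mean that user-agent rotation stays disabled.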

Run a spider

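Example (replace {file} with an output path; Scrapy infers the feed format from the file extension, such as .json, .jsonl, or .csv):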

scrapy crawl xwebun -o {file}
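
If you want to launch the crawl from a script, the command above can be assembled as an argument list. This is a small sketch, not part of the repo: `build_crawl_command` is a hypothetical helper and `xwebun.jsonl` is an illustrative output name.

```python
import shlex


def build_crawl_command(spider, output_path):
    """Return the argv list for `scrapy crawl <spider> -o <output_path>`."""
    if not spider or not output_path:
        raise ValueError("spider and output_path are both required")
    return ["scrapy", "crawl", spider, "-o", output_path]


# "xwebun" is the spider named in this README; the output file name is hypothetical.
command = build_crawl_command("xwebun", "xwebun.jsonl")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, run from the directory that contains scrapy.cfg so Scrapy can locate the project.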
