Portuguese Wikipedia Dump Extraction

This project downloads the latest Portuguese Wikipedia XML dump and extracts filtered, cleaned article text into JSONL and Parquet for downstream NLP/LLM workflows.

Source dump:

https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

Quick subset:

python -m wiki_pt_extract.cli --max-pages 2000 --skip-redirects

Full extraction:

python -m wiki_pt_extract.cli --skip-redirects

The program creates data/ and data/raw/ automatically.

Filtering Pipeline (Execution Order)

Filtering happens in build_filtered_rows() and is applied in this order:

1. Read each <page> from the compressed XML dump (iter_pages_from_bz2), extracting:
   - title
   - page_id
   - ns (namespace)
   - latest revision text
2. Stop early if --max-pages is reached.
3. Namespace filter:
   - default: keep only main namespace (ns == 0)
   - use --include-non-main to keep other namespaces
4. Text normalization:
   - normalize line breaks (\r\n / \r -> \n)
   - trim leading/trailing whitespace
5. Wikitext cleaning (_clean_wikitext):
   - remove list-like lines (those starting with *, #, ;, or :) unless --keep-lists is set
   - remove media/file wikilinks (File:, Ficheiro:, Imagem:, Image:)
   - remove all templates
   - remove noisy tags: ref, references, table, gallery, math, code, syntaxhighlight, timeline, pre, source
   - split into sections (lead included), strip markup to plain text, and apply regex cleanup for media/options leftovers, empty brackets, and excess whitespace
   - keep only sections with at least --min-section-chars characters (default: 1)
   - join kept sections into final text, and also store per-section output in section_texts
6. Empty-content removal:
   - if cleaned text is empty, the page is dropped
7. Redirect filtering:
   - applied only when --skip-redirects is set
   - drops pages whose cleaned text starts with #redirect or #redirecionamento (case-insensitive)
8. Disambiguation filtering:
   - default: drop disambiguation pages
   - the current implementation checks for a title pattern like (desambiguação) and also applies a {{desambigua...}} regex to the current text value
   - use --include-disambiguation to keep them
9. Remaining pages are written to JSONL and Parquet.
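
The order of the page-level checks above can be sketched as follows. This is a minimal illustration, not the project's code: normalize_text and keep_page are hypothetical names, the title regex is an assumption about what "a title pattern like (desambiguação)" means, and the wikitext cleaning step and the {{desambigua...}} text regex are left out.

```python
import re

# Assumption: disambiguation titles end with "(desambiguação)".
DISAMBIG_TITLE_RE = re.compile(r"\(desambiguação\)\s*$", re.IGNORECASE)
REDIRECT_PREFIXES = ("#redirect", "#redirecionamento")


def normalize_text(raw: str) -> str:
    """Convert CRLF and CR line breaks to LF, then trim surrounding whitespace."""
    return raw.replace("\r\n", "\n").replace("\r", "\n").strip()


def keep_page(
    title: str,
    ns: int,
    text: str,
    *,
    skip_redirects: bool = False,
    include_non_main: bool = False,
    include_disambiguation: bool = False,
) -> bool:
    """Apply the page-level filters in the documented order (hypothetical helper)."""
    # Namespace filter: main namespace only, unless told otherwise.
    if not include_non_main and ns != 0:
        return False
    # Empty-content removal.
    if not text:
        return False
    # Redirect filtering, only when requested; the prefix match is case-insensitive.
    if skip_redirects and text.lower().startswith(REDIRECT_PREFIXES):
        return False
    # Disambiguation filtering by title pattern only.
    if not include_disambiguation and DISAMBIG_TITLE_RE.search(title):
        return False
    return True
```

For example, keep_page("Mercúrio (desambiguação)", 0, "x") returns False, while the same call with include_disambiguation=True returns True.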

At the end, the CLI prints counters for:

- pages seen
- pages written
- redirects skipped
- empty texts skipped
- non-main namespace skipped
- disambiguation skipped

Deduplication Behavior

There is currently no explicit post-cleaning deduplication step in the pipeline (for example, no dedup by page_id, title, or text hash).

In practice, deduplication mostly relies on dump structure:

- pages-articles already provides one current revision per page entry
- if two different pages have identical cleaned content, both are kept
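
If duplicates matter downstream, a text-hash pass can be applied to the output rows as a separate post-processing step. The sketch below is not part of the pipeline; dedup_by_text_hash is a hypothetical helper that assumes each row is a dict with a text field, as in the output schema, and keeps the first row seen for each distinct text.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_by_text_hash(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows whose cleaned text has not been seen before (hypothetical helper)."""
    seen = set()
    for row in rows:
        # Hash the text so only fixed-size digests are held in memory.
        digest = hashlib.sha256(row["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield row
```

Hashing the exact text only catches byte-identical duplicates; near-duplicates would need a different technique.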

Output Files

- data/ptwiki_articles1.jsonl
- data/ptwiki_articles1.parquet
- shard files during parquet generation:
  - data/ptwiki_articles1_part_00001.parquet, data/ptwiki_articles1_part_00002.parquet, ...

Parquet writing is batched, then shards are merged into data/ptwiki_articles1.parquet.

Schema

Each row contains:

- text: cleaned plain text
- title: page title
- page_id: page ID from XML
- ns: namespace ID
- section_texts: list of cleaned section texts (lead included)
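
A consumer of the JSONL output might check each line against this schema before using it. The sketch below is an assumption-laden example: parse_row is a hypothetical helper, it only checks that the five fields are present and that section_texts is a list, and it makes no claim about how page_id and ns are typed in the real files.

```python
import json

REQUIRED_FIELDS = ("text", "title", "page_id", "ns", "section_texts")


def parse_row(line: str) -> dict:
    """Parse one JSONL line and check it has the documented fields (hypothetical helper)."""
    row = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(row["section_texts"], list):
        raise ValueError("section_texts must be a list")
    return row


# Illustrative row with made-up values, not taken from the real dataset.
sample = json.dumps(
    {
        "text": "Exemplo de texto.",
        "title": "Exemplo",
        "page_id": 1,
        "ns": 0,
        "section_texts": ["Exemplo de texto."],
    },
    ensure_ascii=False,
)
row = parse_row(sample)
print(row["title"], len(row["section_texts"]))
```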

Filters and Flags

| Behavior | Default | Flag to change |
| --- | --- | --- |
| Keep only main namespace (`ns == 0`) | Enabled | `--include-non-main` |
| Remove disambiguation pages | Enabled | `--include-disambiguation` |
| Remove redirect pages | Disabled | `--skip-redirects` |
| Remove list-like lines from text | Enabled | `--keep-lists` |
| Minimum section length | `1` char | `--min-section-chars <int>` |
| Max processed pages | Unlimited | `--max-pages <int>` |
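
One way to read the table is as an argparse definition in which every boolean flag defaults to off and switches the default behavior when passed. The sketch below only mirrors the flag names and defaults listed above; build_parser is a hypothetical function, and the real CLI in wiki_pt_extract.cli may be defined differently and may accept other options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="wiki_pt_extract")
    parser.add_argument("--include-non-main", action="store_true",
                        help="keep pages outside the main namespace")
    parser.add_argument("--include-disambiguation", action="store_true",
                        help="keep disambiguation pages")
    parser.add_argument("--skip-redirects", action="store_true",
                        help="drop redirect pages")
    parser.add_argument("--keep-lists", action="store_true",
                        help="keep list-like lines in the text")
    parser.add_argument("--min-section-chars", type=int, default=1,
                        help="minimum section length in characters")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages (default: unlimited)")
    return parser


# Equivalent to the "Quick subset" invocation.
args = build_parser().parse_args(["--max-pages", "2000", "--skip-redirects"])
print(args.max_pages, args.skip_redirects, args.min_section_chars)
```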

Publish to Hugging Face

Dataset name: wikipedia-pt-br-extract

Generate outputs:

python -m wiki_pt_extract.cli --skip-redirects

Upload:

pip install -r requirements.txt
python scripts/publish_hf_dataset.py --repo wikipedia-pt-br-extract

Upload target: https://huggingface.co/datasets/<your-username>/wikipedia-pt-br-extract

Token Counting (Qwen3 tokenizer)

JSONL:

python scripts/count_tokens.py --input data/ptwiki_articles1.jsonl --format jsonl

Parquet:

python scripts/count_tokens.py --input data/ptwiki_articles1.parquet --format parquet
