costadev00/wikipedia-dump

Portuguese Wikipedia Dump Extraction

This project downloads the latest Portuguese Wikipedia XML dump and extracts filtered, cleaned article text into JSONL and Parquet for downstream NLP/LLM workflows.

Source dump:

  • https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

Quick subset:

python -m wiki_pt_extract.cli --max-pages 2000 --skip-redirects

Full extraction:

python -m wiki_pt_extract.cli --skip-redirects

The program creates data/ and data/raw/ automatically.

Filtering Pipeline (Execution Order)

Filtering happens in build_filtered_rows() and is applied in this order:

  1. Read each <page> from the compressed XML dump (iter_pages_from_bz2), extracting:
    • title
    • page_id
    • ns (namespace)
    • latest revision text
  2. Stop early if --max-pages is reached.
  3. Namespace filter:
    • default: keep only main namespace (ns == 0)
    • use --include-non-main to keep other namespaces
  4. Text normalization:
    • normalize line breaks (\r\n / \r -> \n)
    • trim leading/trailing whitespace
  5. Wikitext cleaning (_clean_wikitext):
    • remove list-like lines (*, #, ;, :) by default; pass --keep-lists to keep them
    • remove media/file wikilinks (File:, Ficheiro:, Imagem:, Image:)
    • remove all templates
    • remove noisy tags: ref, references, table, gallery, math, code, syntaxhighlight, timeline, pre, source
    • split into sections (lead included), strip markup to plain text, and apply regex cleanup for media/options leftovers, empty brackets, and excess whitespace
    • keep only sections with at least --min-section-chars characters (default: 1)
    • join kept sections into final text, and also store per-section output in section_texts
  6. Empty-content removal:
    • if cleaned text is empty, the page is dropped
  7. Redirect filtering:
    • applied only when --skip-redirects is set
    • drops pages whose cleaned text starts with #redirect or #redirecionamento (case-insensitive)
  8. Disambiguation filtering:
    • default: drop disambiguation pages
    • the current implementation checks the title for a pattern like (desambiguação) and also runs a {{desambigua...}} regex over the current text value
    • use --include-disambiguation to keep them
  9. Remaining pages are written to JSONL and Parquet.
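
Step 1 can be sketched with the standard library alone. The function below is a hypothetical stand-in, not the project's iter_pages_from_bz2 (whose signature is not shown here). It assumes the usual MediaWiki export layout, where each <page> holds <title>, <ns>, <id>, and <revision><text>, and it ignores the XML namespace prefix on tag names.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield one dict per <page> from a bz2-compressed MediaWiki XML stream.

    Hypothetical helper for illustration only.
    """
    with bz2.open(fileobj, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if _local(elem.tag) != "page":
                continue
            page = {"title": None, "page_id": None, "ns": None, "text": ""}
            for child in elem:  # direct children only, so <revision><id> is not picked up
                name = _local(child.tag)
                if name == "title":
                    page["title"] = child.text
                elif name == "ns":
                    page["ns"] = int(child.text)
                elif name == "id":
                    page["page_id"] = int(child.text)
                elif name == "revision":
                    for sub in child:
                        if _local(sub.tag) == "text":
                            page["text"] = sub.text or ""
            yield page
            elem.clear()  # release the page's children before moving on


# Usage on a tiny in-memory dump (made-up content):
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Lisboa</title><ns>0</ns><id>1</id>"
    b"<revision><id>10</id><text>Lisboa e a capital.</text></revision></page>"
    b"</mediawiki>"
)
for page in iter_pages(io.BytesIO(bz2.compress(SAMPLE))):
    print(page["title"], page["page_id"], page["ns"])
```

Streaming with iterparse, rather than loading the whole tree, matters here because the decompressed dump is far larger than the compressed download.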

At the end, the CLI prints counters for:

  • pages seen
  • pages written
  • redirects skipped
  • empty texts skipped
  • non-main namespace skipped
  • disambiguation skipped

Deduplication Behavior

There is currently no explicit post-cleaning deduplication step in the pipeline (for example, no dedup by page_id, title, or text hash).

In practice, uniqueness comes mostly from the structure of the dump itself:

  • pages-articles already provides one current revision per page entry
  • if two different pages have identical cleaned content, both are kept
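
If exact-duplicate texts matter downstream, a post-processing pass over the output rows is one option. The sketch below is not part of the pipeline; dedup_by_text is a hypothetical helper that keeps the first row for each distinct text, comparing SHA-256 digests of the stripped text field.

```python
import hashlib


def dedup_by_text(rows):
    """Yield rows whose stripped text has not been seen before (first one wins)."""
    seen = set()
    for row in rows:
        digest = hashlib.sha256(row["text"].strip().encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield row


rows = [
    {"page_id": 1, "title": "A", "text": "same body"},
    {"page_id": 2, "title": "B", "text": "same body"},
    {"page_id": 3, "title": "C", "text": "other body"},
]
unique = list(dedup_by_text(rows))
```

Storing digests instead of full texts keeps memory bounded by the number of rows rather than by their total length. Near-duplicates (texts that differ by a few characters) are not detected by this approach.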

Output Files

  • data/ptwiki_articles1.jsonl
  • data/ptwiki_articles1.parquet
  • shard files during parquet generation:
    • data/ptwiki_articles1_part_00001.parquet, data/ptwiki_articles1_part_00002.parquet, ...

Parquet writing is batched, then shards are merged into data/ptwiki_articles1.parquet.
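
The batching and shard naming can be sketched without any Parquet library. The helper below is hypothetical, and the batch size is an arbitrary illustrative value; it only groups rows and derives shard paths in the _part_00001 style listed above. Writing and merging the shards is left out because it depends on the Parquet library the project uses.

```python
from itertools import islice
from pathlib import Path


def iter_shards(rows, batch_size, final_path):
    """Yield (shard_path, batch) pairs, numbering shards from 1."""
    final = Path(final_path)
    iterator = iter(rows)
    index = 1
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        shard = final.with_name(f"{final.stem}_part_{index:05d}{final.suffix}")
        yield shard, batch
        index += 1


for shard_path, batch in iter_shards(range(5), 2, "data/ptwiki_articles1.parquet"):
    print(shard_path.name, len(batch))
```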

Schema

Each row contains:

  • text: cleaned plain text
  • title: page title
  • page_id: page ID from XML
  • ns: namespace ID
  • section_texts: list of cleaned section texts (lead included)
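
A consumer can check rows against this schema while reading the JSONL output. This is a sketch under assumptions: the Python types below (for example, page_id and ns as integers) are inferred from the field descriptions rather than confirmed by the source, and iter_rows is a hypothetical helper.

```python
import io
import json

# Assumed types, inferred from the field descriptions above.
EXPECTED_FIELDS = {
    "text": str,
    "title": str,
    "page_id": int,
    "ns": int,
    "section_texts": list,
}


def iter_rows(lines):
    """Parse JSONL lines and raise ValueError on a missing or mistyped field."""
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        for field, kind in EXPECTED_FIELDS.items():
            if not isinstance(row.get(field), kind):
                raise ValueError(
                    f"line {line_no}: field {field!r} is missing or not {kind.__name__}"
                )
        yield row


sample = io.StringIO(
    json.dumps({"text": "Lead.\n\nHistory.", "title": "Lisboa", "page_id": 1,
                "ns": 0, "section_texts": ["Lead.", "History."]}) + "\n"
)
rows = list(iter_rows(sample))
```

For the real file, pass an open handle instead, for example open("data/ptwiki_articles1.jsonl", encoding="utf-8").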

Filters and Flags

| Behavior | Default | Flag to change |
| --- | --- | --- |
| Keep only main namespace (ns == 0) | Enabled | --include-non-main |
| Remove disambiguation pages | Enabled | --include-disambiguation |
| Remove redirect pages | Disabled | --skip-redirects |
| Remove list-like lines from text | Enabled | --keep-lists |
| Minimum section length | 1 char | --min-section-chars <int> |
| Max processed pages | Unlimited | --max-pages <int> |
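
The flags and defaults above map naturally onto an argparse parser. The sketch below is an assumption about how such a CLI could be declared, not a copy of wiki_pt_extract.cli; only the flag names and default values come from the table.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="wiki_pt_extract.cli")
    parser.add_argument("--include-non-main", action="store_true",
                        help="keep pages outside the main namespace (ns != 0)")
    parser.add_argument("--include-disambiguation", action="store_true",
                        help="keep disambiguation pages")
    parser.add_argument("--skip-redirects", action="store_true",
                        help="drop redirect pages")
    parser.add_argument("--keep-lists", action="store_true",
                        help="keep list-like lines in the text")
    parser.add_argument("--min-section-chars", type=int, default=1,
                        help="minimum characters for a section to be kept")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages (default: unlimited)")
    return parser


args = build_parser().parse_args(["--max-pages", "2000", "--skip-redirects"])
```

Every boolean flag is store_true, so running with no flags reproduces the Default column: main namespace only, disambiguation pages and list-like lines removed, redirects kept.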

Publish to Hugging Face

Dataset name: wikipedia-pt-br-extract

  1. Generate outputs:
python -m wiki_pt_extract.cli --skip-redirects
  2. Upload:
pip install -r requirements.txt
python scripts/publish_hf_dataset.py --repo wikipedia-pt-br-extract

Upload target: https://huggingface.co/datasets/<your-username>/wikipedia-pt-br-extract

Token Counting (Qwen3 tokenizer)

JSONL:

python scripts/count_tokens.py --input data/ptwiki_articles1.jsonl --format jsonl

Parquet:

python scripts/count_tokens.py --input data/ptwiki_articles1.parquet --format parquet
