This project downloads the latest Portuguese Wikipedia XML dump and extracts filtered, cleaned article text into JSONL and Parquet for downstream NLP/LLM workflows.
Source dump:
https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2
Setup:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Quick subset:

    python -m wiki_pt_extract.cli --max-pages 2000 --skip-redirects

Full extraction:

    python -m wiki_pt_extract.cli --skip-redirects

The program creates `data/` and `data/raw/` automatically.

Filtering happens in build_filtered_rows() and is applied in this order:
- Read each `<page>` from the compressed XML dump (`iter_pages_from_bz2`), extracting:
  - `title`
  - `page_id`
  - `ns` (namespace)
  - latest revision `text`
- Stop early if `--max-pages` is reached.
- Namespace filter:
  - default: keep only main namespace (`ns == 0`)
  - use `--include-non-main` to keep other namespaces
- Text normalization:
  - normalize line breaks (`\r\n`/`\r` -> `\n`)
  - trim leading/trailing whitespace
- Wikitext cleaning (`_clean_wikitext`):
  - optionally remove list-like lines (`*`, `#`, `;`, `:`) unless `--keep-lists` is set
  - remove media/file wikilinks (`File:`, `Ficheiro:`, `Imagem:`, `Image:`)
  - remove all templates
  - remove noisy tags: `ref`, `references`, `table`, `gallery`, `math`, `code`, `syntaxhighlight`, `timeline`, `pre`, `source`
  - split into sections (lead included), strip markup to plain text, and apply regex cleanup for media/options leftovers, empty brackets, and excess whitespace
  - keep only sections with at least `--min-section-chars` characters (default: `1`)
  - join kept sections into final `text`, and also store per-section output in `section_texts`
- Empty-content removal:
  - if cleaned `text` is empty, the page is dropped
- Redirect filtering:
  - applied only when `--skip-redirects` is set
  - drops pages whose cleaned text starts with `#redirect` or `#redirecionamento` (case-insensitive)
- Disambiguation filtering:
  - default: drop disambiguation pages
  - current implementation checks title pattern like `(desambiguação)` and also applies a `{{desambigua...}}` regex over the current `text` value
  - use `--include-disambiguation` to keep them
- Remaining pages are written to JSONL and Parquet.

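
The first step in the list (streaming `<page>` elements out of the compressed dump) can be illustrated with the standard library alone. This is a minimal sketch, not the project's `iter_pages_from_bz2`: the function name `iter_pages_sketch`, the helper `_local`, and the dict layout are assumptions made for illustration.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator


def _local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages_sketch(path: str) -> Iterator[dict]:
    """Yield one dict per <page> while decompressing and parsing incrementally."""
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if _local(elem.tag) != "page":
                continue
            row = {"title": "", "page_id": None, "ns": None, "text": ""}
            for child in elem:
                name = _local(child.tag)
                if name == "title":
                    row["title"] = child.text or ""
                elif name == "ns":
                    row["ns"] = int(child.text)
                elif name == "id":
                    # Direct child of <page>, so this is the page ID,
                    # not the revision ID nested under <revision>.
                    row["page_id"] = int(child.text)
                elif name == "revision":
                    for sub in child:
                        if _local(sub.tag) == "text":
                            row["text"] = sub.text or ""
            yield row
            # Drop the finished page's children and text to limit memory use.
            elem.clear()
```

Because the generator yields pages one at a time, a caller can stop after a fixed number of pages (the behavior `--max-pages` describes) without reading the rest of the file.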
At the end, the CLI prints counters for:
- pages seen
- pages written
- redirects skipped
- empty texts skipped
- non-main namespace skipped
- disambiguation skipped
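
The relationship between the filter order and these counters can be sketched as follows. This is an illustrative approximation, not the code in `build_filtered_rows()`: the names `skip_reason` and `tally` and the counter keys are hypothetical, the input text is assumed to be already normalized and cleaned, and the disambiguation check covers only the title pattern (the template regex is omitted).

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Assumed patterns, based on the descriptions in this README.
REDIRECT_RE = re.compile(r"^#(redirect|redirecionamento)", re.IGNORECASE)
DISAMBIG_TITLE_RE = re.compile(r"\(desambiguação\)", re.IGNORECASE)


def skip_reason(
    title: str,
    ns: int,
    cleaned_text: str,
    *,
    include_non_main: bool = False,
    skip_redirects: bool = False,
    include_disambiguation: bool = False,
) -> Optional[str]:
    """Return why a page is dropped, or None if it is kept."""
    if not include_non_main and ns != 0:
        return "non_main_namespace_skipped"
    if not cleaned_text.strip():
        return "empty_texts_skipped"
    if skip_redirects and REDIRECT_RE.match(cleaned_text):
        return "redirects_skipped"
    if not include_disambiguation and DISAMBIG_TITLE_RE.search(title):
        return "disambiguation_skipped"
    return None


def tally(pages: Iterable[Tuple[str, int, str]], **flags: bool) -> Counter:
    """Count pages seen, pages written, and each skip reason."""
    counts: Counter = Counter()
    for title, ns, text in pages:
        counts["pages_seen"] += 1
        counts[skip_reason(title, ns, text, **flags) or "pages_written"] += 1
    return counts
```

Each page increments exactly one outcome counter, so the skip counters plus `pages_written` add up to `pages_seen` in this sketch.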
There is currently no explicit post-cleaning deduplication step in the pipeline (for example, no dedup by page_id, title, or text hash).
In practice, deduplication mostly relies on dump structure:
- `pages-articles` already provides one current revision per page entry
- if two different pages have identical cleaned content, both are kept

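
Since the pipeline itself does not deduplicate, a downstream consumer that needs unique texts has to add that step. The following is a minimal sketch of one option, exact-match dedup on a hash of `text`; the function name `dedup_by_text_hash` is hypothetical and nothing like it exists in the project.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_by_text_hash(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first row for each distinct `text` value."""
    seen = set()
    for row in rows:
        # Hash the text so the set stores 32 bytes per row, not the full article.
        digest = hashlib.sha256(row["text"].encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield row
```

This only catches byte-identical texts. Near-duplicates (for example, pages that differ by whitespace or a single sentence) would need a different technique.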

Output files:
- `data/ptwiki_articles1.jsonl`
- `data/ptwiki_articles1.parquet`
- shard files during parquet generation: `data/ptwiki_articles1_part_00001.parquet`, `data/ptwiki_articles1_part_00002.parquet`, ...

Parquet writing is batched, then shards are merged into data/ptwiki_articles1.parquet.
Each row contains:
- `text`: cleaned plain text
- `title`: page title
- `page_id`: page ID from XML
- `ns`: namespace ID
- `section_texts`: list of cleaned section texts (lead included)

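
For consumers of the JSONL output, a small reader that checks for these five fields can catch schema drift early. This is a sketch under the assumption that the file holds one JSON object per line; `iter_rows` and `EXPECTED_FIELDS` are hypothetical names, not part of the project.

```python
import json
from typing import Iterator

# Field names taken from the row description above.
EXPECTED_FIELDS = {"text", "title", "page_id", "ns", "section_texts"}


def iter_rows(path: str) -> Iterator[dict]:
    """Yield rows from a JSONL file, failing fast on a missing field."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = EXPECTED_FIELDS - set(row)
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            yield row
```

A typical call would be `for row in iter_rows("data/ptwiki_articles1.jsonl"): ...`, which streams rows without loading the whole file.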
| Behavior | Default | Flag to change |
|---|---|---|
| Keep only main namespace (`ns == 0`) | Enabled | `--include-non-main` |
| Remove disambiguation pages | Enabled | `--include-disambiguation` |
| Remove redirect pages | Disabled | `--skip-redirects` |
| Remove list-like lines from text | Enabled | `--keep-lists` |
| Minimum section length | 1 char | `--min-section-chars <int>` |
| Max processed pages | Unlimited | `--max-pages <int>` |

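
The defaults in the table map naturally onto boolean switches plus two integer options. The sketch below shows one way to express that with `argparse`; it is an assumption for illustration only. The flag names come from the table, but the actual CLI may be built differently and may accept options not listed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags and defaults in the table."""
    parser = argparse.ArgumentParser(prog="wiki_pt_extract.cli")
    # Each switch is off by default, which yields the "Default" column.
    parser.add_argument("--include-non-main", action="store_true",
                        help="keep pages outside the main namespace (ns != 0)")
    parser.add_argument("--include-disambiguation", action="store_true",
                        help="keep disambiguation pages")
    parser.add_argument("--skip-redirects", action="store_true",
                        help="drop redirect pages")
    parser.add_argument("--keep-lists", action="store_true",
                        help="keep list-like lines in the text")
    parser.add_argument("--min-section-chars", type=int, default=1,
                        help="minimum characters for a section to be kept")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages (default: unlimited)")
    return parser
```

With this layout, `build_parser().parse_args(["--max-pages", "2000", "--skip-redirects"])` corresponds to the quick-subset invocation shown earlier.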
Dataset name: wikipedia-pt-br-extract
- Generate outputs:

      python -m wiki_pt_extract.cli --skip-redirects

- Upload:

      pip install -r requirements.txt
      python scripts/publish_hf_dataset.py --repo wikipedia-pt-br-extract

Upload target:
https://huggingface.co/datasets/<your-username>/wikipedia-pt-br-extract

JSONL:

    python scripts/count_tokens.py --input data/ptwiki_articles1.jsonl --format jsonl

Parquet:

    python scripts/count_tokens.py --input data/ptwiki_articles1.parquet --format parquet