Wikipedia Good and Featured Article Extraction

Dataset is on the HuggingFace Hub: ucrelnlp/wikipedia-ga-fa-ids

This project extracts the Wikipedia article IDs and titles of articles that are categorised as Good Articles (GA) or Featured Articles (FA), for a single Wikipedia language site or a list of them. The data comes from Wikipedia/Wikimedia SQL table dumps. Each dump is a snapshot at a specific date/time; Wikimedia dumps data monthly and only keeps the last 3-4 months of dumps publicly available.

The repository covers how to download the SQL dumps and then create/extract the final JSONL data for a given set of languages and a given Wikimedia data dump timestamp; this data is available on HuggingFace at ucrelnlp/wikipedia-ga-fa-ids. It also contains a set of Python scripts that extract dataset statistics from the JSONL data. Before running any of the scripts please read the setup guide.

Good and Featured Articles definition

All good and featured articles have been assigned their rating by an editor. The "Featured" and "Good" ratings can be defined differently for each Wikipedia language site, as stated in the English site definition within the following article. The number of "Good Articles" (GA) and "Featured Articles" (FA) per language can be found at their respective links. Treat these numbers as a guide only, as they change over time when articles are promoted to good or featured, demoted to good, or stripped of either rating.

Setup

You can either use the dev container with your favourite editor, e.g. VSCode, or create your setup locally. Both options are described below.

Both setups use the same tools:

  • uv for Python packaging and development
  • docker for running the MariaDB SQL database that holds the Wikipedia SQL data.
  • libmariadb3 and libmariadb-dev libraries -- MariaDB database connector for the SQL database.
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
  • yq - this reads YAML files from the command line and within a bash script.

Dev Container

Note: this might not work on non-x86 architecture computers such as Apple Silicon machines; this limitation is specified here.

A dev container uses a docker container to create the required development environment. The Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires docker to be installed. You can also run it in a cloud-based code editor; for a list of supported editors/cloud editors see the following webpage.

To run for the first time in a local VSCode editor (a slightly more detailed guide is available on the VSCode website):

  1. Ensure docker is running.
  2. Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
  3. Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container.

You should now have everything you need to develop: uv, make, and for VSCode various extensions such as Pylance.

If you have any trouble see the VSCode website.

Local

To run locally, first ensure you have the following tools installed:

  • uv for Python packaging and development. (version 0.9.6)
  • docker
  • libmariadb3 and libmariadb-dev libraries -- MariaDB database connector for the SQL database.
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
    • Ubuntu: apt-get install make
    • Mac: Xcode command line tools includes make else you can use brew.
    • Windows: Various solutions are proposed in this blog post on how to install on Windows, including Cygwin and Windows Subsystem for Linux.
  • yq - please follow the instructions in the README of that project.

When developing on the project you will want to install all of the Python packages, this can be done like so:

uv sync

Linting

Linting and formatting are done with ruff, a replacement for tools like Flake8, isort, and Black, and we use ty for type checking.

To run the linting:

make lint

Tests

For the Wikipedia data dump of 2025-12-01 and the following languages: English, Dutch, Spanish, Danish, Korean, Italian, Portuguese, Chinese, and Finnish, we have a test that checks that an expected Wikipedia article title is in the data and assigned to the correct good or featured category:

uv run scripts/find_title.py run-tests ./tests/find_title_test_data.jsonl ./data/wikipedia_fa_ga/

Download Wikipedia SQL Dumps

To download the SQL data dumps for one language, in this case English (en), run the command below. It downloads the tables needed to get the relevant metadata for Good and Featured articles from the dump with timestamp 2025-12-01 (passed to the script as 20251201, i.e. YYYYMMDD) and saves the 3 gzipped SQL tables to ./data/wikipedia_sql_tables/en. The last argument, 0, means do not overwrite any existing data:

bash scripts/download_meta_data.sh en 20251201 ./data/wikipedia_sql_tables/en 0
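
As a rough illustration of what is being downloaded, the sketch below builds the URLs of the three gzipped tables for one language and dump date. It assumes the usual layout of the public Wikimedia dump server (dumps.wikimedia.org); the helper name dump_urls is hypothetical, and download_meta_data.sh may construct its URLs differently.

```python
# Assumption: the standard public Wikimedia dump layout. This is not taken
# from download_meta_data.sh, which may differ.
DUMP_HOST = "https://dumps.wikimedia.org"
TABLES = ("page", "categorylinks", "linktarget")


def dump_urls(wikipedia_code: str, dump_date: str) -> list[str]:
    """Return the three gzipped SQL table URLs for one language and dump date.

    dump_date uses the YYYYMMDD form, e.g. "20251201".
    """
    # Wiki database names use underscores where the language code has hyphens.
    wiki = wikipedia_code.replace("-", "_") + "wiki"
    return [
        f"{DUMP_HOST}/{wiki}/{dump_date}/{wiki}-{dump_date}-{table}.sql.gz"
        for table in TABLES
    ]
```

Calling dump_urls("en", "20251201") yields one URL per table, in the order page, categorylinks, linktarget.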

To get the metadata on which articles are "Good" or "Featured" we use a combination of 3 SQL tables and the subset of their fields that is relevant:

  • Page - Stores all of the information for each page in MediaWiki (Including the main Wikipedia site).
    • page_id - The unique ID for each MediaWiki page.
    • page_title - The title of the page, with spaces replaced by _ and a maximum length of 255 bytes. We only want Main/Article Wikipedia pages, as defined by the Wikipedia namespace categorisation system; the title may start with an optional :, which we remove if it exists.
    • page_namespace - The Wikipedia namespace of the page. We are only interested in Main/Article Wikipedia pages, which are in namespace 0.
  • Category Links - This table in essence links Pages to the categories that are stored in Link Target.
    • cl_from - The unique ID for each MediaWiki page.
    • cl_target_id - The unique ID for each category in Link Target.
  • Link Target
    • lt_id - The unique ID for each category.
    • lt_namespace - The Wikipedia namespace of the category. We are interested in namespace 14, which denotes Category, because Good Articles and Featured Articles are marked through categories.
    • lt_title - The title of the category. In our case we are searching for the Good Articles and Featured Articles categories. These titles are different for each language:
      • For example, Dutch does not have any Good Articles:
        • Wikipedia:Etalage-artikelen - Featured articles category name.

Languages YAML file

To automate the downloading and to store other language-specific metadata we have a languages YAML file, which can be found at ./data/languages.yaml. Part of it is shown below as an example:

languages:
  - language: English
    iso_639_3: eng
    wikipedia_code: en
    fa_article_names:
      - "Featured_articles"
    ga_article_names:
      - "Good_articles"
  - language: Dutch
    iso_639_3: nld
    wikipedia_code: nl
    fa_article_names:
      - "Wikipedia:Etalage-artikelen"

Each language entry contains:

  • language - full language name.
  • iso_639_3 - ISO 639-3 language code.
  • wikipedia_code - Wikipedia language code.
  • fa_article_names -- OPTIONAL list of strings that exactly match the featured article category name(s), used to search the lt_title field within the Link Target SQL table.
  • ga_article_names -- OPTIONAL list of strings that exactly match the good article category name(s), used to search the lt_title field within the Link Target SQL table.
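
To show how the optional fields behave, here is a minimal lookup sketch. It works on a plain dictionary shaped like the YAML excerpt above (what a YAML parser would typically return), so it needs no YAML library; the names EXAMPLE_CONFIG and category_names are hypothetical and not part of the repository.

```python
from typing import Any

# Mirrors the YAML excerpt above, as a YAML parser would typically load it.
EXAMPLE_CONFIG: dict[str, Any] = {
    "languages": [
        {
            "language": "English",
            "iso_639_3": "eng",
            "wikipedia_code": "en",
            "fa_article_names": ["Featured_articles"],
            "ga_article_names": ["Good_articles"],
        },
        {
            "language": "Dutch",
            "iso_639_3": "nld",
            "wikipedia_code": "nl",
            "fa_article_names": ["Wikipedia:Etalage-artikelen"],
        },
    ]
}


def category_names(config: dict[str, Any], wikipedia_code: str) -> dict[str, list[str]]:
    """Return the FA and GA category names for one Wikipedia language code.

    Missing or empty optional fields are treated as "no such category".
    """
    for entry in config["languages"]:
        if entry["wikipedia_code"] == wikipedia_code:
            return {
                "fa": list(entry.get("fa_article_names") or []),
                "ga": list(entry.get("ga_article_names") or []),
            }
    raise KeyError(f"language code not found: {wikipedia_code}")
```

For Dutch, which has no ga_article_names entry, the lookup returns an empty list for "ga".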

Automated downloading

Using the ./data/languages.yaml file we can download all the SQL tables for all of the languages listed in the YAML file like so;

bash ./scripts/download_meta_data_from_config.sh ./data/languages.yaml ./data/wikipedia_sql_tables 20251201 0

For more information;

bash ./scripts/download_meta_data_from_config.sh
Error: Invalid number of arguments should be 4 got 0 .
The 4 should be <LANGUAGE_FILE> <OUTPUT_DIR> <WIKIPEDIA_DUMP_DATE> <OVERWRITE>
Usage: ./scripts/download_meta_data_from_config.sh <LANGUAGE_FILE> <OUTPUT_DIR> <WIKIPEDIA_DUMP_DATE> <OVERWRITE>
This script downloads all of the required meta data to the <OUTPUT_DIR> for each language in the <LANGUAGE_FILE> 
to <OUTPUT_DIR> within the folder <OUTPUT_DIR>/LANGUAGE_CODE i.e. <OUTPUT_DIR>/en. 
The meta data per language download will be 3 SQL tables from the <WIKIPEDIA_DUMP_DATE> timestamp dump: 
* Category Links SQL table: https://www.mediawiki.org/wiki/Manual:Categorylinks_table
* Link Target SQL table: https://www.mediawiki.org/wiki/Manual:Linktarget_table
* Page SQL table: https://www.mediawiki.org/wiki/Manual:Page_table
IF <OVERWRITE> is set to 0 it will not overwrite any of the downloaded data 
if it is set to 1 it will overwrite any of the downloaded data.

Converting SQL dumps into JSONL files

The SQL dumps can be converted into the final JSONL files for each language like so (please read the notes below before running the script). For more information about the JSONL format see the JSONL format section below.

NOTE - this command loads the SQL dumps for the given language into a MariaDB database running via docker. The docker command is tuned for a machine with 64GB RAM and an SSD, so you may want to change some of the innodb settings (some recommended settings for different machine setups can be found at https://github.com/VolkanSah/optimize-MySQL-MariaDB). Once the dumps are loaded, the wikipedia_ga_fa_data_extraction.py script extracts the article IDs that have been assigned a Good or Featured article category into the JSONL format, and the MariaDB docker container is then removed.

Remember to create your own copy of mariadb_env.env. A template of the mariadb env file can be found at mariadb_template_env.env; the database name does not need to be changed, but the password field should be filled in/changed.

bash scripts/convert_to_jsonl.sh da ./data/languages.yaml 3307 ./mariadb_env.env ./data/wikipedia_sql_tables/da ./data/wikipedia_fa_ga/da.jsonl

NOTE - for large Wikipedias like English this can take a long time, between 1-2 hours. For the other Wikipedia languages it takes at most 30 minutes, often a lot less.

For more information;

bash scripts/convert_to_jsonl.sh
Error: Invalid number of arguments should be 6 got 0 .
The 6 should be <LANGUAGE_CODE> <LANGUAGE_FILE> <MARIADB_PORT> <MARIADB_ENV_FILE> <DATA_DIR> <OUTPUT_FILE>
Usage: scripts/convert_to_jsonl.sh <LANGUAGE_CODE> <LANGUAGE_FILE> <MARIADB_PORT> <MARIADB_ENV_FILE> <DATA_DIR> <OUTPUT_FILE> 
This script for the given <LANGUAGE_CODE> will start up a MariaDB docker container hosted on port 
<MARIADB_PORT> using the MARIADB_ROOT_PASSWORD and MARIADB_DATABASE environment variables saved in 
<MARIADB_ENV_FILE>. The <DATA_DIR> should point to the directory that contains the language specific 
SQL tables whereby each SQL table file name should start with the given <LANGUAGE_CODE> in gziped compressed format. 
The SQL tables will be loaded in the MariaDB database and then the 'wikipedia_ga_fa_data_extraction.py' script 
will extract out the Wikipedia article IDs in JSONL format for Good and Featured articles to the 
given <OUTPUT_FILE>. The <LANGUAGE_FILE> is used to find the relevant Featured and Good article category names 
for the given <LANGUAGE_CODE>

Running in parallel

As mentioned, the extraction can take a long time because it is limited by disk/database write speed. You can therefore run the above command for separate languages at the same time, as long as each MariaDB docker container is bound to a different host port. For example, to convert both Danish (da) and Dutch (nl) at the same time, one on port 3307 and the other on 3308:

bash scripts/convert_to_jsonl.sh da ./data/languages.yaml 3307 ./mariadb_env.env ./data/wikipedia_sql_tables/da ./data/wikipedia_fa_ga/da.jsonl
bash scripts/convert_to_jsonl.sh nl ./data/languages.yaml 3308 ./mariadb_env.env ./data/wikipedia_sql_tables/nl ./data/wikipedia_fa_ga/nl.jsonl
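
If you prefer to launch several conversions from one place, the sketch below assigns each language its own host port and runs the commands concurrently. It is an assumption-laden convenience wrapper, not part of the repository: the function names are hypothetical and the paths simply mirror the example commands above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def assign_ports(codes: list[str], first_port: int = 3307) -> dict[str, int]:
    """Give every language code a distinct host port, counting up from first_port."""
    return {code: first_port + offset for offset, code in enumerate(codes)}


def build_command(code: str, port: int) -> list[str]:
    """Build the convert_to_jsonl.sh call for one language (paths as in the examples)."""
    return [
        "bash",
        "scripts/convert_to_jsonl.sh",
        code,
        "./data/languages.yaml",
        str(port),
        "./mariadb_env.env",
        f"./data/wikipedia_sql_tables/{code}",
        f"./data/wikipedia_fa_ga/{code}.jsonl",
    ]


def run_all(codes: list[str], max_workers: int = 2) -> dict[str, int]:
    """Run one conversion per language concurrently and return the exit codes."""
    ports = assign_ports(codes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            code: pool.submit(subprocess.run, build_command(code, port), check=False)
            for code, port in ports.items()
        }
        return {code: future.result().returncode for code, future in futures.items()}
```

Calling run_all(["da", "nl"]) from the repository root would start the two conversions shown above. Keep max_workers modest, since each run starts its own MariaDB container and competes for disk writes.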

JSONL format

Each JSONL line contains the following information for one article. Each line is unique, a page_id value occurs only once in the file, and each article has either ga or fa set to true:

  • page_id - The unique per Wikipedia language site article ID. (INT).
  • page_title - The article title as a text string of at most 255 bytes. It can be truncated if the original title was longer than 255 bytes; a single character can take up to 4 bytes. (STRING).
  • ga - true if the article is a Good Article (GA), otherwise false. (BOOL).
  • fa - true if the article is a Featured Article (FA), otherwise false. (BOOL).

Example of a JSONL file;

{"page_id": 130, "page_title": "Norge", "ga": true, "fa": false}
{"page_id": 167, "page_title": "Sverige", "ga": true, "fa": false}
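
The properties stated above (unique page_id, at least one of ga or fa set) can be checked with a short script. This is a minimal sketch; validate_records is a hypothetical helper and not one of the repository's scripts.

```python
import json
from collections.abc import Iterable


def validate_records(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines and check the documented properties of the format."""
    seen_ids: set[int] = set()
    records: list[dict] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        page_id = record["page_id"]
        if page_id in seen_ids:
            raise ValueError(f"line {line_number}: duplicate page_id {page_id}")
        if not (record["ga"] or record["fa"]):
            raise ValueError(f"line {line_number}: neither ga nor fa is true")
        seen_ids.add(page_id)
        records.append(record)
    return records
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, e.g. validate_records(open("./data/wikipedia_fa_ga/da.jsonl", encoding="utf-8")).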

Data scripts and statistics

All of these scripts assume you have already downloaded the data.

Search data by title

If you know an article title that should be in the data you can search for it with the title name and it will return the data entry;

uv run scripts/find_title.py main ./data/wikipedia_fa_ga/en.jsonl "House of Lancaster"
Found: {'page_id': 89442, 'page_title': 'House of Lancaster', 'ga': True, 'fa': False}
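
The same kind of lookup can be sketched in a few lines of Python. This is an illustrative stand-in, not the code of find_title.py: the function name and the exact-match behaviour are assumptions, and the real script may normalise titles differently.

```python
import json
from collections.abc import Iterable
from typing import Optional


def find_title(lines: Iterable[str], title: str, only: Optional[str] = None) -> Optional[dict]:
    """Return the first record whose page_title equals title, or None.

    lines is consumed lazily, so passing an open file never loads it whole.
    only may be "ga" or "fa" to restrict the match to that rating.
    """
    if only not in (None, "ga", "fa"):
        raise ValueError("only must be None, 'ga' or 'fa'")
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record["page_title"] != title:
            continue
        if only is not None and not record[only]:
            continue
        return record
    return None
```

With the two example lines from the JSONL format section, searching for "Norge" restricted to "ga" returns the record with page_id 130, while restricting to "fa" returns None.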

For additional help;

uv run scripts/find_title.py main --help

 Usage: find_title.py main [OPTIONS] JSONL_FILE TITLE                                                                                                                                       
                                                                                                                                                                                            
 Search a Wikipedia FA/GA JSONL file for a page title.                                                                                                                                      
                                                                                                                                                                                            
 Reads the file line-by-line so the entire file is never loaded into memory.                                                                                                                
 Optionally restrict the search to only Good Articles (GA) or Featured Articles (FA) entries.                                                                                               
                                                                                                                                                                                            
 Usage:                                                                                                                                                                                     
     python find_title.py data/wikipedia_fa_ga/da.jsonl "København"                                                                                                                         
     python find_title.py data/wikipedia_fa_ga/da.jsonl "København" --filter fa                                                                                                             
     python find_title.py data/wikipedia_fa_ga/da.jsonl "Norge" -f ga                                                                                                                       
     python find_title.py --help                                                                                                                                                            
                                                                                                                                                                                            
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    jsonl_file      PATH  Path to the JSONL file to search. [required]                                                                                                                  │
│ *    title           TEXT  Page title to search for. [required]                                                                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --filter  -f      [ga|fa]  Restrict search to 'ga' (good articles) or 'fa' (featured articles).                                                                                          │
│ --help                     Show this message and exit.                                                                                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Dataset Statistics

If you have all of the JSONL data saved in one directory, in this case ./data/wikipedia_fa_ga/ you can run the following to generate the number of articles that are either Good or Featured for each language.

uv run scripts/language_stats_table.py ./data/wikipedia_fa_ga/ ./data/languages.yaml
Language    Code  Total   GA      FA
Chinese     zh    4,367   3,339   1,028
Danish      da    187     170     17
Dutch       nl    380     0       380
English     en    49,845  43,023  6,822
Finnish     fi    869     516     353
Italian     it    1,162   573     589
Korean      ko    384     240     144
Portuguese  pt    3,488   1,955   1,533
Spanish     es    4,710   3,368   1,342

The ./data/languages.yaml file is only used to map Wikipedia language codes to the full language name, see the Languages YAML file section above to understand the format of this YAML file.
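
For a single JSONL file, the counts in one row of the table can be reproduced with a sketch like the one below. The helper names are hypothetical and the row layout is simplified to space-separated values; language_stats_table.py may compute and format its output differently.

```python
import json
from collections.abc import Iterable


def count_ratings(lines: Iterable[str]) -> dict[str, int]:
    """Count total, GA and FA records in the lines of one JSONL file."""
    counts = {"total": 0, "ga": 0, "fa": 0}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        counts["total"] += 1
        counts["ga"] += int(bool(record["ga"]))
        counts["fa"] += int(bool(record["fa"]))
    return counts


def format_row(language: str, code: str, counts: dict[str, int]) -> str:
    """Format one table row, using thousands separators for the counts."""
    return f"{language} {code} {counts['total']:,} {counts['ga']:,} {counts['fa']:,}"
```

For example, format_row("Danish", "da", {"total": 187, "ga": 170, "fa": 17}) gives the values of the Danish row above.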

For more help;

uv run scripts/language_stats_table.py --help
                                                                                                                                                                                            
 Usage: language_stats_table.py [OPTIONS] DATA_DIR LANGUAGES_YAML                                                                                                                           
                                                                                                                                                                                            
 Prints a per-language table of Total, GA, and FA article counts.                                                                                                                           
                                                                                                                                                                                            
 Reads all JSONL files under the `data_dir`, counts total, GA, and FA                                                                                                                       
 articles for each language, and prints a formatted table.                                                                                                                                  
                                                                                                                                                                                            
 The `languages_yaml` file is required to map Wikipedia codes to language names.                                                                                                            
                                                                                                                                                                                            
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    data_dir            PATH  Directory containing per-language JSONL files. [required]                                                                                                 │
│ *    languages_yaml      PATH  Path to the languages.yaml file for language name lookup. [required]                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                                              │
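╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯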

Uploading Dataset to HuggingFace

Before uploading to the HuggingFace hub please authenticate using a token from huggingface.co/settings/tokens:

hf auth login

Then to upload;

uv run scripts/upload_to_hugging_hub.py ./data/huggingface_data_readme.md ./data/wikipedia_fa_ga/ data/languages.yaml 2025-12-01 main ucrelnlp/wikipedia-ga-fa-ids

Help information:

uv run scripts/upload_to_hugging_hub.py --help
                                                                                                                                                                                                                  
 Usage: upload_to_hugging_hub.py [OPTIONS] DATASET_TEMPLATE_PATH DATASET_FOLDER                                                                                                                                   
                                 LANGUAGES_YAML_FILE WIKIPEDIA_TIMESTAMP                                                                                                                                          
                                 REVISION REPO_ID                                                                                                                                                                 
                                                                                                                                                                                                                  
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    dataset_template_path      PATH  The file path to the dataset README template. [required]                                                                                                                 │
│ *    dataset_folder             PATH  Directory that contains all of the JSONL data files to upload [required]                                                                                                 │
│ *    languages_yaml_file        PATH  The file path to the languages.yaml file. [required]                                                                                                                     │
│ *    wikipedia_timestamp        TEXT  The timestamp of the Wikipedia data dump in the following format (YYYY-MM-DD), i.e. 2025-12-01 [required]                                                                │
│ *    revision                   TEXT  The revision/tag to push the dataset too, i.e. main [required]                                                                                                           │
│ *    repo_id                    TEXT  The Hugging Face repo id to push the dataset too, i.e. ucrelnlp/wikipedia-ga-fa-ids [required]                                                                           │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

If you upload a new data dump timestamp, please upload it to a revision named after the timestamp, like so (if it is the latest timestamp, please also upload it to the main revision):

uv run scripts/upload_to_hugging_hub.py ./data/huggingface_data_readme.md ./data/wikipedia_fa_ga/ data/languages.yaml 2025-12-01 2025-12-01 ucrelnlp/wikipedia-ga-fa-ids

License

The code is licensed under Apache License Version 2.0.