A step-by-step guide for adding support for a new MediaWiki
action=query submodule to the Wikipedia-API library.
This guide uses two real examples that are already implemented:
- coordinates: a prop= submodule (per-page data, fetched with a page title)
- search: a list= submodule (standalone query, not tied to a specific page)
Read DESIGN.rst and API.rst before starting.
- Understand the Two Submodule Types
- Step 1: Define Typed Data Classes
- Step 2: Define Parameter Dataclass
- Step 3: Add API Parameter Builder
- Step 4: Add Response Parser
- Step 5: Add Sync Public Method
- Step 6: Add Async Public Method
- Step 7: Add Page Properties
- Step 8: Update Page Init
- Step 9: Export from __init__.py
- Step 10: Add Mock Data
- Step 11: Add Tests
- Step 12: Update Documentation
- Step 13: Run Quality Checks
MediaWiki action=query has two families of submodules:
| Type | API shape | Example | Dispatch helper |
|---|---|---|---|
| prop= | Requires titles=, result in raw["query"]["pages"] | coordinates, images | _dispatch_prop or manual _get + iterate pages |
| list= | No titles=, result in raw["query"][list_key] | geosearch, random, search | single _get call (see warning below) |
Choose the right type first — it determines which dispatch helper to use and
whether your public method takes a page parameter.
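As a rough illustration of the two response shapes, the sketch below pulls entries out of hand-written sample dictionaries. The helper names `prop_entries` and `list_entries` are hypothetical and not part of the library; only the `raw["query"]["pages"]` and `raw["query"][list_key]` layouts come from this guide.

```python
from typing import Any


def prop_entries(raw: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the per-page entries of a prop= response (hypothetical helper)."""
    return list(raw.get("query", {}).get("pages", {}).values())


def list_entries(raw: dict[str, Any], list_key: str) -> list[dict[str, Any]]:
    """Return the entries of a list= response (hypothetical helper)."""
    return list(raw.get("query", {}).get(list_key, []))


# Hand-written sample payloads, trimmed to the relevant keys.
prop_raw = {"query": {"pages": {"4": {"pageid": 4, "title": "Test 1"}}}}
list_raw = {"query": {"search": [{"title": "Python (programming language)"}]}}

prop_titles = [entry["title"] for entry in prop_entries(prop_raw)]
list_titles = [entry["title"] for entry in list_entries(list_raw, "search")]
```

A prop= response is keyed by page ID, while a list= response is a plain list under the module's own key, which is why the two families need different parsers.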
If your submodule returns structured data beyond just page titles, create a frozen dataclass to hold it.
# wikipediaapi/_types.py
@dataclass(frozen=True)
class Coordinate:
    """A single geographic coordinate associated with a Wikipedia page."""

    lat: float
    lon: float
    primary: bool
    globe: str = "earth"
    type: str | None = None
    name: str | None = None
    dim: int | None = None
    country: str | None = None
    region: str | None = None
    dist: float | None = None

# wikipediaapi/_types.py
@dataclass(frozen=True)
class SearchMeta:
    """Metadata attached to each page returned by search()."""

    snippet: str = ""
    size: int = 0
    wordcount: int = 0
    timestamp: str = ""

@dataclass
class SearchResults:
    """Wrapper returned by search() combining pages with aggregate info."""

    pages: PagesDict
    totalhits: int = 0
    suggestion: str | None = None

Rules:
- Use frozen=True for value objects (immutable).
- Use plain @dataclass for wrappers that hold mutable containers like PagesDict.
- Map field names from the JSON keys the MediaWiki API returns.
- Provide sensible defaults for optional fields.
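The difference between the two dataclass styles can be seen in a small standalone sketch. `Point` and `Results` are hypothetical stand-ins for `Coordinate` and `SearchResults`, and a plain `dict` is assumed in place of `PagesDict`.

```python
from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class Point:
    """Hypothetical value object, standing in for Coordinate."""

    lat: float
    lon: float
    globe: str = "earth"  # sensible default for an optional field


@dataclass
class Results:
    """Hypothetical wrapper, standing in for SearchResults."""

    pages: dict[str, Point] = field(default_factory=dict)
    totalhits: int = 0


point = Point(lat=51.5074, lon=-0.1278)
try:
    point.lat = 0.0  # frozen dataclasses raise on attribute assignment
    mutation_rejected = False
except FrozenInstanceError:
    mutation_rejected = True

results = Results()
results.pages["London"] = point  # the wrapper's container stays mutable
```

Freezing the value objects keeps cached results safe to share between callers, while the wrapper stays mutable because it is filled in while parsing.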
Each submodule has its own set of API parameters with a module-specific prefix
(e.g. co for coordinates, sr for search). Create a frozen dataclass that
maps clean Python names to prefixed MediaWiki names.
# wikipediaapi/_params.py
@dataclass(frozen=True)
class CoordinatesParams(_BaseParams):
    """Parameters for prop=coordinates (prefix co)."""

    limit: int = 10
    primary: str = "primary"
    prop: str = "globe"
    distance_from_point: str | None = None
    distance_from_page: str | None = None

    PREFIX: ClassVar[str] = "co"
    FIELD_MAP: ClassVar[dict[str, str]] = {
        "limit": "limit",  # → colimit
        "primary": "primary",  # → coprimary
        "prop": "prop",  # → coprop
        "distance_from_point": "distancefrompoint",  # → codistancefrompoint
        "distance_from_page": "distancefrompage",  # → codistancefrompage
    }

@dataclass(frozen=True)
class SearchParams(_BaseParams):
    """Parameters for list=search (prefix sr)."""

    query: str = ""
    namespace: int = 0
    limit: int = 10
    sort: str = "relevance"
    # ... more fields as needed

    PREFIX: ClassVar[str] = "sr"
    FIELD_MAP: ClassVar[dict[str, str]] = {
        "query": "search",  # → srsearch (note: API param name differs!)
        "namespace": "namespace",  # → srnamespace
        "limit": "limit",  # → srlimit
        "sort": "sort",  # → srsort
    }

How it works:
- _BaseParams provides to_api(), which iterates FIELD_MAP and produces {PREFIX + suffix: str(value)} for every non-None field.
- _BaseParams provides cache_key(), which returns a hashable tuple of all field values, used by the per-parameter cache on page objects.
How to find the prefix: Check the MediaWiki API help page for your module.
The prefix is shown in the module header — e.g. prop=coordinates (co) means
prefix co. Every parameter starts with that prefix:
colimit, coprimary, coprop, etc.
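To make the `to_api()` and `cache_key()` behaviour concrete, here is a minimal sketch of a base class that follows the description in this guide. `BaseParamsSketch` and `DemoParams` are hypothetical names, and the real `_BaseParams` may be implemented differently.

```python
from dataclasses import dataclass, fields
from typing import Any, ClassVar, Optional


@dataclass(frozen=True)
class BaseParamsSketch:
    """Illustrative stand-in for _BaseParams; the real class may differ."""

    PREFIX: ClassVar[str] = ""
    FIELD_MAP: ClassVar[dict[str, str]] = {}

    def to_api(self) -> dict[str, str]:
        """Map every non-None field to its prefixed MediaWiki name."""
        return {
            self.PREFIX + suffix: str(getattr(self, name))
            for name, suffix in self.FIELD_MAP.items()
            if getattr(self, name) is not None
        }

    def cache_key(self) -> tuple[Any, ...]:
        """Return a hashable tuple of all field values, in declaration order."""
        return tuple(getattr(self, f.name) for f in fields(self))


@dataclass(frozen=True)
class DemoParams(BaseParamsSketch):
    """Hypothetical params class using the co prefix."""

    limit: int = 10
    distance_from_page: Optional[str] = None

    PREFIX: ClassVar[str] = "co"
    FIELD_MAP: ClassVar[dict[str, str]] = {
        "limit": "limit",
        "distance_from_page": "distancefrompage",
    }


api = DemoParams(limit=5).to_api()
key = DemoParams(limit=5).cache_key()
```

Because `distance_from_page` is `None`, it is left out of `api`, but it still appears in `key`, so two calls that differ only in an optional field get separate cache entries.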
Add a method that builds the raw API params dict. This is shared by both sync and async code paths.
# In BaseWikipediaResource class
def _coordinates_api_params(
    self,
    page: "BaseWikipediaPage[Any]",
    params: CoordinatesParams,
) -> dict[str, Any]:
    """Build API params for prop=coordinates."""
    api_params: dict[str, Any] = {
        "action": "query",
        "prop": "coordinates",  # ← the submodule name
        "titles": page.title,  # ← prop= submodules need a page title
    }
    api_params.update(params.to_api())  # ← merge prefixed params
    return api_params

def _search_api_params(self, params: SearchParams) -> dict[str, Any]:
    """Build API params for list=search."""
    api_params: dict[str, Any] = {
        "action": "query",
        "list": "search",  # ← standalone list, no page title
    }
    api_params.update(params.to_api())
    return api_params

Key difference: prop= builders take a page parameter and include
"titles": page.title. list= builders do not.
Parse the raw JSON response into your typed data classes. This is also shared by sync and async.
For prop= modules, the parser processes a single page entry from
raw["query"]["pages"]:
def _build_coordinates_for_page(
    self,
    extract: dict[str, Any],  # single page entry
    page: "BaseWikipediaPage[Any]",
    params: CoordinatesParams,
) -> list[Coordinate]:
    """Parse coordinates from a single page API response entry."""
    self._common_attributes(extract, page)  # always call this first
    coords: list[Coordinate] = []
    for raw_coord in extract.get("coordinates", []):
        coords.append(
            Coordinate(
                lat=float(raw_coord["lat"]),
                lon=float(raw_coord["lon"]),
                # The API sends "primary": "" for primary coordinates and
                # omits the key otherwise, so test for presence.
                primary="primary" in raw_coord,
                globe=raw_coord.get("globe", "earth"),
                # ... more fields
            )
        )
    # Store in per-parameter cache
    page._set_cached("coordinates", params.cache_key(), coords)
    return coords

For list= modules, the parser processes the entire raw response:
def _build_search_results(self, raw: dict[str, Any]) -> SearchResults:
    """Parse search list results into a SearchResults wrapper."""
    pages = PagesDict(wiki=self)
    raw_query = raw.get("query", {})
    for entry in raw_query.get("search", []):
        p = self._make_page(  # ← creates correct page type
            title=entry["title"],
            ns=int(entry.get("ns", 0)),
            language=self.language,
            variant=self.variant,
        )
        p._attributes["pageid"] = entry.get("pageid", -1)
        # Attach per-result metadata
        p._search_meta = SearchMeta(
            snippet=entry.get("snippet", ""),
            size=int(entry.get("size", 0)),
            wordcount=int(entry.get("wordcount", 0)),
            timestamp=entry.get("timestamp", ""),
        )
        pages[entry["title"]] = p
    # Extract aggregate info
    searchinfo = raw_query.get("searchinfo", {})
    return SearchResults(
        pages=pages,
        totalhits=int(searchinfo.get("totalhits", 0)),
        suggestion=searchinfo.get("suggestion"),
    )

Important patterns:
- Always use self._make_page() to create child pages. This ensures the correct type (WikipediaPage vs AsyncWikipediaPage) is created based on whether we're in a sync or async context.
- Always call self._common_attributes(extract, page) for prop= parsers.
- For list= parsers, pre-set _attributes["pageid"] from the response.
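The reason for routing page creation through `_make_page()` can be shown with a small factory-method sketch. All class names below are hypothetical, and how the library actually wires `_make_page()` is an assumption; the point is only that a shared parser never names a concrete page class.

```python
from typing import Any


class PageSketch:
    """Hypothetical sync page type."""

    def __init__(self, title: str) -> None:
        self.title = title


class AsyncPageSketch(PageSketch):
    """Hypothetical async page type."""


class BaseResourceSketch:
    def _make_page(self, title: str) -> PageSketch:
        raise NotImplementedError  # each client supplies its own page type

    def _build_pages(self, raw: dict[str, Any]) -> dict[str, PageSketch]:
        """Shared parser used by both clients."""
        return {
            entry["title"]: self._make_page(entry["title"])
            for entry in raw.get("query", {}).get("search", [])
        }


class SyncClientSketch(BaseResourceSketch):
    def _make_page(self, title: str) -> PageSketch:
        return PageSketch(title)


class AsyncClientSketch(BaseResourceSketch):
    def _make_page(self, title: str) -> PageSketch:
        return AsyncPageSketch(title)


raw = {"query": {"search": [{"title": "Python (programming language)"}]}}
sync_pages = SyncClientSketch()._build_pages(raw)
async_pages = AsyncClientSketch()._build_pages(raw)
```

If `_build_pages` constructed `PageSketch` directly, the async client would hand back sync pages, which is the failure mode the rule above guards against.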
prop= methods take a page, construct params, check cache, call _get, and
iterate the response:
def coordinates(
    self,
    page: WikipediaPage,
    *,  # ← keyword-only after page
    limit: int = 10,
    primary: str = "primary",
    prop: str = "globe",
    distance_from_point: str | None = None,
    distance_from_page: str | None = None,
) -> list[Coordinate]:
    """Fetch geographic coordinates for a page."""
    # 1. Build params object
    params = CoordinatesParams(
        limit=limit, primary=primary, prop=prop,
        distance_from_point=distance_from_point,
        distance_from_page=distance_from_page,
    )
    # 2. Check per-parameter cache
    cached = page._get_cached("coordinates", params.cache_key())
    if not isinstance(cached, type(NOT_CACHED)):
        return cached
    # 3. Build API params and make request
    api_params = self._coordinates_api_params(page, params)
    raw = self._get(page.language, self._construct_params(page, api_params))
    # 4. Iterate response pages and parse
    self._common_attributes(raw.get("query", {}), page)
    for k, v in raw.get("query", {}).get("pages", {}).items():
        if k == "-1":
            page._attributes["pageid"] = -1
            page._set_cached("coordinates", params.cache_key(), [])
            return []
        return self._build_coordinates_for_page(v, page, params)
    page._set_cached("coordinates", params.cache_key(), [])
    return []

list= methods don't take a page. They make a single _get call:

def search(
    self,
    query: str,
    *,
    namespace: int = 0,
    limit: int = 10,
    sort: str = "relevance",
) -> SearchResults:
    """Search Wikipedia for pages matching a query."""
    # 1. Build params object
    params = SearchParams(query=query, namespace=namespace, limit=limit, sort=sort)
    # 2. Build API params
    api_params = self._search_api_params(params)
    # 3. Single request: the caller's limit controls how many results
    raw = self._get(
        self.language,
        self._construct_params_standalone(api_params),
    )
    # 4. Parse the response
    return self._build_search_results(raw)
⚠️ WARNING: Do NOT use _dispatch_standalone_list for standalone list queries.

_dispatch_standalone_list paginates by looping while the API returns a continue token. This causes infinite or near-infinite loops for standalone list queries:
- random: the API always returns a continue token (there are always more random pages), so the loop never terminates.
- search: broad queries match thousands of pages; the loop would make thousands of API calls before exhausting all results.
- geosearch: densely populated areas can produce very long continuation chains.

The caller's limit parameter already tells the MediaWiki API how many results to return in a single response. Always use a single _get / await self._get call for these methods.

_dispatch_standalone_list exists in the codebase but is currently unused. It should only be used if you genuinely need to exhaust all results from a list query (and even then, add a safeguard).
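One possible shape for such a safeguard is a hard cap on the number of requests. The sketch below is not the library's `_dispatch_standalone_list`; `fetch_all_bounded`, `max_requests`, and `fake_fetch` are hypothetical names, and the fake fetcher imitates an API that always returns a `continue` token.

```python
from typing import Any, Callable


def fetch_all_bounded(
    fetch: Callable[[dict[str, Any]], dict[str, Any]],
    params: dict[str, Any],
    list_key: str,
    max_requests: int = 5,
) -> list[dict[str, Any]]:
    """Follow continue tokens, but stop after max_requests calls."""
    results: list[dict[str, Any]] = []
    request_params = dict(params)
    for _ in range(max_requests):
        raw = fetch(request_params)
        results.extend(raw.get("query", {}).get(list_key, []))
        cont = raw.get("continue")
        if not cont:
            break  # the API reports no further results
        # Merge the continuation values into the next request.
        request_params = {**params, **cont}
    return results


calls: list[dict[str, Any]] = []


def fake_fetch(request_params: dict[str, Any]) -> dict[str, Any]:
    """Fake API that never runs out of results (like list=random)."""
    calls.append(dict(request_params))
    offset = int(request_params.get("sroffset", 0))
    return {
        "continue": {"sroffset": offset + 1, "continue": "-||"},
        "query": {"search": [{"title": f"Page {offset}"}]},
    }


pages = fetch_all_bounded(
    fake_fetch, {"action": "query", "list": "search"}, "search", max_requests=3
)
```

Even though the fake API never stops offering a continuation, the loop ends after three calls.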
The async method mirrors the sync method exactly, but uses await and the
_async_* dispatch helpers.
async def coordinates(
    self,
    page: AsyncWikipediaPage,
    *,
    limit: int = 10,
    primary: str = "primary",
    prop: str = "globe",
    distance_from_point: str | None = None,
    distance_from_page: str | None = None,
) -> list[Coordinate]:
    """Async: Fetch geographic coordinates for a page."""
    # Same arguments as the sync method
    params = CoordinatesParams(
        limit=limit, primary=primary, prop=prop,
        distance_from_point=distance_from_point,
        distance_from_page=distance_from_page,
    )
    cached = page._get_cached("coordinates", params.cache_key())
    if not isinstance(cached, type(NOT_CACHED)):
        return cached
    api_params = self._coordinates_api_params(page, params)
    # Same call as sync, with await
    raw = await self._get(page.language, self._construct_params(page, api_params))
    # ... same iteration logic as sync

async def search(
    self,
    query: str,
    *,
    namespace: int = 0,
    limit: int = 10,
    sort: str = "relevance",
) -> SearchResults:
    """Async: Search Wikipedia."""
    params = SearchParams(query=query, namespace=namespace, limit=limit, sort=sort)
    api_params = self._search_api_params(params)
    # Single request, same as sync, just with await
    raw = await self._get(
        self.language,
        self._construct_params_standalone(api_params),
    )
    return self._build_search_results(raw)

🚨 CRITICAL: The sync and async methods MUST have identical signatures
(same parameter names, types, defaults, return type). Only the dispatch calls
differ (self._get(...) vs await self._get(...)).
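One way to catch signature drift early is to compare the two methods with `inspect.signature`. This is a sketch of a possible check, not something the guide says exists in the test suite; the client classes and `signatures_match` are hypothetical.

```python
import inspect
from typing import Callable


class SyncClientSketch:
    def search(self, query: str, *, namespace: int = 0, limit: int = 10) -> list[str]:
        return []


class AsyncClientSketch:
    async def search(self, query: str, *, namespace: int = 0, limit: int = 10) -> list[str]:
        return []


class DriftedClientSketch:
    # The default for limit has drifted and namespace is missing.
    async def search(self, query: str, *, limit: int = 20) -> list[str]:
        return []


def signatures_match(sync_func: Callable, async_func: Callable) -> bool:
    """Compare parameter names, kinds, defaults, and annotations."""
    return inspect.signature(sync_func) == inspect.signature(async_func)


same = signatures_match(SyncClientSketch.search, AsyncClientSketch.search)
drifted = signatures_match(SyncClientSketch.search, DriftedClientSketch.search)
```

For prop= methods the `page` annotation differs between the two clients (WikipediaPage vs AsyncWikipediaPage), so a real check would need to ignore or normalize that one parameter.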
Only add page-level properties for prop= submodules where it makes sense to
access the data as page.coordinates or page.images. Standalone list=
modules like search and random don't need page properties — they return
results at the wiki client level.
For list= modules that attach metadata to result pages (like geosearch_meta
or search_meta), add a plain @property (no network call) on the page.
For a sync fetching property (like coordinates):

@property
def coordinates(self) -> list[Coordinate]:
    """Geographic coordinates for this page."""
    default_params = CoordinatesParams()
    cached = self._get_cached("coordinates", default_params.cache_key())
    if isinstance(cached, type(NOT_CACHED)):
        self.wiki.coordinates(self)  # ← triggers fetch
        cached = self._get_cached("coordinates", default_params.cache_key())
    if isinstance(cached, type(NOT_CACHED)):
        return []
    return cached

For a plain metadata property (like search_meta):

@property
def search_meta(self) -> SearchMeta | None:
    """Search metadata, or None if page didn't come from search()."""
    return self._search_meta

For an async fetching property, return a coroutine so callers use
await page.coordinates:

@property
def coordinates(self) -> Any:
    """Awaitable: geographic coordinates for this page."""
    async def _get() -> list[Coordinate]:
        default_params = CoordinatesParams()
        cached = self._get_cached("coordinates", default_params.cache_key())
        if isinstance(cached, type(NOT_CACHED)):
            await self.wiki.coordinates(self)  # ← await
            cached = self._get_cached("coordinates", default_params.cache_key())
        if isinstance(cached, type(NOT_CACHED)):
            return []
        return cached
    return _get()  # ← returns the coroutine, not the result

A plain metadata property is identical in both sync and async (no await):

@property
def search_meta(self) -> SearchMeta | None:
    """Search metadata, or None."""
    return self._search_meta

| Sync (WikipediaPage) | Async (AsyncWikipediaPage) |
|---|---|
| @property that fetches → returns value | @property that returns a coroutine → await page.foo |
| @property no fetch → returns value | @property no fetch → returns value (identical) |
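The two property styles can be compared side by side in a standalone sketch. `SyncPageSketch` and `AsyncPageSketch` are hypothetical classes, and the hard-coded list stands in for a network fetch.

```python
import asyncio


class SyncPageSketch:
    def __init__(self) -> None:
        self._cache: list[str] | None = None

    @property
    def coordinates(self) -> list[str]:
        if self._cache is None:
            self._cache = ["51.5074,-0.1278"]  # stands in for a network fetch
        return self._cache


class AsyncPageSketch:
    def __init__(self) -> None:
        self._cache: list[str] | None = None

    async def _fetch(self) -> list[str]:
        await asyncio.sleep(0)  # stands in for an awaited network call
        return ["51.5074,-0.1278"]

    @property
    def coordinates(self):
        async def _get() -> list[str]:
            if self._cache is None:
                self._cache = await self._fetch()
            return self._cache

        return _get()  # the property returns a coroutine for the caller to await


async def main() -> list[str]:
    page = AsyncPageSketch()
    return await page.coordinates


sync_value = SyncPageSketch().coordinates
async_value = asyncio.run(main())
```

A property cannot itself be declared `async`, so returning the coroutine from a nested function is what lets callers write `await page.coordinates`.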
Add cache slots for your new data:
# In BaseWikipediaPage.__init__():
# For per-parameter cached data (coordinates, images):
# Already handled by _param_cache dict, so no new slot is needed.
# For metadata attached by list= queries:
self._search_meta: Any = None
self._geosearch_meta: Any = None

The _param_cache dict (already initialized as {}) handles per-parameter
caching for coordinates and images automatically. You only need to add
explicit _<name> attributes for metadata properties like geosearch_meta
and search_meta.
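A per-parameter cache with a sentinel could look like the sketch below. The layout of the real `_param_cache` and the definition of `NOT_CACHED` are assumptions here; only the `_get_cached` / `_set_cached` call shapes are taken from this guide.

```python
from typing import Any


class _NotCached:
    """Sentinel type: distinguishes 'never fetched' from a cached empty list."""


NOT_CACHED = _NotCached()


class PageSketch:
    """Hypothetical page holding one cache bucket per submodule name."""

    def __init__(self) -> None:
        self._param_cache: dict[str, dict[tuple[Any, ...], Any]] = {}

    def _get_cached(self, name: str, key: tuple[Any, ...]) -> Any:
        return self._param_cache.get(name, {}).get(key, NOT_CACHED)

    def _set_cached(self, name: str, key: tuple[Any, ...], value: Any) -> None:
        self._param_cache.setdefault(name, {})[key] = value


page = PageSketch()
default_key = (10, "primary")  # e.g. the cache_key() of the default params
all_key = (10, "all")  # same module, different parameters

page._set_cached("coordinates", default_key, [])
hit = page._get_cached("coordinates", default_key)
miss = page._get_cached("coordinates", all_key)
```

The sentinel matters because an empty list is a valid cached result (for example, a page without coordinates), so `None` or `[]` cannot be used to mean "not fetched yet".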
Add your new types to the public API:
# wikipediaapi/__init__.py
from ._types import Coordinate
from ._types import SearchMeta
from ._types import SearchResults
__all__ = [
    # ... existing exports ...
    "Coordinate",
    "SearchMeta",
    "SearchResults",
]

Add mock API responses that match the exact cache key format used by the test infrastructure. The key format is:
{language}:{param1}={value1}&{param2}={value2}&...&
Parameters are sorted alphabetically. Trailing & is required.
# Successful response
"en:action=query&colimit=10&coprimary=primary&coprop=globe&format=json&prop=coordinates&redirects=1&titles=Test_1&": {
    "batchcomplete": "",
    "query": {
        "pages": {
            "4": {
                "pageid": 4,
                "ns": 0,
                "title": "Test 1",
                "coordinates": [
                    {
                        "lat": 51.5074,
                        "lon": -0.1278,
                        "primary": "",  # "" means primary in MW API
                        "globe": "earth",
                    }
                ],
            }
        }
    },
},
# Non-existent page
"en:action=query&colimit=10&coprimary=primary&coprop=globe&format=json&prop=coordinates&redirects=1&titles=NonExistent&": {
    "batchcomplete": "",
    "query": {
        "pages": {
            "-1": {
                "ns": 0,
                "title": "NonExistent",
                "missing": "",
            }
        }
    },
},

"en:action=query&format=json&list=search&redirects=1&srlimit=10&srnamespace=0&srsearch=Python&srsort=relevance&": {
    "batchcomplete": "",
    "query": {
        "searchinfo": {"totalhits": 5432, "suggestion": "python programming"},
        "search": [
            {
                "ns": 0,
                "title": "Python (programming language)",
                "pageid": 300,
                "size": 123456,
                "wordcount": 15000,
                "snippet": "<span>Python</span> is a programming language",
                "timestamp": "2024-01-01T00:00:00Z",
            },
        ],
    },
},

- Look at your params dataclass defaults and to_api() output.
- Combine with the base params (action=query, format=json, redirects=1, prop=X or list=X, and titles=Y for prop modules).
- Sort all params alphabetically, join with &, add trailing &.
- Prepend {language}:.
Tip: If unsure, add a print() in mock_data.py's wikipedia_api_request
to see what key is being looked up at test time.
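The key format can also be reproduced with a few lines of Python. `build_mock_key` is a hypothetical helper, not a function in `mock_data.py`, and it assumes values are inserted as-is without URL encoding.

```python
def build_mock_key(language: str, params: dict[str, str]) -> str:
    """Sort params by name, join as name=value&, and prepend the language."""
    pairs = "".join(f"{name}={params[name]}&" for name in sorted(params))
    return f"{language}:{pairs}"


coordinates_key = build_mock_key(
    "en",
    {
        "action": "query",
        "format": "json",
        "redirects": "1",
        "prop": "coordinates",
        "titles": "Test_1",
        "colimit": "10",
        "coprimary": "primary",
        "coprop": "globe",
    },
)
```

Comparing the output of a helper like this with the key printed at test time is a quick way to spot a missing default or a misspelled prefix.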
Create a test file tests/<module>_test.py or add to
tests/query_submodules_test.py.
class TestCoordinates(unittest.TestCase):
    def setUp(self):
        self.wiki = wikipediaapi.Wikipedia(user_agent, "en")
        self.wiki._get = wikipedia_api_request(self.wiki)

    def test_coordinates_default(self):
        page = self.wiki.page("Test_1")
        coords = self.wiki.coordinates(page)
        self.assertEqual(len(coords), 1)
        self.assertAlmostEqual(coords[0].lat, 51.5074)
        self.assertTrue(coords[0].primary)

    def test_coordinates_nonexistent_page(self):
        page = self.wiki.page("NonExistent")
        coords = self.wiki.coordinates(page)
        self.assertEqual(coords, [])

    def test_coordinates_cached(self):
        page = self.wiki.page("Test_1")
        coords1 = self.wiki.coordinates(page)
        coords2 = self.wiki.coordinates(page)
        self.assertIs(coords1, coords2)  # same object = cache hit

    def test_page_coordinates_property(self):
        page = self.wiki.page("Test_1")
        coords = page.coordinates
        self.assertEqual(len(coords), 1)

class TestSearch(unittest.TestCase):
    def setUp(self):
        self.wiki = wikipediaapi.Wikipedia(user_agent, "en")
        self.wiki._get = wikipedia_api_request(self.wiki)

    def test_search(self):
        results = self.wiki.search("Python")
        self.assertIsInstance(results, wikipediaapi.SearchResults)
        self.assertEqual(len(results.pages), 2)

    def test_search_totalhits(self):
        results = self.wiki.search("Python")
        self.assertEqual(results.totalhits, 5432)

    def test_search_meta(self):
        results = self.wiki.search("Python")
        p = results.pages["Python (programming language)"]
        self.assertIsNotNone(p.search_meta)
        self.assertEqual(p.search_meta.size, 123456)

class TestAsyncSearch(unittest.IsolatedAsyncioTestCase):
    def setUp(self):
        self.wiki = wikipediaapi.AsyncWikipedia(user_agent, "en")
        self.wiki._get = async_wikipedia_api_request(self.wiki)

    async def test_async_search(self):
        results = await self.wiki.search("Python")
        self.assertIsInstance(results, wikipediaapi.SearchResults)
        self.assertEqual(len(results.pages), 2)

- ✅ Default parameters return expected data
- ✅ Custom parameters (e.g. primary="all") return different data
- ✅ Non-existent page returns empty result
- ✅ Cache hit returns same object (assertIs)
- ✅ Per-parameter cache separates different param sets
- ✅ Page property triggers fetch and returns correct data
- ✅ Typed data classes have correct fields
- ✅ Frozen dataclasses reject mutation
- ✅ Async versions of all the above
After implementing, update these files:
- API.rst: Add method signatures to the Wikipedia and AsyncWikipedia sections, add properties to the WikipediaPage and AsyncWikipediaPage sections, and add new data classes to the "Typed Data Classes" section.
- DESIGN.rst: Update the class diagram, dispatch helpers mapping, and invariants section.
- examples/example_sync.py: Add a usage example section (numbered, with comments).
- examples/example_async.py: Mirror the sync example with await.
- index.rst: Add a "How To" section with sync and async code blocks.
- README.rst: Should mirror index.rst.
# All pre-commit hooks (isort, black, flake8, mypy, pyupgrade)
make run-pre-commit
# Unit tests (414+ tests)
make run-tests
# Coverage (must stay ≥ 90%, target 96%)
make run-coverage

Fix any issues and re-run until everything passes.
| File | What to add |
|---|---|
| wikipediaapi/_types/ | Frozen dataclass for response data |
| wikipediaapi/_params/ | Frozen dataclass for API parameters |
| wikipediaapi/_resources/ | _*_api_params(), _build_*(), sync method, async method |
| wikipediaapi/_page/_base_wikipedia_page.py | Cache slots (if needed) in __init__ |
| wikipediaapi/_page/wikipedia_page.py | Sync @property |
| wikipediaapi/_page/async_wikipedia_page.py | Async @property (returns coroutine) |
| wikipediaapi/__init__.py | Export new types |
| tests/mock_data.py | Mock API responses |
| tests/query_submodules_test.py | Sync + async tests |
| API.rst | Public API reference |
| DESIGN.rst | Architecture docs |
| examples/example_sync.py | Sync usage example |
| examples/example_async.py | Async usage example |
| index.rst | User-facing docs |
- Wrong cache key in mock data: The key must exactly match the sorted, prefixed params. Add a print statement to debug.
- Forgetting async symmetry: Every sync method needs an identical async counterpart. Every sync page property needs an async page property.
- Using _get instead of await self._get in async methods: This silently returns a coroutine object instead of the result.
- Not using _make_page(): If you manually construct WikipediaPage() in a _build_* method, async callers will get sync page objects. Always use self._make_page().
- Title normalization in batch methods: MediaWiki normalizes titles (e.g. Test_1 → Test 1). Use _build_normalization_map(raw) and look up original titles in the norm map.
- Not exporting from __init__.py: Users import from wikipediaapi directly. If you forget to export, they can't access your types.
- Standalone list params missing redirects=1: The _construct_params_standalone() method adds this automatically, but your mock data keys need to include it.
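The title-normalization pitfall is easier to see with a concrete response. The sketch below uses the `normalized` list that MediaWiki includes in query responses (entries with `from` and `to` keys); `build_normalization_map_sketch` is a hypothetical stand-in, and the return shape of the library's own `_build_normalization_map` is an assumption.

```python
from typing import Any


def build_normalization_map_sketch(raw: dict[str, Any]) -> dict[str, str]:
    """Map each normalized title back to the title the caller requested."""
    return {
        entry["to"]: entry["from"]
        for entry in raw.get("query", {}).get("normalized", [])
    }


# Hand-written response for a request with titles=Test_1|Other.
raw = {
    "query": {
        "normalized": [{"from": "Test_1", "to": "Test 1"}],
        "pages": {
            "4": {"pageid": 4, "title": "Test 1"},
            "5": {"pageid": 5, "title": "Other"},
        },
    }
}

norm_map = build_normalization_map_sketch(raw)
# Fall back to the returned title when no normalization happened.
original_titles = [
    norm_map.get(entry["title"], entry["title"])
    for entry in raw["query"]["pages"].values()
]
```

Without the lookup, a batch method keyed by the requested title `Test_1` would fail to find the page that the API returns as `Test 1`.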