Skip to content
This repository was archived by the owner on Dec 30, 2025. It is now read-only.

grey-box/wil-symmetry-ccsu-rawkit-2025

Repository files navigation

Wikipedia Data Extraction API

FastAPI-based service for extracting structured data (tables, infoboxes, citations, etc.) from Wikipedia articles.

Setup

  1. Install dependencies:
pip install -r requirements.txt

Running the API Server

Start the FastAPI server:

uvicorn Application:application --reload --host 0.0.0.0 --port 8000

The API will be available at:

Using the Client Script

Run the client script to interact with the API:

python client.py

The client script includes:

  • Example requests for extracting tables from Wikipedia articles
  • Interactive mode for testing different articles and languages

API Endpoints

POST /extract-tables

Extract all tables from a Wikipedia article.

Request Body:

{
  "page_title": "Python (programming language)",
  "language": "english"
}

Response:

{
  "article_title": "Python (programming language)",
  "language": "english",
  "number_of_tables": 2,
  "tables": [
    {
      "table_id": "table_1",
      "headers": ["Column1", "Column2"],
      "rows": [["Value1", "Value2"]],
      "caption": "Table caption"
    }
  ],
  "table_details": {
    "1": "5 rows and 3 columns",
    "2": "10 rows and 4 columns"
  }
}

POST /info-box

Extract infobox from Wikipedia article (JSON input).

POST /tables

Parse tables from article JSON (legacy endpoint).

POST /citation

Extract citations from Wikipedia article.

POST /head

Extract headers from Wikipedia article.

Example Usage

Using Python requests:

import requests

response = requests.post(
    "http://localhost:8000/extract-tables",
    json={
        "page_title": "Python (programming language)",
        "language": "english"
    }
)
print(response.json())

Using curl:

curl -X POST "http://localhost:8000/extract-tables" \
  -H "Content-Type: application/json" \
  -d '{"page_title": "Python (programming language)", "language": "english"}'

Supported Languages

  • english (en)
  • spanish (es)
  • german (de)
  • dutch (nl)
  • Or any ISO 639-1 two-letter language code

About

Research to further Project Symmetry's analysis of supporting elements on Wikipedia pages (tables, infoboxes, citations, etc.). Central Connecticut State University, Fall 2025

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors