A curated corpus of public domain literary texts from Project Gutenberg — organized for analysis, annotation, and knowledge extraction.
This repository contains public domain literary texts sourced from Project Gutenberg, organized by author and work. Texts are provided both as full ebooks and as individual sections to support fine-grained annotation and analysis.
- `data/ebooks/`: Full plain-text ebooks organized by Project Gutenberg ID
- `authors/`: Texts organized by author and title, with works split into named sections (chapters, acts, scenes, etc.)
All content is in the public domain and freely available via Project Gutenberg.
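The layout described above can be explored with a few lines of Python. This is a minimal sketch, not part of the repository: the function names `list_ebooks` and `list_sections` are hypothetical, and the nesting `authors/<author>/<title>/<section file>` is an assumption inferred from the description, so adjust the glob depth if the actual tree differs.

```python
from pathlib import Path


def list_ebooks(repo_root):
    """Return the names of entries under data/ebooks/, sorted.

    Returns an empty list if the directory does not exist.
    """
    ebooks_dir = Path(repo_root) / "data" / "ebooks"
    if not ebooks_dir.is_dir():
        return []
    return sorted(p.name for p in ebooks_dir.iterdir())


def list_sections(repo_root, pattern="*"):
    """Group section files by (author, title).

    ASSUMPTION: sections live at authors/<author>/<title>/<file>.
    `pattern` filters file names (for example "*.txt"); the actual
    file extension used by the corpus is not stated in this README.
    """
    authors_dir = Path(repo_root) / "authors"
    works = {}
    for path in sorted(authors_dir.glob(f"*/*/{pattern}")):
        if path.is_file():
            author, title = path.parts[-3], path.parts[-2]
            works.setdefault((author, title), []).append(path.name)
    return works


if __name__ == "__main__":
    # Run from the repository root to get a quick inventory.
    print(f"{len(list_ebooks('.'))} entries under data/ebooks/")
    for (author, title), sections in list_sections(".").items():
        print(f"{author} / {title}: {len(sections)} sections")
```

Grouping by `(author, title)` keeps each work's sections together, which is convenient when annotating one work at a time.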
This corpus is well-suited for entity recognition across characters, authors, places, and historical references; mapping relationships between characters, themes, and narrative arcs; stylistic and linguistic analysis across authors and time periods; and building literary knowledge graphs from unstructured text.
This repo ships eleven skills that build a layered literary KB on top of the Semiont SDK. See AGENTS.md for the full design discussion.
| Skill | What it does |
|---|---|
| `ingest-corpus` | Walk the repo's literary content (sections, curated articles, ebooks); create one resource per file. |
| `mark-characters` | Detect Character mentions including descriptive references ("the Chorus", "the messenger"). |
| `mark-places` | Detect Place mentions, both real geographic places and mythological realms. |
| `assess-dangerous-situations` | Flag spans of physical / moral / supernatural danger; used as peak-danger landmarks for plot-arc synthesis. |
| `comment-subtext` | Add commenting annotations capturing inner thoughts, subtext, dramatic irony, plot-significance markers. |
| `build-character-articles` | Promote Character mentions to canonical wiki-style Character resources with Wikipedia citations. |
| `build-place-articles` | Promote Place mentions to canonical wiki-style Place resources with Wikipedia citations. |
| `map-relationships` | Extract character-character + character-place relationships as tagged linking annotations. |
| `build-historical-context` | Synthesize HistoricalContext resources for the work's era, prior tradition, audience, and concepts. |
| `extract-themes` | Tag passages with thematic concerns; synthesize one Theme resource per distinct theme. |
| `trace-plot-arc` | Synthesize a per-work PlotArc resource documenting the narrative structure. |
Explore this dataset using Semiont, an open-source knowledge base platform for annotation and knowledge extraction.
This repo follows the same layout and startup flow as semiont-template-kb. See its README for full setup instructions:
- Quick Start: Local, for running the backend stack on your machine via `.semiont/scripts/start.sh`
- Quick Start: Codespaces, for launching a preconfigured backend in the cloud
- Inference Configuration, covering Ollama (local) vs. Anthropic (cloud) configs
Install the GitHub CLI (gh) if you haven't already.
Before creating: add `ANTHROPIC_API_KEY` as a user secret with this repo selected. Otherwise the backend comes up, but inference is non-functional until you add the secret and rebuild the container.
Create the codespace on a premium machine for faster builds and more headroom:

    gh codespace create --repo The-AI-Alliance/semiont-gutenberg-kb --machine premiumLinux

Forward the backend port to your local machine, then fetch the auto-generated admin credentials:

    gh codespace ports forward 4000:4000
    gh codespace ssh -- cat .devcontainer/admin.json

The credentials let you log in via the Semiont browser; see Quick Start: Codespaces on the template-kb README for the full browser-side flow.
Apache 2.0 — See LICENSE for details.