Conversational Ontology Builder

Project Vision

Build a conversational AI system that learns semantic definitions through natural dialogue and constructs a formal RDF ontology. The system should proactively extract meanings, definitions, and relationships from conversational statements, building a shared knowledge base collaboratively with the user. It starts with no domain knowledge, only standard ontology foundations, and progressively learns what things mean through conversation.

Core Principle: Definitional Extraction & Collaborative Learning

The system should proactively extract definitions and meanings from natural conversational statements. When someone defines or describes something, capture that semantic knowledge without requiring formal statements.

Key Behaviors:

Extract definitions - "X is Y" or "X is how we Z" → Capture semantic relationship
Proactively identify concepts - Recognize new terms and extract their meaning from context
Ask clarifying questions - Only when definitions are vague, ambiguous, or missing
Ground in standards when possible - Link to OWL-Time, org:, PROV-O, SKOS, Dublin Core, with Schema.org as a fallback
Remember context - Once defined, use that definition consistently
Allow refinement - Definitions can be extended or clarified in later conversation
Handle semantic flexibility - Recognize when the same word has multiple valid meanings in different contexts (polysemy), and model each sense as a distinct entity that shares the same signifier.

Focus: Semantic Knowledge, Not Policies

This is about building shared understanding of what things mean, not translating rules or policies. Focus on:

Definitions: What things are ("retention is how we measure...")
Relationships: How things relate ("files have retention periods")
Properties: What attributes things have ("files have a creation date")
Taxonomies: How things categorize ("a document is a type of file")

Not about: Rules, obligations, requirements, policies (that's a different system)

Example - Definitional Extraction:

User: "retention is how we measure how long a file should be kept"

System: (proactively extracts)
        ✓ "retention" - concept related to duration measurement
        ✓ "file" - entity that can be retained
        ✓ relationship: retention applies to files
        ✓ temporal aspect: involves duration (time:Duration)

        I understand "retention" is about duration for keeping files.

        What is a "file" in your context?

User: "a file is a digital document"

System: ✓ file → digital document
        ✓ grounding in Dublin Core: dcterms:Document

        Perfect! Adding:

        :File a owl:Class ;
            rdfs:subClassOf dcterms:Document ;
            rdfs:label "File" ;
            rdfs:comment "A digital document" .

        :Retention a owl:Class ;
            rdfs:label "Retention" ;
            rdfs:comment "Measurement of how long a file should be kept" ;
            :appliesToType :File ;
            :measurementType time:Duration .

        Got it. Continue...

User: "files can be documents or records"

System: (refining earlier definition)
        ✓ File has subtypes: Document, Record

        Updating ontology:

        :Document a owl:Class ;
            rdfs:subClassOf :File .

        :Record a owl:Class ;
            rdfs:subClassOf :File .

        Refined! File now has two subtypes. What else?

Critical: Semantic Flexibility & Ambiguity

The Problem: Words Have Multiple Valid Meanings

DO NOT treat the first definition as immutable truth. The same word can have multiple legitimate meanings in different contexts:

"file" could mean:
- Digital file (computer file like .docx, .pdf)
- Physical file (paper folder in a filing cabinet)
- Database record
- Tool for smoothing surfaces
"clearance" could mean:
- Security clearance (credential)
- Physical clearance (space measurement)
- Sale clearance (retail)

The Solution: Contextual Disambiguation

When a user mentions a term that already exists in the ontology, but uses it differently:

❌ WRONG - Rigid enforcement:

User: "we also track physical files in filing cabinets"
System: Error! File is defined as digital document (dcterms:Document)

✅ CORRECT - Flexible disambiguation:

User: "we also track physical files in filing cabinets"

System: I notice you mentioned "file" - I have that defined as a digital document (.docx, etc).

        Are you talking about:
        A) The same thing (digital files)
        B) A different kind of file (physical filing cabinet files)

User: B - physical files are different

System: Got it! I'll create a distinction:

        # Update ontology to handle both senses
        :File a owl:Class ;
            rdfs:label "File" ;
            rdfs:comment "General concept - see specific types" .

        :DigitalFile a owl:Class ;
            rdfs:subClassOf :File, dcterms:Document ;
            rdfs:label "Digital File" ;
            rdfs:comment "Computer file (.docx, .pdf, etc.)" .

        :PhysicalFile a owl:Class ;
            rdfs:subClassOf :File ;
            rdfs:label "Physical File" ;
            rdfs:comment "Paper folder stored in filing cabinets" .

        Now you have both digital and physical files! What else can you tell me about physical files?

Detection Strategy

When encountering a term that exists in ontology:

Check if usage matches existing definition
- Does context align with what's already defined?
- Are they talking about the same thing?
If usage conflicts, ask clarifying question:
- "I have [term] as [existing definition]. Are you talking about that, or something different?"
- Give user options: same concept, different concept, or refinement
If different, create disambiguation:
- Create qualified/scoped versions (:DigitalFile vs :PhysicalFile)
- Update original to be more general if needed
- Link both to appropriate standard ontologies

Examples of Handling Ambiguity

Example 1: File - Digital vs Physical

Ontology already has: :File → dcterms:Document (digital)

User: "physical files need retention periods too"

System: I have "file" as digital documents. Are you now talking about physical paper files?

User: Yes

System: Creating distinction:
        :DigitalFile (for computer files)
        :PhysicalFile (for paper files)
        Both are types of :File

        Which type needs retention periods - both?

Example 2: Record - Database vs Legal

Ontology has: :Record → database record

User: "legal records must be kept for 7 years"

System: I have "record" as a database record. Is "legal record" the same thing, or different?

User: Different - legal records are official documents

System: ✓ Creating separate concept:
        :DatabaseRecord (existing - database entry)
        :LegalRecord (new - official legal document, subclass of dcterms:Document)

Example 3: Clearance - Security vs Physical

Ontology has: :Clearance → security credential

User: "equipment needs 2 meters of clearance"

System: I have "clearance" as security credential. This sounds like you mean physical space/distance?

User: Yes, different meaning

System: Creating separate concepts:
        :SecurityClearance (credential)
        :PhysicalClearance (spatial measurement, links to qudt:Length)

        Both are valid meanings of "clearance" in different contexts.

Key Principles for Semantic Flexibility

Never assume the first definition is the only definition
Context matters - same word, different domains → likely different meanings
Ask, don't assume - when usage doesn't match the existing definition, clarify
Disambiguate with qualifiers - DigitalFile vs PhysicalFile, not File1 vs File2
Both can be correct - multiple senses are normal in natural language
Update ontology structure - may need to make existing concept more general
Ground each sense appropriately - digital file → dcterms:Document, physical file → maybe prov:Entity

Pattern: Handling Conflicting Usage

# Pseudocode for handling term conflicts
if term exists in ontology:
    if user_context matches existing_definition:
        use_existing_concept()
    else:
        ask_user: "I have {term} as {existing}. Same thing or different?"

        if different:
            # Create qualified versions
            create_specific_term(f"{qualifier}{term}")  # e.g., DigitalFile
            update_existing_to_general_if_needed()
            ground_both_appropriately()
            show_user_the_distinction()
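
The "if different" branch of the pseudocode above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the ontology is modeled as a dict rather than an RDFLib graph, and the names Concept and split_senses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    name: str                                   # local name, e.g. "File"
    comment: str                                # human-readable definition
    parents: List[str] = field(default_factory=list)  # superclasses / groundings


def split_senses(ontology: Dict[str, Concept], term: str, old_qualifier: str,
                 new_qualifier: str, new_comment: str) -> None:
    """Split an existing concept into two qualified senses under a general parent."""
    existing = ontology[term]
    # The original definition moves down to a qualified subclass (e.g. DigitalFile),
    # taking its previous grounding with it.
    ontology[old_qualifier + term] = Concept(
        old_qualifier + term, existing.comment, [term] + existing.parents)
    # The new sense gets its own qualified subclass (e.g. PhysicalFile).
    ontology[new_qualifier + term] = Concept(
        new_qualifier + term, new_comment, [term])
    # The original term becomes the general concept.
    existing.comment = "General concept - see specific types"
    existing.parents = []


ontology = {"File": Concept("File", "A digital document", ["dcterms:Document"])}
split_senses(ontology, "File", "Digital", "Physical",
             "Paper folder stored in filing cabinets")
print(sorted(ontology))  # ['DigitalFile', 'File', 'PhysicalFile']
```

Nothing is overwritten: the earlier definition survives as the qualified sense, which matches the "update ontology with new sense, don't overwrite" principle.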

Integration with RAG/MCP Systems

Important for implementations using RAG or MCP:

When retrieving ontology knowledge:

Don't just match on term string
Consider context and usage
When conflict detected, trigger disambiguation flow
Update ontology with new sense, don't overwrite
Use qualified names in SPARQL queries if multiple senses exist

Example SPARQL that handles multiple senses:

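# Each query needs this prefix declaration
PREFIX : <http://example.org/ontology#>

# Instead of querying the general class:
SELECT ?file WHERE { ?file a :File }

# Use the qualified class for the sense you need (run as separate queries):
SELECT ?digitalFile WHERE { ?digitalFile a :DigitalFile }
SELECT ?physicalFile WHERE { ?physicalFile a :PhysicalFile }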

Standard Ontology Foundation

The system should recognize and ground concepts in these standard ontologies (W3C vocabularies, plus Dublin Core, which is maintained by DCMI):

1. OWL-Time (Temporal Concepts)

Namespace: http://www.w3.org/2006/time#
Prefix: time:
Use for: Dates, durations, temporal relations, intervals
Key concepts:
- time:Duration - Use ISO 8601 format (P7Y = 7 years, P12M = 12 months)
- time:Instant - Points in time
- Temporal relations: time:before, time:after, time:intervalDuring
Always use ISO 8601 durations: "7 years" → "P7Y", "12 months" → "P12M"
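
The conversion rule above can be sketched as a small helper. This is a hedged illustration only: the function name to_iso8601_duration and the unit table are assumptions, and only years, months, and days are handled.

```python
import re
from typing import Optional

# Assumed unit table: English unit word -> ISO 8601 designator.
# Only calendar units are covered; hours, minutes and weeks are out of scope here.
UNIT_DESIGNATORS = {"year": "Y", "month": "M", "day": "D"}

_DURATION_PHRASE = re.compile(r"(\d+)\s*(year|month|day)s?\b", re.IGNORECASE)


def to_iso8601_duration(text: str) -> Optional[str]:
    """Return the ISO 8601 duration for the first "<number> <unit>" phrase, or None."""
    match = _DURATION_PHRASE.search(text)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return f"P{amount}{UNIT_DESIGNATORS[unit]}"


print(to_iso8601_duration("SC clearance lasts for 10 years"))  # P10Y
print(to_iso8601_duration("12 months"))                        # P12M
```

Returning None for text without a recognizable duration lets the caller decide whether to ask a clarifying question.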

2. Organization Ontology (org:)

Namespace: http://www.w3.org/ns/org#
Prefix: org:
Use for: Organizations, organizational units, roles, memberships
Key concepts:
- org:Organization - Companies, government departments, teams
- org:OrganizationalUnit - Divisions, departments, business units
- org:Role - Job roles, responsibilities
- org:Membership - Person's role in organization

3. PROV-O (Provenance)

Namespace: http://www.w3.org/ns/prov#
Prefix: prov:
Use for: Activities, agents, entities, processes
Key concepts:
- prov:Activity - Processes, assessments, actions
- prov:Agent - People, systems, organizations
- prov:Entity - Documents, records, artifacts

4. SKOS (Knowledge Organization)

Namespace: http://www.w3.org/2004/02/skos/core#
Prefix: skos:
Use for: Concept schemes, taxonomies, controlled vocabularies
Key concepts:
- skos:Concept - Individual concepts in a scheme
- skos:ConceptScheme - Classification systems
- skos:prefLabel, skos:altLabel - Labeling
- skos:broader, skos:narrower - Hierarchical relationships

5. Dublin Core (Metadata)

Namespace: http://purl.org/dc/terms/
Prefix: dcterms:
Use for: Descriptive metadata, documentation
Key concepts:
- dcterms:title, dcterms:description
- dcterms:creator, dcterms:created, dcterms:modified
- dcterms:type, dcterms:subject

6. RDFS & OWL (Core Semantics)

Use for: Class/property definitions, basic relationships
Key concepts:
- rdfs:Class, owl:Class - Defining types
- rdfs:subClassOf - Class hierarchies
- rdfs:label, rdfs:comment - Human-readable annotations
- owl:ObjectProperty, owl:DatatypeProperty - Relationships

Technical Architecture

Technology Stack

Python 3.9+
RDFLib - For RDF manipulation and TTL serialization
LLM API - For conversational interface (e.g., GPT-4 via the OpenAI API, or Claude via the Anthropic API)
No triplestore required - File-based TTL for persistence

Core Components

1. Conversational Agent (`conversation_agent.py`)

Proactively extracts definitions from user statements
Multi-turn dialogue management
Maintains conversation context and memory
Asks clarifying questions only when definitions are unclear
Coordinates between user input and ontology operations
Shows what's being added to the ontology in real-time

2. Ontology Manager (`ontology_manager.py`)

Loads standard ontologies (OWL-Time, org:, PROV-O, SKOS, Dublin Core)
Checks if terms exist in loaded ontologies
Generates unique IRIs for new concepts (human-readable, collision-free)
Adds new triples to the growing domain ontology
Handles refinement of existing definitions
Serializes to TTL format after each addition

3. Definition Extractor (`definition_extractor.py`)

Identifies definitional patterns ("X is Y", "X means Z", "X is how we...")
Extracts entities, relationships, and properties from statements
Recognizes taxonomic relationships (subclass, instance)
Identifies attributes and their types
Flags terms needing clarification

4. Grounding Engine (`grounding_engine.py`)

Matches extracted concepts to standard ontologies
Suggests which standard concepts apply (e.g., "duration" → time:Duration)
Links domain concepts to foundation ontologies
Validates semantic grounding
Ensures every concept connects to formal semantics

5. Standard Ontology Loader (`standard_ontologies.py`)

Pre-loads W3C standard vocabularies (hardcoded mappings)
Provides lookup functions (e.g., "is 'Activity' a known concept?")
Suggests relevant standard concepts during extraction
Returns namespace URIs and definitions
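
A minimal sketch of what the hardcoded mappings and lookup functions could look like. The keyword table, suggest_grounding, and expand are hypothetical names chosen for illustration; the namespace URIs are the ones listed in the Standard Ontology Foundation section.

```python
from typing import Optional

# Assumed keyword -> standard term table (illustrative, far from complete).
STANDARD_TERMS = {
    "duration": "time:Duration",
    "instant": "time:Instant",
    "organization": "org:Organization",
    "unit": "org:OrganizationalUnit",
    "role": "org:Role",
    "activity": "prov:Activity",
    "process": "prov:Activity",
    "agent": "prov:Agent",
    "person": "prov:Agent",
    "concept": "skos:Concept",
}

NAMESPACES = {
    "time": "http://www.w3.org/2006/time#",
    "org": "http://www.w3.org/ns/org#",
    "prov": "http://www.w3.org/ns/prov#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def suggest_grounding(phrase: str) -> Optional[str]:
    """Return the standard term for the first known keyword in the phrase."""
    for word in phrase.lower().split():
        # Try the word as written, then a crude singular form.
        candidates = [word, word[:-1]] if word.endswith("s") else [word]
        for candidate in candidates:
            if candidate in STANDARD_TERMS:
                return STANDARD_TERMS[candidate]
    return None


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'prov:Activity' to a full IRI."""
    prefix, local = curie.split(":", 1)
    return NAMESPACES[prefix] + local


print(suggest_grounding("a process of checking backgrounds"))  # prov:Activity
print(expand("prov:Activity"))  # http://www.w3.org/ns/prov#Activity
```

Keyword matching is only a baseline; the grounding engine would likely combine it with LLM suggestions and user confirmation.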

File Structure

conversational-ontology-builder/
├── README.md
├── requirements.txt
├── .env
├── ontology_builder/
│   ├── __init__.py
│   ├── conversation_agent.py      # Main conversational interface
│   ├── ontology_manager.py        # RDFLib ontology operations
│   ├── definition_extractor.py    # Definitional pattern extraction
│   ├── grounding_engine.py        # Progressive grounding logic
│   ├── standard_ontologies.py     # W3C vocab loader
│   ├── identifier_generator.py    # Human-readable IRI generation
│   └── prompts.py                 # LLM system prompts
├── ontologies/
│   ├── domain_ontology.ttl        # The growing domain ontology (starts empty)
│   └── standards/                 # Downloaded standard ontologies (optional)
│       ├── time.ttl
│       ├── org.ttl
│       ├── prov.ttl
│       └── skos.ttl
└── tests/
    └── test_grounding.py

Conversation Flow Architecture

State Machine (Definition-Driven)

START
  ↓
[User makes statement]
  ↓
[Extract concepts, definitions, relationships]
  ↓
For each extracted concept:
  ↓
  [Check if already in ontology] ──────→ EXISTS ──→ [Use existing definition]
  ↓ NEW CONCEPT                                      (or refine if more info)
  [Extract definition from context]
  ↓
  [Is definition clear?]
  ↓
  ├─→ YES ──→ [Ground in standard ontology] ──→ [Add to ontology] ──→ [Show user]
  ├─→ NO ──→ [Ask clarifying question] ──→ [Get more info] ──→ [Loop back]
  └─→ PARTIAL ──→ [Add what we know, flag for refinement] ──→ [Continue]
  ↓
[Continue conversation]
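
The branching in the diagram above can be written as a small decision function. This is a sketch only: the Clarity enum, next_step, and the returned step names are hypothetical labels for the boxes in the diagram.

```python
from enum import Enum


class Clarity(Enum):
    YES = "yes"
    NO = "no"
    PARTIAL = "partial"


def next_step(exists_in_ontology: bool, clarity: Clarity) -> str:
    """Pick the next action for one extracted concept, following the diagram."""
    if exists_in_ontology:
        # EXISTS branch: reuse the definition, or refine it if more info was given.
        return "use_or_refine_existing"
    if clarity is Clarity.YES:
        return "ground_and_add"
    if clarity is Clarity.NO:
        return "ask_clarifying_question"
    # PARTIAL: record what is known and flag the concept for later refinement.
    return "add_partial_and_flag"


print(next_step(False, Clarity.PARTIAL))  # add_partial_and_flag
```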

Proactive Definition Extraction

The system should recognize definitional patterns in natural speech:

Pattern 1: "X is Y"

"retention is a duration" → :Retention rdfs:subClassOf time:Duration

Pattern 2: "X is how we Y"

"retention is how we measure storage time" → Extract concept + purpose

Pattern 3: "X means Y"

"SC means Security Clearance" → skos:altLabel or rdfs:comment

Pattern 4: "X has Y"

"files have creation dates" → Property: :hasCreationDate

Pattern 5: "X can be Y or Z"

"files can be documents or records" → Subtypes: :Document, :Record

Pattern 6: "X is part of Y"

"UKSV is part of Cabinet Office" → org:unitOf

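The six patterns above can be approximated with ordered regular expressions as a baseline. This is a hedged sketch: the pattern names and the classify function are assumptions, and a real system would probably let the LLM handle phrasing that these expressions miss.

```python
import re
from typing import Optional, Tuple

# Ordered so that specific patterns are tried before the generic "X is Y".
PATTERNS = [
    ("purpose",  re.compile(r"^(?P<x>.+?) is how we (?P<y>.+)$", re.IGNORECASE)),
    ("part_of",  re.compile(r"^(?P<x>.+?) is part of (?P<y>.+)$", re.IGNORECASE)),
    ("subtypes", re.compile(r"^(?P<x>.+?) can be (?P<y>.+)$", re.IGNORECASE)),
    ("label",    re.compile(r"^(?P<x>.+?) means (?P<y>.+)$", re.IGNORECASE)),
    ("property", re.compile(r"^(?P<x>.+?) (?:has|have) (?P<y>.+)$", re.IGNORECASE)),
    ("is_a",     re.compile(r"^(?P<x>.+?) is (?P<y>.+)$", re.IGNORECASE)),
]


def classify(statement: str) -> Optional[Tuple[str, str, str]]:
    """Return (pattern name, X, Y) for the first matching pattern, or None."""
    text = statement.strip().rstrip(".")
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group("x"), match.group("y")
    return None


print(classify("retention is how we measure storage time"))
# ('purpose', 'retention', 'measure storage time')
print(classify("files can be documents or records"))
# ('subtypes', 'files', 'documents or records')
```

Ordering matters: "X is how we Y" and "X is part of Y" would otherwise be swallowed by the generic "X is Y" pattern.
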
Conversation Examples

Example 1: Definitional Extraction - Simple

User: retention is how we measure how long a file should be kept

System: (extracts definitions)
        ✓ "retention" - concept about measurement/duration
        ✓ "file" - entity mentioned
        ✓ relationship: retention measures duration for files
        ✓ temporal aspect: duration → time:Duration

        I understand "retention" is about measuring storage duration for files.

        What is a "file" in your context?

User: a file is a digital document

System: ✓ "digital document" → dcterms:Document (Dublin Core)

        Perfect! Adding:

        :File a owl:Class ;
            rdfs:subClassOf dcterms:Document ;
            rdfs:label "File" ;
            rdfs:comment "A digital document" .

        :Retention a owl:Class ;
            rdfs:label "Retention" ;
            rdfs:comment "Measurement of duration for keeping files" .

        :hasRetentionPeriod a owl:DatatypeProperty ;
            rdfs:label "has retention period" ;
            rdfs:domain :File ;
            rdfs:range xsd:duration .

        Got it! Continue telling me about your domain...

User: files can be documents or records

System: (refining earlier definition)
        ✓ File has subtypes: documents and records

        Updating:

        :Document a owl:Class ;
            rdfs:subClassOf :File ;
            rdfs:label "Document" .

        :Record a owl:Class ;
            rdfs:subClassOf :File ;
            rdfs:label "Record" .

        Refined! Files now have two subtypes. What else?

Example 2: Organizational Concepts

User: The Cabinet Office is a UK government department

System: (extracts)
        ✓ "Cabinet Office" - organization entity
        ✓ "government department" → org:Organization
        ✓ "UK" - geographic qualifier

        Adding:

        :CabinetOffice a org:Organization ;
            rdfs:label "Cabinet Office" ;
            rdfs:comment "UK government department" ;
            dcterms:spatial :UnitedKingdom .

        :UnitedKingdom a dcterms:Location ;
            rdfs:label "United Kingdom" .

        Done! Continue...

User: UKSV is part of the Cabinet Office. They handle security vetting.

System: (extracts two concepts)
        ✓ "UKSV" - new entity
        ✓ "part of" → org:unitOf
        ✓ Cabinet Office already exists ✓
        ✓ "handle security vetting" - indicates purpose/activity

        Adding:

        :UKSV a org:OrganizationalUnit ;
            rdfs:label "UKSV" ;
            org:unitOf :CabinetOffice ;
            org:purpose :SecurityVetting .

        :SecurityVetting a owl:Class ;
            rdfs:subClassOf prov:Activity ;
            rdfs:label "Security Vetting" ;
            rdfs:comment "Process of checking backgrounds for security clearance" .

        Great! UKSV is now linked to Cabinet Office with its purpose defined. More?

Example 3: Temporal Concepts with Refinement

User: clearance means a person's security status

System: (extracts)
        ✓ "clearance" - concept
        ✓ "security status" - state or credential
        ✓ "person" - agent
        ✓ relationship: person has clearance

        Should I model clearance as:
        A) A type of credential/status (skos:Concept in a scheme)
        B) A property that people have

        Which makes more sense?

User: A credential that people have

System: ✓ Using SKOS for credential types
        ✓ person → prov:Agent

        :SecurityClearanceScheme a skos:ConceptScheme ;
            rdfs:label "Security Clearance Levels" .

        :Clearance a owl:Class ;
            rdfs:label "Security Clearance" ;
            rdfs:comment "A person's security status credential" .

        :hasClearance a owl:ObjectProperty ;
            rdfs:domain prov:Agent ;
            rdfs:range :Clearance .

        Got it! What else about clearances?

User: SC clearance lasts for 10 years

System: (extracts)
        ✓ "SC clearance" - specific clearance type
        ✓ "10 years" → time:Duration, ISO 8601: "P10Y"
        ✓ "lasts for" → validity period property

        Adding:

        :SCClearance a :Clearance ;
            skos:inScheme :SecurityClearanceScheme ;
            skos:prefLabel "SC Clearance" ;
            skos:notation "SC" ;
            :validityPeriod "P10Y"^^xsd:duration .

        :validityPeriod a owl:DatatypeProperty ;
            rdfs:label "validity period" ;
            rdfs:domain :Clearance ;
            rdfs:range xsd:duration .

        Perfect! SC Clearance now has 10-year validity (ISO 8601: P10Y).

User: actually, DV clearance also lasts 10 years

System: (extends pattern)
        ✓ Same duration as SC
        ✓ Adding another clearance type

        :DVClearance a :Clearance ;
            skos:inScheme :SecurityClearanceScheme ;
            skos:prefLabel "DV Clearance" ;
            skos:notation "DV" ;
            :validityPeriod "P10Y"^^xsd:duration .

        Added! Both SC and DV clearances have 10-year validity periods.

Example 4: Handling Semantic Ambiguity (Critical!)

[Ontology already has: :File defined as digital document]

User: we need to track physical files too

System: (detects potential conflict)
        I have "file" defined as digital documents (like .docx, .pdf).

        Are you talking about:
        A) The same thing (digital files)
        B) Something different (physical paper files)

User: B - physical files in filing cabinets are different

System: ✓ Creating distinction to handle both meanings:

        # Make File more general
        :File a owl:Class ;
            rdfs:label "File" ;
            rdfs:comment "General concept of file - see specific types" .

        # Original digital concept
        :DigitalFile a owl:Class ;
            rdfs:subClassOf :File, dcterms:Document ;
            rdfs:label "Digital File" ;
            rdfs:comment "Computer file (.docx, .pdf, etc.)" .

        # New physical concept
        :PhysicalFile a owl:Class ;
            rdfs:subClassOf :File ;
            rdfs:label "Physical File" ;
            rdfs:comment "Paper documents in filing cabinets" .

        Perfect! Now the ontology supports both digital and physical files.
        Both are valid meanings of "file" in different contexts.

        What else about physical files?

User: physical files need retention periods too

System: ✓ Already have :hasRetentionPeriod property
        ✓ Updating domain to include both file types

        :hasRetentionPeriod a owl:DatatypeProperty ;
            rdfs:label "has retention period" ;
            rdfs:domain :File ;  # Applies to all file types
            rdfs:range xsd:duration .

        Now both digital AND physical files can have retention periods!

Identifier Generation Strategy

Requirements

Human-readable - Should make sense when viewed in Protégé or TTL
No collisions - Must be unique across the ontology
Consistent - Same entity mentioned twice should get same IRI
URI-safe - No spaces, special characters

Strategy: Namespace + CamelCase + Optional Hash

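import hashlib
import re

_known_iris = set()  # placeholder: stands in for a lookup against the loaded ontology

def iri_exists(iri: str) -> bool:
    """Placeholder collision check; a real implementation would query the graph."""
    return iri in _known_iris

def generate_iri(label: str, entity_type: str, namespace: str = "http://example.org/ontology#") -> str:
    """
    Generate human-readable, collision-resistant IRI

    Examples:
    - "Cabinet Office" → http://example.org/ontology#CabinetOffice
    - "Data Protection Impact Assessment" → http://example.org/ontology#DataProtectionImpactAssessment
    - "has retention period" → http://example.org/ontology#hasRetentionPeriod
    """
    # Keep only alphanumeric words so the local name is URI-safe
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        raise ValueError(f"Cannot build an IRI from label {label!r}")

    # Convert to CamelCase, preserving acronyms such as "UKSV"
    # (str.capitalize() would turn them into "Uksv")
    clean = ''.join(word[0].upper() + word[1:] for word in words)

    # For properties, use camelCase
    if entity_type == "property":
        clean = clean[0].lower() + clean[1:]

    # Check for collision in existing ontology
    candidate = f"{namespace}{clean}"
    if not iri_exists(candidate):
        return candidate

    # If collision, append hash of full label
    hash_suffix = hashlib.md5(label.encode()).hexdigest()[:6]
    return f"{namespace}{clean}_{hash_suffix}"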
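
The "Consistent" requirement (same entity mentioned twice gets the same IRI) needs some memory of labels already seen, since a bare collision check would treat a repeated mention as a clash. One possible approach is a label registry; the sketch below is an assumption for illustration, and IriRegistry and iri_for are hypothetical names.

```python
import re


class IriRegistry:
    """Remembers which IRI was minted for each normalized label."""

    def __init__(self, namespace: str = "http://example.org/ontology#"):
        self.namespace = namespace
        self._by_label = {}  # normalized label -> IRI

    @staticmethod
    def _normalize(label: str) -> str:
        # Case- and whitespace-insensitive key, so "cabinet  office" == "Cabinet Office".
        return " ".join(label.lower().split())

    def iri_for(self, label: str) -> str:
        key = self._normalize(label)
        if key not in self._by_label:
            words = re.findall(r"[A-Za-z0-9]+", label)
            if not words:
                raise ValueError(f"Cannot build an IRI from label {label!r}")
            local = "".join(word[0].upper() + word[1:] for word in words)
            self._by_label[key] = self.namespace + local
        return self._by_label[key]


registry = IriRegistry()
print(registry.iri_for("Cabinet Office"))   # http://example.org/ontology#CabinetOffice
print(registry.iri_for("cabinet  office"))  # same IRI as above
```

Because the first spelling seen determines the local name, later mentions in a different case still resolve to the original IRI.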

Namespace Convention

@prefix : <http://example.org/ontology#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

LLM System Prompt Design

The conversational agent should use a system prompt that:

Core Instructions

You are a semantic knowledge extraction assistant. Your role is to help users build a formal RDF ontology by **proactively extracting definitions and meanings** from natural conversation.

## Your Approach: Proactive Definition Extraction

**Listen for definitional patterns** in user statements:
- "X is Y" → X is a type of Y (rdfs:subClassOf)
- "X is how we Y" → Extract X as concept with purpose Y
- "X means Y" → Definition or alternative label
- "X has Y" → Property relationship
- "X can be Y or Z" → Subtypes
- "X is part of Y" → Structural relationship

**Extract proactively** - don't wait to be asked. When someone says "retention is how we measure how long files are kept", immediately:
1. Identify concepts: retention, file, measurement, duration
2. Extract relationships: retention → measures → duration → for files
3. Ground in standards: duration → time:Duration, file → dcterms:Document
4. Build triples and show user

**Only ask questions when**:
- Definition is genuinely unclear or ambiguous
- Multiple grounding options exist (e.g., "Is X a type or instance?")
- Need to resolve which standard ontology to use

**Remember and use context**:
- Once "file" is defined, use that definition consistently
- When user refines ("files can be documents or records"), update the ontology
- Build on previous definitions without re-asking

**Handle semantic flexibility (CRITICAL)**:
- The same word can have multiple valid meanings in different contexts
- If a term exists but user's usage doesn't match, ask: "I have [term] as [existing definition]. Are you talking about that, or something different?"
- If different, create disambiguated concepts (e.g., :DigitalFile vs :PhysicalFile)
- NEVER rigidly enforce first definition as only definition
- Multiple senses are normal - "file" can be digital file OR physical file, both valid

## Your Responsibilities
1. **Proactively extract** definitions, concepts, and relationships from user statements
2. **Ground in standards** - link to OWL-Time, org:, PROV-O, SKOS, Dublin Core
3. **Generate valid RDF** using proper Turtle syntax
4. **Show your work** - display the triples you're creating
5. **Use ISO 8601** for all temporal durations (P7Y for 7 years, P12M for 12 months)
6. **Be conversational** - friendly tone, celebrate discoveries, encourage continuation

## Standard Ontologies You Know
- **OWL-Time (time:)** - temporal concepts, durations, relations (ISO 8601)
- **Organization Ontology (org:)** - organizations, units, roles, membership
- **PROV-O (prov:)** - activities, agents, entities, provenance
- **SKOS (skos:)** - concept schemes, labels, taxonomies
- **Dublin Core (dcterms:)** - metadata, documents, descriptions
- **RDFS/OWL** - classes, properties, hierarchies, annotations

## Definition Extraction Patterns

**Pattern: "X is Y"**

User: "retention is a measure of duration"
→ Extract: retention as concept, relates to duration (time:Duration)

**Pattern: "X is how we Y"**

User: "retention is how we measure how long files are kept"
→ Extract: retention (concept), file (entity), temporal measurement (time:Duration)

**Pattern: "X has Y"**

User: "files have creation dates"
→ Create property: :hasCreationDate, domain: :File, range: xsd:dateTime

**Pattern: "X can be Y or Z"**

User: "files can be documents or records"
→ Create subtypes: :Document and :Record as subclasses of :File

**Pattern: Refinement**

User already said: "a file is a digital document"
User now says: "actually files can be records too"
→ Update ontology: add :Record as another subtype


## Output Format
Always show the Turtle triples you're generating:

:EntityName a owl:Class ;
    rdfs:subClassOf standardOntology:Concept ;
    rdfs:label "Human Label" ;
    rdfs:comment "Description from context" .

Mark grounding with ✓: "✓ Using org:Organization"

Be conversational, precise, and encouraging! This is collaborative knowledge building.

Implementation Requirements

Phase 1: Proactive Definition Extraction

Load standard ontology definitions (hardcoded mappings to W3C specs)
Initialize empty domain ontology (TTL file)
Accept user input (natural conversational statement)
Proactively extract definitions using pattern matching:
- "X is Y" patterns
- "X is how we..." patterns
- "X has Y" patterns
- "X can be Y or Z" patterns
- "X means Y" patterns
For each extracted concept:
- Check if already in ontology (use existing or refine)
- Extract context and meaning from statement
- Identify grounding in standard ontologies
Generate triples with grounding
Append to domain_ontology.ttl
Show user what was extracted and added
Continue conversation naturally

Phase 2: Temporal & Units Handling

Detect temporal expressions proactively ("7 years", "12 months", "lasts for X")
Always convert to ISO 8601 durations (P7Y, P12M, P30D)
Ground in OWL-Time concepts (time:Duration, time:Instant)
Create temporal properties as needed (validityPeriod, retentionPeriod)
Handle temporal relations ("before X", "after Y") using time:before, time:after

Phase 3: Relationship Extraction

Detect relationships in natural statements proactively
Common patterns:
- "X is part of Y" → org:unitOf or dcterms:isPartOf
- "X reports to Y" → org:reportsTo
- "X has Y" → create property :hasY
- "X handles Y" / "X does Y" → org:purpose or prov:activity
Always ground properties in standard ontologies when possible
Create custom properties only when no standard exists

Phase 4: Context Memory & Refinement

Maintain conversation context (remember what's been defined)
When user refines earlier definition, update triples (don't duplicate)
Handle additive refinements: "files can also be records"
Handle corrections: "actually, X is really Y, not Z"
Show what changed when refining
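
A minimal sketch of additive refinements versus corrections, assuming triples are held as a set of (subject, predicate, object) tuples rather than an RDFLib graph. The refine function and its additive flag are hypothetical; the returned sets are what the system would show the user as "what changed".

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def refine(triples: Set[Triple], subject: str, predicate: str,
           new_object: str, additive: bool) -> Tuple[Set[Triple], Set[Triple]]:
    """Apply a refinement in place and return (added, removed) for display."""
    new = (subject, predicate, new_object)
    removed: Set[Triple] = set()
    if not additive:
        # A correction replaces earlier values for this subject/predicate pair.
        removed = {t for t in triples
                   if t[:2] == (subject, predicate) and t != new}
        triples.difference_update(removed)
    # A set never stores duplicates, so a repeated statement adds nothing.
    added = {new} - triples
    triples.update(added)
    return added, removed


store = {(":SCClearance", ":validityPeriod", "P7Y")}
added, removed = refine(store, ":SCClearance", ":validityPeriod", "P10Y",
                        additive=False)
print(added)    # {(':SCClearance', ':validityPeriod', 'P10Y')}
print(removed)  # {(':SCClearance', ':validityPeriod', 'P7Y')}
```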

Phase 4.5: Semantic Flexibility & Disambiguation (CRITICAL)

Detect conflicting usage: When user mentions existing term but context doesn't match
Ask clarifying question: "I have X as [definition]. Same thing or different?"
Create disambiguated concepts if different:
- Qualified names: :DigitalFile vs :PhysicalFile (not File1, File2)
- Update original to be more general if needed
- Ground each sense in appropriate standard ontology
Handle polysemy gracefully: Same word, multiple valid meanings = normal
Test cases:
- "file" as digital document, then "physical files"
- "record" as database record, then "legal records"
- "clearance" as security credential, then "physical clearance"

Phase 5: Validation & Quality

Validate generated Turtle syntax (parse with RDFLib)
Check grounding coverage (every concept links to standard ontology)
Validate ISO 8601 duration syntax
Ensure all entities have rdfs:label and rdfs:comment
Report ontology statistics after additions (X classes, Y properties, Z% grounded)
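
Two of the checks above (duration syntax and grounding coverage) can be sketched with the standard library alone. The regular expression is a simplified approximation of the xsd:duration lexical form (no sign, no weeks), and is_valid_duration and grounding_coverage are hypothetical names.

```python
import re
from typing import Dict, Optional

# Simplified form: P[nY][nM][nD][T[nH][nM][nS]] with at least one component.
_DURATION = re.compile(
    r"^P(?=\d|T\d)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?$"
)


def is_valid_duration(value: str) -> bool:
    """Check a string against the simplified duration pattern above."""
    return _DURATION.match(value) is not None


def grounding_coverage(groundings: Dict[str, Optional[str]]) -> float:
    """Percentage of concepts that have a standard-ontology grounding."""
    if not groundings:
        return 0.0
    grounded = sum(1 for target in groundings.values() if target)
    return 100.0 * grounded / len(groundings)


print(is_valid_duration("P10Y"))     # True
print(is_valid_duration("7 years"))  # False
print(grounding_coverage({"UKSV": "org:OrganizationalUnit", "Retention": None}))  # 50.0
```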

Success Criteria

A successful implementation should:

Functional Requirements

✅ Start with zero domain knowledge (empty ontology file)
✅ Proactively extract definitions from natural conversational statements
✅ Recognize definitional patterns ("X is Y", "X is how we...", "X has Y")
✅ Ground all concepts in standard W3C ontologies (time:, org:, prov:, skos:, dcterms:)
✅ Generate valid Turtle/RDF syntax
✅ Use human-readable IRIs without collisions
✅ Handle temporal expressions with ISO 8601 (P7Y, P12M)
✅ Remember context and use definitions consistently
✅ Allow refinement of definitions ("files can also be records")
✅ Handle semantic flexibility - detect when same word has different meanings, disambiguate gracefully
✅ Never treat first definition as immutable - ask when usage doesn't match existing definition
✅ Persist growing ontology to TTL file after each addition
✅ Show user what's being extracted and added in real-time

Quality Requirements

✅ Every new class/property traces back to standard ontology
✅ Uses standard namespaces exclusively for grounding
✅ All temporal durations in ISO 8601 (P7Y, never "7 years")
✅ Consistent identifier generation (same term → same IRI across conversation)
✅ Valid RDF (parseable by RDFLib, viewable in Protégé)
✅ Human-readable IRIs (CamelCase for classes, camelCase for properties)
✅ All entities have rdfs:label and rdfs:comment
✅ No orphaned concepts (all grounded in standards)

User Experience

✅ Proactive, not reactive - extracts without being explicitly asked
✅ Conversational and friendly tone
✅ Explains grounding decisions clearly ("✓ Using org:Organization")
✅ Shows generated triples after each extraction
✅ Asks questions only when genuinely unclear
✅ Remembers what's been defined (doesn't re-ask)
✅ Handles refinements gracefully ("updating earlier definition...")
✅ Encourages continuation ("What else?", "Tell me more...")
✅ Celebrates discoveries ("Perfect! Adding...", "Got it!")

Example End-to-End Session

System: Hi! I'm here to help you build a formal ontology through conversation.
        I start knowing only standard W3C vocabularies (Organizations, Time, Provenance).
        Tell me about your domain, and I'll ask questions to understand it properly.

User: The Cabinet Office is a UK government department

System: Great! Let me add that...
        ✓ "government department" → org:Organization (W3C Organization Ontology)
        ✓ "UK" → I'll add a geographic qualifier

        Adding:
        :CabinetOffice a org:Organization ;
            rdfs:label "Cabinet Office" ;
            rdfs:comment "UK government department" ;
            dcterms:spatial :UnitedKingdom .

        :UnitedKingdom a dcterms:Location ;
            rdfs:label "United Kingdom" .

        Done! What else can you tell me about the Cabinet Office?

User: UKSV is part of the Cabinet Office and handles security vetting

System: Let me unpack that...
        ✓ "part of" → org:unitOf
        ✓ "handles" suggests an activity/responsibility

        What is UKSV - an organizational unit, team, or something else?

User: It's an organizational unit

System: Perfect!
        ✓ org:OrganizationalUnit

        And "security vetting" - is that a process/activity they perform?

User: Yes, it's the process of checking people's backgrounds

System: ✓ Process → prov:Activity

        Adding:
        :UKSV a org:OrganizationalUnit ;
            rdfs:label "UKSV" ;
            org:unitOf :CabinetOffice ;
            org:purpose :SecurityVetting .

        :SecurityVetting a owl:Class ;
            rdfs:subClassOf prov:Activity ;
            rdfs:label "Security Vetting" ;
            rdfs:comment "Process of checking people's backgrounds for security clearance" .

        Excellent! Your ontology now has:
        - 2 organizations (Cabinet Office, UKSV)
        - 1 activity type (Security Vetting)
        - All grounded in W3C standards ✓

        What else?

User: People need SC clearance which lasts for 10 years

System: Let me break this down...
        ✓ "SC clearance" sounds like a credential or status - is that right?

User: Yes, it's a security clearance level

System: ✓ I'll model as skos:Concept in a clearance scheme
        ✓ "lasts for 10 years" → time:Duration with ISO 8601

        Adding:
        :SecurityClearanceScheme a skos:ConceptScheme ;
            rdfs:label "Security Clearance Levels" .

        :SCClearance a skos:Concept ;
            skos:inScheme :SecurityClearanceScheme ;
            skos:prefLabel "SC Clearance" ;
            skos:notation "SC" ;
            :validityPeriod "P10Y"^^xsd:duration .  # ISO 8601: 10 years

        :validityPeriod a owl:DatatypeProperty ;
            rdfs:label "validity period" ;
            rdfs:domain skos:Concept ;
            rdfs:range xsd:duration .  # datatype range, matching the "P10Y"^^xsd:duration literal

        Done! SC Clearance is now in your ontology with a 10-year validity period (ISO 8601: P10Y).

        Continue building?

CLI Interface

Provide a simple CLI to run conversations:

# Start fresh conversation
python -m ontology_builder.cli --new-ontology domain.ttl

# Continue building existing ontology
python -m ontology_builder.cli --ontology domain.ttl

# Validate current ontology
python -m ontology_builder.cli --validate domain.ttl

# Show ontology stats
python -m ontology_builder.cli --stats domain.ttl
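
The four invocations above could be parsed roughly as follows. This is a sketch only: the dependency list names click, but this example uses the standard library's argparse so that it runs without extra packages. The helper names `build_parser` and `selected_mode` are assumptions, not part of the spec.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser in which exactly one of the four modes must be given."""
    parser = argparse.ArgumentParser(
        prog="ontology_builder.cli",
        description="Build an RDF ontology through conversation.",
    )
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--new-ontology", metavar="TTL",
                      help="start a fresh conversation writing to this file")
    mode.add_argument("--ontology", metavar="TTL",
                      help="continue building an existing ontology")
    mode.add_argument("--validate", metavar="TTL",
                      help="validate the ontology file")
    mode.add_argument("--stats", metavar="TTL",
                      help="show ontology statistics")
    return parser


def selected_mode(args: argparse.Namespace) -> tuple:
    """Return (mode_name, ttl_path) for whichever option was supplied."""
    for name in ("new_ontology", "ontology", "validate", "stats"):
        path = getattr(args, name)
        if path is not None:
            return name, path
    raise ValueError("no mode selected")


args = build_parser().parse_args(["--stats", "domain.ttl"])
mode, path = selected_mode(args)
```

Making the options mutually exclusive means a call such as `--validate a.ttl --stats a.ttl` is rejected by the parser rather than silently picking one mode.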

Development Notes

Testing Strategy

Definition extraction tests - Given statement "X is Y", extract correct relationship
Pattern matching tests - Test each definitional pattern ("X is how we...", "X has Y")
Grounding tests - Ensure concepts ground in correct standard ontology
Identifier tests - Human-readable generation, collision detection, consistency
Temporal tests - "7 years" → "P7Y", "12 months" → "P12M"
Refinement tests - Update ontology when definition is refined, don't duplicate
Context memory tests - Use earlier definitions consistently in later extractions
Integration tests - Full conversation flows from examples
Validation tests - All generated TTL parses correctly with RDFLib
Mock LLM responses for deterministic testing
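
As an illustration of the temporal tests, here is a minimal sketch of a converter together with the two cases named above. The function name `to_iso8601_duration` is hypothetical, and the sketch only handles a single whole-number quantity in years, months, or days; richer expressions ("a year and a half") would need the LLM or a fuller parser.

```python
import re

# Unit word (singular) -> ISO 8601 date designator.
_DESIGNATORS = {"year": "Y", "month": "M", "day": "D"}
_DURATION = re.compile(r"^\s*(\d+)\s*(year|month|day)s?\s*$", re.IGNORECASE)


def to_iso8601_duration(text: str) -> str:
    """Convert phrases like '7 years' into ISO 8601 durations like 'P7Y'."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    amount, unit = match.groups()
    return f"P{int(amount)}{_DESIGNATORS[unit.lower()]}"


def test_temporal_examples() -> None:
    assert to_iso8601_duration("7 years") == "P7Y"
    assert to_iso8601_duration("12 months") == "P12M"


test_temporal_examples()
```

Raising on unrecognised input, rather than returning the free text, keeps non-ISO strings from reaching the ontology.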

Code Quality

Type hints throughout
Docstrings with examples
Logging of all ontology changes
Version control the growing ontology

Future Extensions (Out of Scope for Now)

Export to other formats (JSON-LD, RDF/XML)
Import existing ontologies to extend
SPARQL query interface
Triplestore integration (Blazegraph, Oxigraph)
Web UI for conversation
Integration with policy translation system

Dependencies

# requirements.txt
rdflib>=7.0.0
openai>=1.0.0  # or anthropic if using Claude
python-dotenv>=1.0.0
click>=8.0.0  # for CLI
rich>=13.0.0  # for nice terminal output
pyyaml>=6.0.0
isodate>=0.6.0  # for ISO 8601 duration parsing/validation

Getting Started for Builder

Read this spec thoroughly - Understand the proactive extraction approach
Study standard ontologies - Review W3C specs for time:, org:, prov:, skos:, dcterms:
Analyze conversation examples - See how definitions are extracted from natural statements
Understand definitional patterns:
- "X is Y" → subclass relationship
- "X is how we Y" → concept with purpose
- "X has Y" → property creation
- "X can be Y or Z" → taxonomy building
Implement in this order:
a. Standard ontology loader (hardcoded mappings to W3C specs)
b. Identifier generator (human-readable, collision-free IRIs)
c. Definition extractor (pattern matching for definitional statements)
d. Grounding engine (map extracted concepts to standard ontologies)
e. Ontology manager (RDFLib operations, TTL serialization)
f. Conversational agent (LLM integration with proactive extraction prompts)
g. CLI interface (conversation runner)
Test with examples - Run through the conversation examples in this doc
Validate outputs - Ensure generated TTL is valid RDF (parse with RDFLib)
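
The four definitional patterns listed above could be prototyped with plain regular expressions before any LLM is involved. The sketch below is illustrative only: the names `Extraction` and `extract` and the `kind` labels are assumptions, and the patterns are deliberately naive (a sentence such as "X is a thing that has Y" would be misread as a property statement).

```python
import re
from typing import NamedTuple, Optional


class Extraction(NamedTuple):
    kind: str        # "purpose", "taxonomy", "property", or "subclass"
    subject: str
    objects: tuple   # one or more object terms


# Order matters: more specific phrasings are tried before the bare "X is Y".
_PATTERNS = [
    ("purpose", re.compile(r"^(?P<s>.+?) is how we (?P<o>.+)$", re.IGNORECASE)),
    ("taxonomy", re.compile(r"^(?P<s>.+?) can be (?P<o>.+)$", re.IGNORECASE)),
    ("property", re.compile(r"^(?P<s>.+?) (?:has|have) (?P<o>.+)$", re.IGNORECASE)),
    ("subclass", re.compile(r"^(?P<s>.+?) (?:is|are) (?P<o>.+)$", re.IGNORECASE)),
]
_ARTICLE = re.compile(r"^(?:an?|the)\s+", re.IGNORECASE)


def _clean(term: str) -> str:
    """Trim whitespace and drop one leading article."""
    return _ARTICLE.sub("", term.strip())


def extract(statement: str) -> Optional[Extraction]:
    """Return the first matching definitional pattern, or None."""
    text = statement.strip().rstrip(".")
    for kind, pattern in _PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        raw = match.group("o")
        # Taxonomy statements list alternatives: "Y or Z", "Y, Z or W".
        parts = re.split(r"\s*,\s*|\s+or\s+", raw) if kind == "taxonomy" else [raw]
        objects = tuple(_clean(p) for p in parts if p.strip())
        return Extraction(kind, _clean(match.group("s")), objects)
    return None


result = extract("A file can be a document or a record")
```

A statement that matches none of the patterns returns None, which is the natural point to hand over to the LLM or to ask a clarifying question.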

Key Architectural Decisions

Proactive extraction, not reactive - Extract definitions from statements, don't wait to be asked
Definitional focus, not policy - Building semantic knowledge, not translating rules
File-based TTL, not triplestore - Simple persistence for now (can scale later)
Standard ontologies hardcoded - Reference by namespace, no downloads needed
ISO 8601 mandatory for temporal - All durations as P7Y, P12M, never free text
Every concept grounds in standards - No orphaned entities, all trace to W3C ontologies
Human-readable IRIs - CamelCase classes, camelCase properties, collision detection
Conversational memory - Remember definitions, use consistently, allow refinement
Semantic flexibility - Same word can have multiple valid meanings (polysemy), disambiguate when needed
Never rigid definitions - First definition isn't gospel, ask when usage conflicts
LLM as knowledge mediator - Bridges natural language ↔ formal semantics
Zero assumed knowledge - Start from empty ontology, build collaboratively
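
The identifier decisions (CamelCase classes, camelCase properties, same term → same IRI, collision detection) can be sketched as a small registry. Everything here is an assumption for illustration: the class name `IdentifierRegistry`, the `mint` method, and in particular the numeric-suffix strategy for collisions, which the spec does not prescribe.

```python
import re


def _words(term: str) -> list:
    """Split a term into alphanumeric words, discarding punctuation."""
    return re.findall(r"[A-Za-z0-9]+", term)


def class_name(term: str) -> str:
    """'security vetting' -> 'SecurityVetting' (existing capitals are kept)."""
    return "".join(w[:1].upper() + w[1:] for w in _words(term))


def property_name(term: str) -> str:
    """'validity period' -> 'validityPeriod'."""
    name = class_name(term)
    return name[:1].lower() + name[1:]


class IdentifierRegistry:
    """Mint local names: same term -> same name, distinct terms never share one."""

    def __init__(self) -> None:
        self._by_term = {}   # (kind, normalised term) -> local name
        self._taken = set()  # every local name handed out so far

    def mint(self, term: str, kind: str = "class") -> str:
        key = (kind, " ".join(_words(term)).lower())
        if key in self._by_term:
            return self._by_term[key]
        base = class_name(term) if kind == "class" else property_name(term)
        if not base:
            raise ValueError(f"cannot build an identifier from {term!r}")
        candidate, suffix = base, 2
        while candidate in self._taken:  # collision with a different term
            candidate = f"{base}{suffix}"
            suffix += 1
        self._by_term[key] = candidate
        self._taken.add(candidate)
        return candidate


registry = IdentifierRegistry()
first = registry.mint("Cabinet Office")
again = registry.mint("cabinet  office")
prop = registry.mint("validity period", kind="property")
```

In a fuller implementation a collision would probably trigger a clarifying question (is this the same concept or a different sense of the word?) instead of a silent suffix, in line with the polysemy handling described above.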

Design Philosophy

The system should feel like a collaborative conversation with a semantic expert who helps you formalize your domain knowledge, not a form-filling exercise or question-answer bot.

Key principles:

Listen proactively for definitions
Extract meaning without requiring formal language
Ground in established standards for interoperability
Remember context across the conversation
Allow natural refinement and correction
Embrace semantic flexibility - words can have multiple valid meanings
Never treat first definition as immutable truth
Disambiguate gracefully when conflicts arise
Show what's being built in real-time
Encourage exploration and elaboration

This specification is complete and actionable. A new Claude Code instance should be able to implement the conversational ontology builder from this document.

Focus: Semantic knowledge extraction through natural dialogue, not policy translation.
