Improve Project Documentation for Better Contributor Onboarding #22

@MNIKIEMA

Problem

Our documentation is fragmented and lacks a unified structure, making it difficult for new and existing contributors to understand how components interact. This leads to wasted time and stalled progress.

Key Issues:

  • Minimal root README with no complete system overview
  • Disconnected submodule documentation (rag_system, src, pipeline_scripts, taxonomy)
  • No architecture diagram showing how modules interact
  • No standardized onboarding guide covering setup, testing, and deployment

Proposed Solution

1. Create Centralized Documentation Structure

Add a /docs folder as the single source of truth:

docs/
├── index.md                    # Project introduction and quick links
├── architecture/
│   ├── overview.md            # High-level architecture + component diagram
│   ├── data-flow.md           # How data moves through the system
│   └── components.md          # Detailed component descriptions
├── getting-started/
│   ├── installation.md        # Environment setup (poetry/uv)
│   ├── quick-start.md         # Run your first pipeline
│   └── configuration.md       # Settings and environment variables
├── guides/
│   ├── pdf-extraction.md      # Using the PDF extraction module
│   ├── rag-system.md          # Working with kotaemon RAG
│   ├── ingestion-pipeline.md  # Running ingestion workflows
│   ├── taxonomy.md            # Understanding taxonomies
│   └── scraping.md            # OpenAlex data extraction
├── development/
│   ├── setup.md               # Dev environment setup
│   ├── contributing.md        # Contribution guidelines
│   ├── testing.md             # Running tests
│   └── project-structure.md   # Directory layout explanation
├── deployment/
│   ├── docker.md              # Docker setup and compose
│   └── production.md          # Production considerations
└── reference/
    ├── api.md                 # Code API reference
    └── scripts.md             # Available scripts and their usage

2. Add Architecture Diagram

Create a visual data flow diagram (Mermaid or image) showing:

  • OpenAlex scraping → PDF extraction → Taxonomy classification → RAG ingestion
  • How src/, rag_system/, pipeline_scripts/, and taxonomy/ interact
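If Mermaid is chosen, the diagram source could be generated from a plain list of stages so that it stays easy to update. This is a minimal sketch: the stage labels come from the flow listed above, while the node ids and the `to_mermaid` helper are assumptions made for illustration. How each stage maps onto src/, rag_system/, pipeline_scripts/, and taxonomy/ is not stated here and would still need to be filled in.

```python
# Stage labels are taken from the proposed data flow; the short node ids
# ("scrape", "extract", ...) are hypothetical names chosen for this sketch.
STAGES = [
    ("scrape", "OpenAlex scraping"),
    ("extract", "PDF extraction"),
    ("classify", "Taxonomy classification"),
    ("ingest", "RAG ingestion"),
]


def to_mermaid(stages: list[tuple[str, str]]) -> str:
    """Render an ordered list of (node_id, label) pairs as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR"]
    # Declare every node with its human-readable label.
    for node_id, label in stages:
        lines.append(f'    {node_id}["{label}"]')
    # Connect each stage to the one that follows it.
    for (src, _), (dst, _) in zip(stages, stages[1:]):
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(STAGES))
```

The printed text can be pasted into a Mermaid code fence in docs/architecture/overview.md or the root README.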

3. Update Root README

Make it a lightweight entry point with:

  • Project overview and mission
  • Architecture diagram
  • Quick start commands
  • Links to full documentation in /docs

4. Standardize Subproject READMEs

Each main directory (src/, rag_system/, pipeline_scripts/, taxonomy/) should include:

  • Purpose and scope
  • Key files/scripts
  • Connection to other modules
  • Usage examples
  • Link back to /docs for detailed documentation
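To keep the subproject READMEs from drifting apart again, the checklist above could be enforced with a small check, for example in CI. This is a sketch under assumptions: it presumes each required item appears as a Markdown heading with exactly the wording used in this issue, and `missing_sections` and `check_repo` are hypothetical helper names.

```python
import re
from pathlib import Path

# Section names copied from the checklist; exact heading wording is an assumption.
REQUIRED_SECTIONS = [
    "Purpose and scope",
    "Key files/scripts",
    "Connection to other modules",
    "Usage examples",
]
MODULE_DIRS = ["src", "rag_system", "pipeline_scripts", "taxonomy"]

# Matches Markdown ATX headings such as "## Usage examples".
HEADING_RE = re.compile(r"^#+[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(readme_text: str) -> list[str]:
    """Return the checklist items that a README's text does not cover."""
    headings = {m.group(1).lower() for m in HEADING_RE.finditer(readme_text)}
    missing = [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
    # Crude link check: any path containing "docs/" counts as a link back.
    if "docs/" not in readme_text:
        missing.append("Link back to /docs")
    return missing


def check_repo(root: Path) -> dict[str, list[str]]:
    """Map each module directory to the list of items its README is missing."""
    report = {}
    for module in MODULE_DIRS:
        readme = root / module / "README.md"
        if readme.is_file():
            report[module] = missing_sections(readme.read_text(encoding="utf-8"))
        else:
            report[module] = ["README.md (file missing)"]
    return report


if __name__ == "__main__":
    for module, problems in check_repo(Path(".")).items():
        status = "ok" if not problems else "missing: " + ", ".join(problems)
        print(f"{module}/: {status}")
```

Matching on headings is deliberately strict. If the team prefers freer wording, the check could be relaxed to look for keywords instead.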

Expected Outcome

✅ Clear, unified documentation structure
✅ Faster contributor onboarding
✅ Easier project navigation and maintenance
✅ Better understanding of system architecture and data flow

Affected Areas

  • Root README.md
  • New /docs folder structure
  • src/, rag_system/, pipeline_scripts/, taxonomy/ READMEs
