Notemd is a comprehensive knowledge management system designed for academic and technical documentation. It provides automated processing of PDF documents into structured Markdown with Obsidian integration, featuring:
- Multi-stage document processing pipeline
- Intelligent knowledge point extraction
- Automated backlink generation for knowledge graphs
- Advanced duplicate detection algorithms
- Multi-LLM provider support (DeepSeek/OpenAI/Anthropic)
- `Generate-Documentation.ps1`: Core processing engine for individual Markdown files
- `Process-PDFPipeline.ps1`: Batch processor with advanced file handling
- `DeleteDuplicates.ps1`: Advanced duplicate detection and cleanup
- `process.py`: PDF processing core
- `generate.py`: Documentation generation
- `clean.py`: Duplicate detection and cleanup
```
Notemd-git/
├── Notemd/              # Core Python package
│   ├── __init__.py      # Package initialization
│   ├── clean.py         # Duplicate detection
│   ├── generate.py      # Documentation generation
│   ├── process.py       # PDF processing core
│   └── scripts/         # PowerShell automation scripts
├── requirements.txt     # Python dependencies
├── setup.py             # Installation script
├── *.ps1                # Processing scripts
└── .env.example         # Configuration template
```
- PowerShell 7.2+ (required for advanced scripting)
- Python 3.10+ (for document processing)
- LLM API Key (DeepSeek/OpenAI/Anthropic)
- Obsidian (for knowledge graph visualization)
- Clone the repository:

  ```bash
  git clone https://github.com/Jacobinwwey/Notemd.git
  cd Notemd
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # Linux/macOS
  venv\Scripts\activate      # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  ```

- Configure:

  ```bash
  cp .env.example .env
  # Configure paths and API keys in .env
  ```
- Converts PDF to clean Markdown with preserved structure
- Intelligent chunking for large documents (3000 words/chunk)
- Mathematical notation preservation
- Automated header normalization
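The chunking step is implemented in `process.py`; the exact code is not reproduced here, but a minimal sketch of splitting a document into ~3000-word chunks (function name hypothetical) could look like:

```python
def chunk_text(text: str, chunk_size: int = 3000) -> list[str]:
    """Split text into chunks of at most chunk_size words each."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Word-based chunking keeps each piece safely under the LLM's context limit while avoiding splits in the middle of a word.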
- AI-powered concept identification (DeepSeek/OpenAI)
- Obsidian backlink generation
- Knowledge graph node creation
- Technical terminology handling
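Backlink generation amounts to wrapping identified concepts in Obsidian's `[[...]]` wikilink syntax so each becomes a knowledge-graph node. A simplified sketch (the real logic lives in `generate.py`; the function name and first-occurrence-only policy here are illustrative):

```python
import re

def add_backlinks(markdown: str, concepts: list[str]) -> str:
    """Wrap the first occurrence of each concept in Obsidian [[...]] syntax."""
    for concept in concepts:
        pattern = re.compile(r"\b" + re.escape(concept) + r"\b")
        markdown = pattern.sub(f"[[{concept}]]", markdown, count=1)
    return markdown
```

Linking only the first occurrence keeps the Markdown readable while still creating the graph edge.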
- Batch Processing: Automatic retries and timeout handling
- Error Recovery: Robust logging and resume capabilities
- Duplicate Detection: Symbol normalization and containment checks
- Scheduling: Configurable processing intervals and cycles
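The retry behaviour around each LLM call can be pictured as exponential backoff. A hedged Python sketch, not the actual pipeline code (names and attempt counts are illustrative):

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```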
```ini
# Required Paths
KNOWLEDGE_BASE_PATH=/path/to/knowledge_base
SEARCH_PATH=/path/to/search/files

# LLM Configuration
LLM_PROVIDER=deepseek
DEEPSEEK_API_KEY=your_api_key
DEEPSEEK_MODEL=deepseek-reasoner

# Processing Parameters
CHUNK_SIZE=3000    # Words per processing chunk
TEMPERATURE=0.5    # AI response creativity (0.0-1.0)
MAX_TOKENS=8192    # Maximum tokens per request

# Scheduling (Process-PDFPipeline.ps1)
START_DELAY_HOURS=0.000001
TIMEOUT_HOURS=8
CHECK_INTERVAL=30
MAX_CYCLES=1000
```
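These variables are plain `KEY=VALUE` pairs; a minimal stdlib-only parser sketch is shown below (illustrative only — the project may use a dedicated library such as python-dotenv, and this simplistic version treats any `#` as starting a comment):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip inline comments
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config
```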
```shell
# Process a single PDF into the knowledge base
notemd-process research_paper.pdf --output-dir ./knowledge_base

# Batch-process a directory of PDFs (PowerShell)
.\Process-PDFPipeline.ps1 -InputDir ./papers -OutputDir ./knowledge

# Generate documentation with a specific model and temperature
notemd-generate --model deepseek-reasoner --temperature 0.7
```
- Open Obsidian and create/open a vault
- Add the knowledge base directory to your vault
- Navigate the automatically generated knowledge graph
Edit the `$structuredPrompt` variable in the PowerShell scripts to modify how concepts are identified.
- Add new provider in Process-PDFPipeline.ps1
- Implement API calls in generate.py
- Update configuration system
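Adding a provider largely means constructing the right request payload. Many providers, DeepSeek included, expose an OpenAI-compatible chat-completions endpoint, so a hedged sketch of the payload builder (function name hypothetical, defaults taken from the configuration above) might look like:

```python
def build_chat_request(prompt: str, model: str = "deepseek-reasoner",
                       temperature: float = 0.5, max_tokens: int = 8192) -> dict:
    """Build an OpenAI-compatible chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```

The returned dict would then be POSTed to the provider's chat endpoint with the API key from `.env`.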
- API Timeouts: Increase timeout values in configuration
- Encoding Errors: Ensure UTF-8 file handling
- Missing Backlinks: Verify content contains identifiable concepts
- Duplicate Detection: Adjust similarity thresholds in clean.py
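The symbol normalization and containment checks live in `clean.py`; a minimal illustrative sketch of how such a comparison could work (function names, normalization rules, and the Jaccard threshold are assumptions, not the project's exact algorithm):

```python
def normalize(s: str) -> str:
    """Lowercase and strip non-alphanumeric symbols for comparison."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag a pair as duplicates on exact normalized match, containment,
    or word-level Jaccard similarity above the threshold."""
    na, nb = normalize(a), normalize(b)
    if na == nb or na in nb or nb in na:
        return True
    wa, wb = set(na.split()), set(nb.split())
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold
```

Lowering the threshold makes detection more aggressive; raising it reduces false positives.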
- `processing_errors.log` - Detailed error information
- `processed.log` - Completed file records
```bash
python setup.py sdist bdist_wheel
python -m pytest tests/
```
MIT License - See LICENSE for details
- Multi-language support
- Enhanced mathematical processing
- Plugin architecture
- Web interface
- Mobile app integration