Notemd is a comprehensive knowledge management system designed for academic and technical documentation. It provides automated processing of PDF documents into structured Markdown with Obsidian integration, featuring:
- Multi-stage document processing pipeline
- Intelligent knowledge point extraction
- Automated backlink generation for knowledge graphs
- Advanced duplicate detection algorithms
- Multi-LLM provider support (DeepSeek/OpenAI/Anthropic)
- `Generate-Documentation.ps1`: Core processing engine for individual Markdown files
- `Process-PDFPipeline.ps1`: Batch processor with advanced file handling
- `DeleteDuplicates.ps1`: Advanced duplicate detection and cleanup
- `process.py`: PDF processing core
- `generate.py`: Documentation generation
- `clean.py`: Duplicate detection and cleanup
```
Notemd-git/
├── Notemd/              # Core Python package
│   ├── __init__.py      # Package initialization
│   ├── clean.py         # Duplicate detection
│   ├── generate.py      # Documentation generation
│   ├── process.py       # PDF processing core
│   └── scripts/         # PowerShell automation scripts
├── requirements.txt     # Python dependencies
├── setup.py             # Installation script
├── *.ps1                # Processing scripts
└── .env.example         # Configuration template
```
- PowerShell 7.2+ (required for advanced scripting)
- Python 3.10+ (for document processing)
- LLM API Key (DeepSeek/OpenAI/Anthropic)
- Obsidian (for knowledge graph visualization)
- Clone the repository:

  ```bash
  git clone https://github.com/Jacobinwwey/Notemd.git
  cd Notemd
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # Linux/macOS
  venv\Scripts\activate      # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  ```

- Configure:

  ```bash
  cp .env.example .env
  # Configure paths and API keys in .env
  ```
- Converts PDF to clean Markdown with preserved structure
- Intelligent chunking for large documents (3000 words/chunk)
- Mathematical notation preservation
- Automated header normalization
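The chunking step is implemented in `process.py`; the exact code is not reproduced here, but a minimal sketch of splitting a document into ~3000-word chunks (function name hypothetical) could look like:

```python
def chunk_text(text: str, chunk_size: int = 3000) -> list[str]:
    """Split text into chunks of at most chunk_size words each."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Word-based chunking keeps each piece safely under the LLM's context limit while avoiding splits in the middle of a word.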
- AI-powered concept identification (DeepSeek/OpenAI)
- Obsidian backlink generation
- Knowledge graph node creation
- Technical terminology handling
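Backlink generation amounts to wrapping identified concepts in Obsidian's `[[...]]` wikilink syntax so each becomes a knowledge-graph node. A simplified sketch (the real logic lives in `generate.py`; the function name and first-occurrence-only policy here are illustrative):

```python
import re

def add_backlinks(markdown: str, concepts: list[str]) -> str:
    """Wrap the first occurrence of each concept in Obsidian [[...]] syntax."""
    for concept in concepts:
        pattern = re.compile(r"\b" + re.escape(concept) + r"\b")
        markdown = pattern.sub(f"[[{concept}]]", markdown, count=1)
    return markdown
```

Linking only the first occurrence keeps the Markdown readable while still creating the graph edge.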
- Batch Processing: Automatic retries and timeout handling
- Error Recovery: Robust logging and resume capabilities
- Duplicate Detection: Symbol normalization and containment checks
- Scheduling: Configurable processing intervals and cycles
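The retry behaviour around each LLM call can be pictured as exponential backoff. A hedged Python sketch, not the actual pipeline code (names and attempt counts are illustrative):

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # 1s, 2s, 4s, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```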
```ini
# Required Paths
KNOWLEDGE_BASE_PATH=/path/to/knowledge_base
SEARCH_PATH=/path/to/search/files

# LLM Configuration
LLM_PROVIDER=deepseek
DEEPSEEK_API_KEY=your_api_key
DEEPSEEK_MODEL=deepseek-reasoner

# Processing Parameters
CHUNK_SIZE=3000    # Words per processing chunk
TEMPERATURE=0.5    # AI response creativity (0.0-1.0)
MAX_TOKENS=8192    # Maximum tokens per request

# Scheduling (Process-PDFPipeline.ps1)
START_DELAY_HOURS=0.000001
TIMEOUT_HOURS=8
CHECK_INTERVAL=30
MAX_CYCLES=1000
```
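These variables are plain `KEY=VALUE` pairs; a minimal stdlib-only parser sketch is shown below (illustrative only — the project may use a dedicated library such as python-dotenv, and this simplistic version treats any `#` as starting a comment):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip inline comments
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = value.strip()
    return config
```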
```shell
# Process a single PDF into the knowledge base
notemd-process research_paper.pdf --output-dir ./knowledge_base

# Batch-process a directory of PDFs (PowerShell)
.\Process-PDFPipeline.ps1 -InputDir ./papers -OutputDir ./knowledge

# Generate documentation with a specific model and temperature
notemd-generate --model deepseek-reasoner --temperature 0.7
```
- Open Obsidian and create/open a vault
- Add the knowledge base directory to your vault
- Navigate the automatically generated knowledge graph
Edit the `$structuredPrompt` variable in the PowerShell scripts to modify how concepts are identified.
- Add new provider in Process-PDFPipeline.ps1
- Implement API calls in generate.py
- Update configuration system
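Adding a provider largely means constructing the right request payload. Many providers, DeepSeek included, expose an OpenAI-compatible chat-completions endpoint, so a hedged sketch of the payload builder (function name hypothetical, defaults taken from the configuration above) might look like:

```python
def build_chat_request(prompt: str, model: str = "deepseek-reasoner",
                       temperature: float = 0.5, max_tokens: int = 8192) -> dict:
    """Build an OpenAI-compatible chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
```

The returned dict would then be POSTed to the provider's chat endpoint with the API key from `.env`.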
- API Timeouts: Increase timeout values in configuration
- Encoding Errors: Ensure UTF-8 file handling
- Missing Backlinks: Verify content contains identifiable concepts
- Duplicate Detection: Adjust similarity thresholds in clean.py
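The symbol normalization and containment checks live in `clean.py`; a minimal illustrative sketch of how such a comparison could work (function names, normalization rules, and the Jaccard threshold are assumptions, not the project's exact algorithm):

```python
def normalize(s: str) -> str:
    """Lowercase and strip non-alphanumeric symbols for comparison."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag a pair as duplicates on exact normalized match, containment,
    or word-level Jaccard similarity above the threshold."""
    na, nb = normalize(a), normalize(b)
    if na == nb or na in nb or nb in na:
        return True
    wa, wb = set(na.split()), set(nb.split())
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold
```

Lowering the threshold makes detection more aggressive; raising it reduces false positives.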
- `processing_errors.log` - Detailed error information
- `processed.log` - Completed file records
```bash
python setup.py sdist bdist_wheel
python -m pytest tests/
```
MIT License - See LICENSE for details
- Multi-language support
- Enhanced mathematical processing
- Plugin architecture
- Web interface
- Mobile app integration