A powerful Rust CLI tool for splitting markdown documents into multiple files based on page breaks. Supports both local files and remote URLs with configurable splitting strategies.
- Multiple Input Sources: Support for local files and HTTP/HTTPS URLs
- Flexible Page Detection: Automatic detection of page breaks using various patterns
- Custom Page Markers: Define your own page break patterns
- Batch Processing: Process multiple markdown files simultaneously
- Smart Splitting: Calculate optimal page distribution across splits
- Metadata Generation: Optional metadata files with split information
- Structure Preservation: Maintain document structure with separators
- Analysis Mode: Analyze documents without splitting
- Validation: Verify input sources before processing
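
The README does not spell out how "Smart Splitting" distributes pages, so the following is only a minimal sketch of one plausible approach: spread pages as evenly as possible and give the remainder to the earliest splits. The function name `distribute_pages` and the half-open range representation are assumptions made for illustration, not part of the tool.

```rust
/// Distribute `total_pages` pages across `splits` parts as evenly as possible.
/// Returns half-open `(start, end)` ranges over 0-based page indices.
/// Hypothetical helper: md-split's actual strategy may differ.
fn distribute_pages(total_pages: usize, splits: usize) -> Vec<(usize, usize)> {
    if total_pages == 0 || splits == 0 {
        return Vec::new();
    }
    // Never produce more parts than there are pages.
    let parts = splits.min(total_pages);
    let base = total_pages / parts;
    // The first `extra` parts receive one additional page.
    let extra = total_pages % parts;

    let mut ranges = Vec::with_capacity(parts);
    let mut start = 0;
    for i in 0..parts {
        let len = base + if i < extra { 1 } else { 0 };
        ranges.push((start, start + len));
        start += len;
    }
    ranges
}
```

Under these assumptions, `distribute_pages(15, 3)` yields three ranges of five pages each, and `distribute_pages(10, 4)` yields parts of 3, 3, 2, and 2 pages.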
# Clone the repository
git clone <repository-url>
cd markdown-splitter
# Build the project
cargo build --release
# The binary will be available at target/release/md-split

Split a single markdown file into 5 parts:
md-split split document.md --splits 5

Split multiple files:
md-split split file1.md file2.md file3.md --splits 3

Split from URLs:
md-split split https://raw.githubusercontent.com/user/repo/main/README.md --splits 4

Mix local files and URLs:
md-split split local-file.md https://example.com/remote.md --splits 2

Specify custom output directory:
md-split split document.md --splits 5 --output ./my-output

Use custom page break marker:
md-split split document.md --splits 3 --page-marker '<!-- SPLIT HERE -->'

Force overwrite existing files:
md-split split document.md --splits 5 --force

Disable structure preservation:
md-split split document.md --splits 5 --preserve-structure false

Skip metadata generation:
md-split split document.md --splits 5 --include-metadata false

Analyze documents without splitting:
md-split analyze document.md

Detailed analysis with page information:
md-split analyze document.md --detailed

Save analysis to JSON:
md-split analyze document.md --json-output analysis.json

Validate input sources:
md-split validate file1.md https://example.com/file2.md

Check accessibility:
md-split validate file1.md --check-access

The tool automatically detects page breaks using these patterns:
- Horizontal Rules: ---, ***, ___
- HTML Comments: <!-- page break -->, <!-- pagebreak -->
- LaTeX Commands: \pagebreak, \newpage
- Headers: Any markdown header (#, ##, etc.)
- Custom Markers: User-defined regex patterns
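
The built-in patterns listed above could be checked line by line roughly as follows. This is a sketch using only the standard library, not the tool's implementation: the `PageBreak` enum and both function names are hypothetical, and it assumes exact matches after trimming whitespace and ATX-style headers (one to six `#` followed by a space). User-defined regex markers are left out because they would need a regex crate.

```rust
/// Built-in page break kinds from the list above (names are illustrative).
#[derive(Debug, PartialEq)]
enum PageBreak {
    HorizontalRule,
    HtmlComment,
    LatexCommand,
    Header,
}

/// Classify a single line, ignoring leading and trailing whitespace.
/// Assumption: patterns must match the whole trimmed line exactly.
fn classify_line(line: &str) -> Option<PageBreak> {
    let t = line.trim();
    match t {
        "---" | "***" | "___" => return Some(PageBreak::HorizontalRule),
        "<!-- page break -->" | "<!-- pagebreak -->" => return Some(PageBreak::HtmlComment),
        "\\pagebreak" | "\\newpage" => return Some(PageBreak::LatexCommand),
        _ => {}
    }
    // ATX header: one to six '#' characters followed by a space.
    let hashes = t.chars().take_while(|&c| c == '#').count();
    if (1..=6).contains(&hashes) && t[hashes..].starts_with(' ') {
        return Some(PageBreak::Header);
    }
    None
}

/// Return the 0-based indices of lines that look like page breaks.
fn find_page_breaks(text: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| classify_line(line).is_some())
        .map(|(i, _)| i)
        .collect()
}
```

With this sketch, a line such as `## Usage` is classified as a header, while `#hashtag` is not, because no space follows the `#`.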
You can define custom page break patterns using regex:
# Split on custom HTML comments
md-split split document.md --page-marker '<!-- NEW PAGE -->' --splits 3
# Split on specific markdown syntax
md-split split document.md --page-marker "^=== BREAK ===$" --splits 4

When splitting document.md into 3 parts, the output structure will be:
output/
├── document_split_1_of_3.md
├── document_split_2_of_3.md
├── document_split_3_of_3.md
└── document_metadata.json (if --include-metadata)
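
The naming scheme in the tree above can be reproduced with a few lines of Rust. This is a sketch inferred from the example output only: the helper names `split_file_names` and `metadata_file_name` are hypothetical, and falling back to the stem "document" for unusual paths is an assumption.

```rust
use std::path::Path;

/// Extract the file stem ("document" from "docs/document.md").
/// Assumption: fall back to "document" if the path has no usable stem.
fn stem_of(source: &str) -> &str {
    Path::new(source)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("document")
}

/// Build names of the form `<stem>_split_<n>_of_<total>.md`.
fn split_file_names(source: &str, total_splits: usize) -> Vec<String> {
    let stem = stem_of(source);
    (1..=total_splits)
        .map(|n| format!("{}_split_{}_of_{}.md", stem, n, total_splits))
        .collect()
}

/// Build the metadata file name `<stem>_metadata.json`.
fn metadata_file_name(source: &str) -> String {
    format!("{}_metadata.json", stem_of(source))
}
```

For `document.md` and three splits this produces the same three file names shown in the tree, plus `document_metadata.json`.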
{
  "source": "document.md",
  "total_pages": 15,
  "total_splits": 3,
  "split_files": ["document_split_1_of_3.md", "document_split_2_of_3.md", "document_split_3_of_3.md"],
  "document_metadata": {
    "filename": "document.md",
    "source_type": "LocalFile",
    "created_at": "2025-01-15T10:30:00Z",
    "total_lines": 500,
    "page_breaks": [0, 120, 250, 380, 500]
  },
  "split_info": [
    {
      "split_number": 1,
      "filename": "document_split_1_of_3.md",
      "path": "./output/document_split_1_of_3.md"
    }
  ]
}

Split a large academic paper into sections:
md-split split research-paper.md --splits 4 --output ./paper-sections

Process multiple documentation files:
md-split split \
  docs/intro.md \
  docs/tutorial.md \
  docs/advanced.md \
  --splits 2 \
  --preserve-structure true \
  --output ./split-docs

Split content directly from GitHub:
md-split split \
  https://raw.githubusercontent.com/rust-lang/book/main/src/README.md \
  --splits 3 \
  --output ./rust-book-splits

Use custom markers for specialized documents:
md-split split manual.md \
  --page-marker '^<!-- CHAPTER .* -->$' \
  --splits 5 \
  --detailed

Analyze before splitting to determine optimal split count:
# First analyze
md-split analyze large-document.md --detailed
# Then split based on analysis
md-split split large-document.md --splits 8

The tool provides detailed error messages for common issues:
- File not found: Validates local file paths
- URL access: Checks remote URL accessibility
- Invalid regex: Validates custom page markers
- Empty documents: Handles documents with no detectable pages
- Output conflicts: Prevents accidental overwrites (use --force)
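
As an illustration of the overwrite protection in the last item, a check along these lines would refuse to write over an existing file unless forcing is requested. This is a sketch, not the tool's code: the function name `check_output`, the `force` parameter, and the error message wording are all assumptions.

```rust
use std::path::Path;

/// Refuse to overwrite an existing output file unless `force` is set.
/// Hypothetical helper; the error text is illustrative only.
fn check_output(path: &Path, force: bool) -> Result<(), String> {
    if path.exists() && !force {
        return Err(format!(
            "{} already exists; pass --force to overwrite",
            path.display()
        ));
    }
    Ok(())
}
```

Running such a check for every planned output file before writing any of them would avoid leaving a half-written set of splits behind when a conflict is found late.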
Enable verbose logging for debugging:
md-split split document.md --splits 5 --verbose

- Large Files: The tool handles large files efficiently by streaming content
- Multiple Files: Processes files sequentially to manage memory usage
- Remote URLs: Caches remote content temporarily during processing
- Output Directory: Ensure sufficient disk space for split files
- Enhanced Split Markers: Split markers now include the document name for better context
  - Format changed from <!-- Split containing pages 1 to 249 --> to <!-- EN 1993-1-2-2005 Split containing pages 1 to 249 -->
  - Automatically removes the _structured_markdown suffix from document names
  - Provides better contextual information when working with multiple split documents
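
The enhanced marker format described above can be sketched as a small formatting function. The name `split_marker` and its parameters are hypothetical; the sketch assumes the document name is passed without a file extension and that only a trailing `_structured_markdown` is stripped.

```rust
/// Build a split marker comment that includes the document name.
/// Assumption: `document_name` has no file extension.
fn split_marker(document_name: &str, first_page: usize, last_page: usize) -> String {
    // Drop the trailing "_structured_markdown" suffix if present.
    let name = document_name
        .strip_suffix("_structured_markdown")
        .unwrap_or(document_name);
    format!(
        "<!-- {} Split containing pages {} to {} -->",
        name, first_page, last_page
    )
}
```

Called with `"EN 1993-1-2-2005_structured_markdown"`, `1`, and `249`, this returns the marker shown in the changelog entry.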
- Initial release with core splitting functionality
- Support for local files and HTTP/HTTPS URLs
- Flexible page detection with multiple patterns
- Configurable splitting strategies
- Metadata generation and structure preservation
- Analysis and validation modes
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.