Add HEADING_METADATA feature for Markdown title extraction #3539

mcepl · 2025-12-16T19:17:37Z

This feature addresses the GitHub discussion issue #3290 (comment)

Previously, Markdown files without explicit 'Title:' metadata would fail to get titles extracted from their first heading, unlike RST files which support both metadata and heading-based title extraction.

New Configuration Options:

HEADING_METADATA: Enable/disable heading metadata extraction
HEADING_METADATA_MAP: Map heading levels to metadata fields
HEADING_METADATA_PATTERNS: Custom regex patterns for extraction

Key Features:

Extract metadata from markdown headings (# ## ### etc.)
Support custom patterns (e.g., "### Author: Name")
Metadata processors integration
Regular metadata takes precedence over heading metadata
Unicode and special character support
Graceful error handling

Implementation Details:

Added _extract_heading_metadata() method to MarkdownReader
Modified read() method to extract headings before markdown conversion
Backward compatible - no existing functionality changed
Performance optimized with regex-based parsing

Test Coverage:

Basic title/subtitle/summary extraction
Unicode character support (Czech, accented)
Metadata priority override behavior
Custom pattern matching
Edge cases and error handling
Performance benchmarking (936KB/s throughput)
Real-world scenario validation

Resolves issue where markdown files like:
# My Article Title
Date: 2023-12-01
Category: tech

Can now have titles extracted without manual Title: metadata.

Files modified:

pelican/settings.py: Add new configuration options
pelican/readers.py: Implement heading extraction functionality

Pull Request Checklist

Resolves: #issue-number-here

Ensured tests pass and (if applicable) updated functional test output
Conformed to code style guidelines by running appropriate linting tools
Added tests for changed code
Updated documentation for changed code

Previously, Markdown files without explicit 'Title:' metadata would fail to get titles extracted from their first heading, unlike RST files which support both metadata and heading-based title extraction. New Configuration Options: - HEADING_METADATA: Enable/disable heading metadata extraction - HEADING_METADATA_MAP: Map heading levels to metadata fields - HEADING_METADATA_PATTERNS: Custom regex patterns for extraction Resolves issue where markdown files like: Date: 2023-12-01 Category: tech # My Article Title Can now have titles extracted without manual Title: metadata. Fixes also the issue where Markdown articles showed duplicate titles when HEADING_METADATA=True was set in configuration. References: getpelican#3290 Signed-off-by: Matěj Cepl <[email protected]>

mcepl force-pushed the HEADING_METADATA branch 3 times, most recently from 756aa04 to aaf2a4a Compare December 16, 2025 23:14

mcepl force-pushed the HEADING_METADATA branch from aaf2a4a to 7cd88ee Compare December 16, 2025 23:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Add HEADING_METADATA feature for Markdown title extraction #3539

Add HEADING_METADATA feature for Markdown title extraction #3539

mcepl commented Dec 16, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Add HEADING_METADATA feature for Markdown title extraction #3539

Are you sure you want to change the base?

Add HEADING_METADATA feature for Markdown title extraction #3539

Conversation

mcepl commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mcepl commented Dec 16, 2025 •

edited

Loading