mcepl commented Dec 16, 2025

This PR addresses the GitHub discussion in issue #3290 (comment).

Previously, Markdown files without explicit 'Title:' metadata did not get a title from their first heading, unlike RST files, which support both metadata and heading-based title extraction.

New Configuration Options:

  • HEADING_METADATA: Enable/disable heading metadata extraction
  • HEADING_METADATA_MAP: Map heading levels to metadata fields
  • HEADING_METADATA_PATTERNS: Custom regex patterns for extraction
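
As a rough illustration, the three settings might appear in `pelicanconf.py` as shown below. The PR text names the settings but does not show their value formats, so the dict shapes, the level-to-field mapping, and the regex are assumptions rather than the confirmed API.

```python
# pelicanconf.py (sketch only; value shapes are assumed, not taken from the PR)

# Turn heading-based metadata extraction on or off.
HEADING_METADATA = True

# Assumed shape: heading level (1 for "#", 2 for "##", ...) -> metadata field.
HEADING_METADATA_MAP = {
    1: "title",
    2: "subtitle",
    3: "summary",
}

# Assumed shape: metadata field -> regex whose first group captures the value.
HEADING_METADATA_PATTERNS = {
    "author": r"^###[ \t]+Author:[ \t]*(.+?)[ \t]*$",
}
```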

Key Features:

  • Extract metadata from markdown headings (# ## ### etc.)
  • Support custom patterns (e.g., "### Author: Name")
  • Integration with metadata processors
  • Regular metadata takes precedence over heading metadata
  • Unicode and special character support
  • Graceful error handling
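
The custom-pattern and precedence behaviour listed above can be sketched in a few lines. This is a standalone illustration, not the code from the PR: `PATTERNS`, `heading_pattern_metadata`, and `merge_metadata` are hypothetical names, and the pattern format is an assumption.

```python
import re

# Hypothetical pattern table: metadata field -> regex with one capture group.
PATTERNS = {
    "author": re.compile(r"^###[ \t]+Author:[ \t]*(.+?)[ \t]*$", re.MULTILINE),
}


def heading_pattern_metadata(text, patterns=PATTERNS):
    """Collect metadata from headings such as '### Author: Name'."""
    found = {}
    for field, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


def merge_metadata(regular, from_headings):
    """Combine both sources; regular metadata wins on conflicting keys."""
    merged = dict(from_headings)
    merged.update(regular)
    return merged


source = "Title: Explicit Title\n\n# Heading Title\n\n### Author: Matěj Cepl\n"
heading_meta = heading_pattern_metadata(source)
merged = merge_metadata({"title": "Explicit Title"},
                        {"title": "Heading Title", **heading_meta})
```

Here `merged["title"]` stays `"Explicit Title"` because the explicit `Title:` entry overrides the heading, while `author` is filled in from the heading pattern.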

Implementation Details:

  • Added _extract_heading_metadata() method to MarkdownReader
  • Modified read() method to extract headings before markdown conversion
  • Backward compatible - no existing functionality changed
  • Performance optimized with regex-based parsing
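
A minimal sketch of what regex-based heading extraction on the raw Markdown source could look like, run before any Markdown conversion. It is a free function written for illustration and does not reproduce the PR's `_extract_heading_metadata()` method; the `level_map` argument and the first-match-wins rule are assumptions.

```python
import re

# ATX headings only: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_heading_metadata(text, level_map):
    """Map the first heading of each configured level to a metadata field."""
    metadata = {}
    for match in HEADING_RE.finditer(text):
        level = len(match.group(1))
        field = level_map.get(level)
        # Assumption: only the first heading of a given level is used.
        if field and field not in metadata:
            metadata[field] = match.group(2)
    return metadata


text = (
    "Date: 2023-12-01\n"
    "Category: tech\n"
    "\n"
    "# My Article Title\n"
    "\n"
    "## A Subtitle\n"
    "\n"
    "# Second Top-Level Heading\n"
)
result = extract_heading_metadata(text, {1: "title", 2: "subtitle"})
```

A plain regex like this also matches `#` lines inside fenced code blocks, so a real implementation would need to account for that.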

Test Coverage:

  • Basic title/subtitle/summary extraction
  • Unicode character support (Czech, accented)
  • Metadata priority override behavior
  • Custom pattern matching
  • Edge cases and error handling
  • Performance benchmarking (936KB/s throughput)
  • Real-world scenario validation
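
For readers who want to reproduce a throughput figure of this kind, one possible measurement is sketched below. The PR does not describe its benchmark setup, so the sample document, the regex, and the helper name `measure_throughput` are all assumptions, and the resulting number depends on the machine.

```python
import re
import time

# Same kind of ATX-heading regex a heading scanner might use (assumed).
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def measure_throughput(text, repeats=5):
    """Return the best observed scan rate in KB/s over several runs."""
    size_kb = len(text.encode("utf-8")) / 1024
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _match in HEADING_RE.finditer(text):
            pass  # only the scan itself is timed
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a coarse timer.
    return size_kb / max(best, 1e-9)


# Synthetic document with Czech text, repeated to get a measurable size.
sample = "# Nadpis článku\n\nOdstavec s textem.\n\n## Podtitul\n\n" * 2000
kb_per_s = measure_throughput(sample)
```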

Resolves the issue where Markdown files like this one got no title:

    Date: 2023-12-01
    Category: tech

    # My Article Title

Such files can now have their title extracted from the heading, without a manual Title: metadata entry.

Files modified:

  • pelican/settings.py: Add new configuration options
  • pelican/readers.py: Implement heading extraction functionality

Pull Request Checklist

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code
  • Updated documentation for changed code

mcepl force-pushed the HEADING_METADATA branch 3 times, most recently from 756aa04 to aaf2a4a (December 16, 2025 23:14)

Previously, Markdown files without explicit 'Title:' metadata would fail
to get titles extracted from their first heading, unlike RST files which
support both metadata and heading-based title extraction.

New Configuration Options:
- HEADING_METADATA: Enable/disable heading metadata extraction
- HEADING_METADATA_MAP: Map heading levels to metadata fields
- HEADING_METADATA_PATTERNS: Custom regex patterns for extraction

Resolves the issue where Markdown files like this one got no title:

    Date: 2023-12-01
    Category: tech

    # My Article Title

Such files can now have their title extracted without a manual Title:
metadata entry.

Also fixes the issue where Markdown articles showed duplicate titles
when HEADING_METADATA=True was set in the configuration.

References: getpelican#3290
Signed-off-by: Matěj Cepl <[email protected]>