Commit 6f54307

Add markdown regeneration functionality
1 parent 978a482 commit 6f54307

19 files changed

Lines changed: 381 additions & 35 deletions

README.md

Lines changed: 27 additions & 5 deletions
@@ -13,7 +13,7 @@ A parser for extracting headings and hierarchical structure from Markdown files.
 
 - Parse multiple heading formats (hash `#`, asterisk `**`, inline with colon, all-caps)
 - Build hierarchical structure from headings
-- **Fuzzy heading matching** to extract expected headings from improperly formatted documents
+- **Fuzzy heading matching** to extract expected headings from improperly formatted documents, even with typos or spelling variations
 - Process single documents or batches from DataFrames
 - Export results to DataFrame, JSON, or tree visualizations
 - Configurable parsing rules and word limits
@@ -48,6 +48,10 @@ parsed_result.to_json("output.json")
 # Print tree visualization
 print(parsed_result.to_tree())
 
+# Regenerate clean markdown from parsed structure
+regenerated_md = parsed_result.to_markdown()
+print(regenerated_md)
+
 # View results in a pandas DataFrame
 df_parsed = parsed_result.to_dataframe()
 print(df_parsed)
@@ -117,11 +121,11 @@ df_parsed.to_csv("parsed_data.csv")
 ```python
 import headhunter
 
-# Document where headings are embedded inline or lack proper formatting
+# Document where headings are embedded inline, lack proper formatting or have typos
 messy_doc = """
 This document has ## Heading 1 embedded in text without line breaks.
 Then we have **heading 2** in bold but inline.
-**Inline Heading:** with content on the same line.
+**Inline Haedign:** with content on the same line.
 """
 
 # Specify expected headings to extract via fuzzy matching
@@ -137,7 +141,7 @@ print(parsed.metadata) # includes: matched_count, expected_count, match_percent
 
 ## How Hierarchy is Built
 
-Headhunter recognizes different heading styles in Markdown and builds a hierarchical structure by assigning levels to each heading. The following rules govern this process:
+`headhunter` recognizes different heading styles in Markdown and builds a hierarchical structure by assigning levels to each heading. The following rules govern this process:
 
 ### Basic Principles
 
@@ -199,7 +203,7 @@ Different heading styles can be mixed in the same document. When switching from
 
 ## Fuzzy Heading Matching
 
-When documents have inconsistent formatting, such as headings embedded inline within text, missing markdown markers, or improper line breaks, headhunter can use fuzzy matching to extract expected headings.
+When documents have inconsistent formatting, such as headings embedded inline within text, missing markdown markers, or improper line breaks, `headhunter` can use fuzzy matching to extract expected headings.
 
 **How it works:**
 
@@ -217,3 +221,21 @@ Provide a list of `expected_headings` to `process_text()` or `process_batch_df()
 - 80-100: Strict matching, reduces false positives
 - 60-79: Moderate matching, allows more variation
 - Below 60: Lenient matching, may produce unexpected matches
+
+## Markdown Regeneration
+
+After parsing a document, `headhunter` can regenerate clean, standardized Markdown from the parsed structure. This is useful for:
+
+- **Cleaning up messy documents**: Convert inconsistent formatting into standard Markdown
+- **Standardizing format**: Ensure all documents use the same heading style
+- **Post-processing extracted headings**: Apply fuzzy matching to extract headings, then export the cleaned result
+
+### How Regeneration Works
+
+The `to_markdown()` method converts the parsed hierarchical structure back into Markdown:
+
+- **Standard headings**: Converted to hash format (`#`, `##`, `###`, etc.) based on hierarchical level
+- **Inline headings**: Preserved as bold format with colon (`**Heading:** content`)
+- **YAML front matter**: Metadata is included as YAML front matter at the top of the document
+- **Consistent spacing**: Single blank lines between sections for readability
+- **Case preservation**: Original text case is maintained (including ALL CAPS)
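
The regeneration rules listed in the README can be illustrated with a small standalone sketch. It does not use `headhunter` itself: the `Node` dataclass and the `render` function are hypothetical stand-ins for the parsed structure, assumed here only to show how a hierarchical level maps to a hash count and how inline headings keep their bold-colon form.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one parsed heading (not headhunter's real model)."""

    text: str
    level: int = 1
    inline: bool = False
    body: str = ""


def render(nodes: list[Node]) -> str:
    """Render nodes as Markdown following the rules listed in the README."""
    blocks: list[str] = []
    for node in nodes:
        if node.inline:
            # Inline headings keep the bold-with-colon form, content on the same line.
            blocks.append(f"**{node.text}:** {node.body}".rstrip())
        else:
            # Standard headings become hashes; Markdown has at most six levels.
            blocks.append(f"{'#' * min(node.level, 6)} {node.text}")
            if node.body:
                blocks.append(node.body)
    # A single blank line between blocks, one trailing newline.
    return "\n\n".join(blocks) + "\n"


doc = [
    Node("INTRODUCTION", level=1, body="Some text."),
    Node("Scope", level=2),
    Node("Owner", level=3, inline=True, body="Data team"),
]
print(render(doc))
```

Text case passes through untouched, so an all-caps heading such as `INTRODUCTION` stays all-caps in the output.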

src/headhunter/models.py

Lines changed: 26 additions & 0 deletions
@@ -287,6 +287,20 @@ def to_tree(self, show_line_numbers: bool = True, show_type: bool = True) -> str
             self.hierarchy, show_line_numbers, show_type, self.metadata
         )
 
+    def to_markdown(self) -> str:
+        """Regenerates clean Markdown from the parsed structure.
+
+        Converts the hierarchical structure back into properly formatted Markdown,
+        using hash (#) syntax for standard headings and bold (**) format for inline
+        colon headings. Includes YAML front matter if metadata exists.
+
+        Returns:
+            Regenerated Markdown string.
+        """
+        from headhunter import regenerate
+
+        return regenerate.to_markdown(self.hierarchy, self.metadata)
+
     def to_dataframe(self) -> pd.DataFrame:
         """Converts the document to a pandas DataFrame.
 
@@ -451,6 +465,18 @@ def to_tree(
             self.documents, output_dir, show_line_numbers, show_type
         )
 
+    def to_markdown(self) -> dict[str, str]:
+        """Regenerates Markdown for all documents in the batch.
+
+        Returns:
+            Dictionary mapping document IDs to their regenerated Markdown strings.
+        """
+        result: dict[str, str] = {}
+        for doc in self.documents:
+            doc_id = str(doc.metadata["id"])
+            result[doc_id] = doc.to_markdown()
+        return result
+
     def to_dataframe(self) -> pd.DataFrame:
         """Combines all documents into a single pandas DataFrame.
 
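
The batch-level `to_markdown()` returns a plain dictionary keyed by document ID, so persisting the results is left to the caller. Below is a minimal sketch of one way to do that. `write_markdown_files` is a hypothetical helper, not part of this commit, and the `regenerated` dictionary stands in for the real return value so the example runs without `headhunter` installed.

```python
import pathlib
import tempfile


def write_markdown_files(docs: dict[str, str], out_dir: pathlib.Path) -> list[pathlib.Path]:
    """Write each regenerated document to <out_dir>/<doc_id>.md (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc_id, markdown in docs.items():
        path = out_dir / f"{doc_id}.md"
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written


# Stand-in for the dictionary returned by the batch-level to_markdown().
regenerated = {"doc1": "# Document 1\n", "doc2": "# Document 2\n"}
with tempfile.TemporaryDirectory() as tmp:
    paths = write_markdown_files(regenerated, pathlib.Path(tmp))
    names = sorted(p.name for p in paths)
    print(names)
```

Note that the method in the diff reads `doc.metadata["id"]` directly, so it appears to assume every document in the batch carries an `id` entry in its metadata.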

src/headhunter/regenerate.py

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+"""Markdown regeneration from parsed hierarchical structures."""
+
+from headhunter.models import HierarchyContext
+
+
+def to_markdown(
+    hierarchy: list[HierarchyContext],
+    metadata: dict[str, object] | None = None,
+) -> str:
+    """Regenerates Markdown from parsed hierarchical structure.
+
+    This function converts a parsed document structure back into clean, properly
+    formatted Markdown. It processes the hierarchy linearly, converting headings
+    and content blocks according to the following rules:
+
+    - YAML front matter is generated from metadata if provided
+    - Standard headings use hash (#) format based on hierarchical level
+    - Inline headings use bold (**text:**) format
+    - Inline headings are merged with immediate child content on the same line
+    - Content blocks are preserved as-is with single blank line spacing
+    - Original text case is preserved (including ALL CAPS)
+
+    Args:
+        hierarchy: List of HierarchyContext objects representing the document structure.
+        metadata: Optional metadata dictionary to include as YAML front matter.
+
+    Returns:
+        Regenerated Markdown string with YAML front matter (if metadata provided)
+        and properly formatted headings and content.
+    """
+    lines: list[str] = []
+
+    if metadata:
+        lines.append("---")
+        for key, value in metadata.items():
+            lines.append(f"{key}: {value}")
+        lines.append("---")
+        lines.append("")
+
+    i = 0
+    while i < len(hierarchy):
+        ctx = hierarchy[i]
+        token = ctx.token
+
+        if token.type == "heading":
+            is_inline = token.metadata and token.metadata.is_inline
+            has_next = i + 1 < len(hierarchy)
+            next_is_content = (
+                has_next
+                and hierarchy[i + 1].token.type == "content"
+                and hierarchy[i + 1].level == ctx.level + 1
+            )
+
+            if is_inline and has_next and next_is_content:
+                content = hierarchy[i + 1].token.content
+                lines.append(f"**{token.content}:** {content}")
+                lines.append("")
+                i += 2
+            elif is_inline:
+                lines.append(f"**{token.content}:**")
+                lines.append("")
+                i += 1
+            else:
+                hash_count = min(ctx.level, 6)
+                hashes = "#" * hash_count
+                lines.append(f"{hashes} {token.content}")
+                lines.append("")
+                i += 1
+
+        else: # token.type == "content"
+            lines.append(token.content)
+            lines.append("")
+            i += 1
+
+    return "\n".join(lines).rstrip() + "\n"
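
The front matter loop in `regenerate.py` writes each value with plain string formatting, so a value containing `: ` or `#` would presumably not round-trip through a YAML parser. One possible hardening, sketched here under the assumption that metadata values are simple scalars, is to JSON-encode each value, since JSON scalars are also valid YAML. `front_matter` is a hypothetical helper and not part of this commit.

```python
import json


def front_matter(metadata: dict[str, object]) -> list[str]:
    """Build YAML front matter lines, JSON-encoding each value (hypothetical helper)."""
    if not metadata:
        return []
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps quotes strings and escapes special characters;
        # default=str falls back to str() for types JSON cannot encode.
        lines.append(f"{key}: {json.dumps(value, default=str)}")
    lines.extend(["---", ""])
    return lines


print("\n".join(front_matter({"id": "doc1", "title": "Notes: part 1", "priority": 1})))
```

The trade-off is that every string value is emitted in double quotes, which would change the expected fixture files added in this commit (for example `id: doc1` would become `id: "doc1"`). Keys are still written unquoted in this sketch.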

tests/conftest.py

Lines changed: 85 additions & 30 deletions
@@ -7,43 +7,89 @@
 import pandas as pd
 import pytest
 
+# Sample data fixtures for tests
+
 
 @pytest.fixture
 def sample_mixed_markdown() -> str:
     """Sample markdown text with mixed heading styles for testing."""
-    return (pathlib.Path(__file__).parent / "fixtures" / "mixed.md").read_text()
+    return (
+        pathlib.Path(__file__).parent / "fixtures" / "sample_data" / "mixed.md"
+    ).read_text()
+
+
+@pytest.fixture
+def sample_dataframe() -> pd.DataFrame:
+    """Sample DataFrame with markdown content for batch processing tests."""
+    return pd.read_csv(
+        pathlib.Path(__file__).parent / "fixtures" / "sample_data" / "sample_data.csv"
+    )
+
+
+@pytest.fixture
+def sample_match_markdown() -> str:
+    """Sample markdown text for matcher testing."""
+    return (
+        pathlib.Path(__file__).parent / "fixtures" / "sample_data" / "match.md"
+    ).read_text()
+
+
+@pytest.fixture
+def sample_dataframe_match() -> pd.DataFrame:
+    """Sample DataFrame with markdown content for batch processing with matcher."""
+    return pd.read_csv(
+        pathlib.Path(__file__).parent
+        / "fixtures"
+        / "sample_data"
+        / "sample_data_match.csv"
+    )
+
+
+# Expected output fixtures for tests
+## JSON outputs
 
 
 @pytest.fixture
 def sample_mixed_json() -> dict:
     """Expected JSON output for mixed markdown fixture."""
-    with open(pathlib.Path(__file__).parent / "fixtures" / "mixed.json") as f:
+    with open(
+        pathlib.Path(__file__).parent / "fixtures" / "expected_json" / "mixed.json"
+    ) as f:
         return json.load(f)
 
 
 @pytest.fixture
-def sample_match_markdown() -> str:
-    """Sample markdown text for matcher testing."""
-    return (pathlib.Path(__file__).parent / "fixtures" / "match.md").read_text()
+def expected_json_files() -> dict[str, dict]:
+    """Expected JSON output files for batch processing tests."""
+    json_dir = pathlib.Path(__file__).parent / "fixtures" / "expected_json"
+    result = {}
+    for json_file in sorted(json_dir.glob("doc*.json")):
+        with open(json_file) as f:
+            result[json_file.name] = json.load(f)
+    return result
 
 
 @pytest.fixture
 def sample_match_json() -> dict:
     """Expected JSON output for match markdown fixture."""
-    with open(pathlib.Path(__file__).parent / "fixtures" / "match.json") as f:
+    with open(
+        pathlib.Path(__file__).parent / "fixtures" / "expected_json" / "match.json"
+    ) as f:
         return json.load(f)
 
 
-@pytest.fixture
-def sample_dataframe() -> pd.DataFrame:
-    """Sample DataFrame with markdown content for batch processing tests."""
-    return pd.read_csv(pathlib.Path(__file__).parent / "fixtures" / "sample_data.csv")
+## CSV outputs
 
 
 @pytest.fixture
 def sample_dataframe_parsed() -> pd.DataFrame:
     """Expected parsed output for sample_dataframe."""
-    path = pathlib.Path(__file__).parent / "fixtures" / "sample_data_parsed.csv"
+    path = (
+        pathlib.Path(__file__).parent
+        / "fixtures"
+        / "expected_csv"
+        / "sample_data_parsed.csv"
+    )
     df = pd.read_csv(path)
 
     # Convert string representations of lists back to actual lists
@@ -53,18 +99,15 @@ def sample_dataframe_parsed() -> pd.DataFrame:
     return df
 
 
-@pytest.fixture
-def sample_dataframe_match() -> pd.DataFrame:
-    """Sample DataFrame with markdown content for batch processing with matcher."""
-    return pd.read_csv(
-        pathlib.Path(__file__).parent / "fixtures" / "sample_data_match.csv"
-    )
-
-
 @pytest.fixture
 def sample_dataframe_match_parsed() -> pd.DataFrame:
     """Expected parsed output for sample_dataframe_match with matcher."""
-    path = pathlib.Path(__file__).parent / "fixtures" / "sample_data_match_parsed.csv"
+    path = (
+        pathlib.Path(__file__).parent
+        / "fixtures"
+        / "expected_csv"
+        / "sample_data_match_parsed.csv"
+    )
     df = pd.read_csv(path)
 
     # Convert string representations of lists back to actual lists
@@ -76,22 +119,34 @@ def sample_dataframe_match_parsed() -> pd.DataFrame:
     return df
 
 
-@pytest.fixture
-def expected_json_files() -> dict[str, dict]:
-    """Expected JSON output files for batch processing tests."""
-    json_dir = pathlib.Path(__file__).parent / "fixtures" / "expected_json"
-    result = {}
-    for json_file in sorted(json_dir.glob("*.json")):
-        with open(json_file) as f:
-            result[json_file.name] = json.load(f)
-    return result
+## Tree outputs
 
 
 @pytest.fixture
 def expected_tree_files() -> dict[str, str]:
     """Expected tree output files for batch processing tests."""
     tree_dir = pathlib.Path(__file__).parent / "fixtures" / "expected_tree"
     result = {}
-    for tree_file in sorted(tree_dir.glob("*.txt")):
+    for tree_file in sorted(tree_dir.glob("doc*.txt")):
         result[tree_file.name] = tree_file.read_text()
     return result
+
+
+## Markdown outputs
+
+
+@pytest.fixture
+def expected_markdown_files() -> dict[str, str]:
+    """Expected markdown output files for batch processing tests."""
+    md_dir = pathlib.Path(__file__).parent / "fixtures" / "expected_markdown"
+    result = {}
+    for md_file in sorted(md_dir.glob("doc*.md")):
+        result[md_file.stem] = md_file.read_text()
+    return result
+
+
+@pytest.fixture
+def expected_markdown_match() -> str:
+    """Expected markdown output for match.md fixture with matcher."""
+    path = pathlib.Path(__file__).parent / "fixtures" / "expected_markdown" / "match.md"
+    return path.read_text()
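
Two details of the new fixtures seem worth spelling out. The narrowed `doc*` glob keeps single-document files such as `match.md` out of the batch expectations now that they share a directory, and keying by `md_file.stem` yields the same keys as the document IDs used by the batch-level `to_markdown()`. The sketch below reproduces that pattern in isolation. `load_expected` is a hypothetical helper, and the `actual` dictionary is a stand-in for real `headhunter` output.

```python
import pathlib
import tempfile


def load_expected(directory: pathlib.Path, pattern: str = "doc*.md") -> dict[str, str]:
    """Map file stem to file text, mirroring the expected_markdown_files fixture."""
    return {p.stem: p.read_text() for p in sorted(directory.glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "doc1.md").write_text("# Document 1\n")
    (root / "match.md").write_text("# Match\n")  # excluded by the doc* pattern
    expected = load_expected(root)

# Stand-in for what the batch-level to_markdown() would return.
actual = {"doc1": "# Document 1\n"}
print(actual == expected)
```

Because both dictionaries use bare document IDs as keys, a test can compare them directly without stripping file extensions.
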
File renamed without changes.
File renamed without changes.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+---
+row_index: 0
+id: doc1
+category: A
+priority: 1
+---
+
+# Document 1
+
+This is the first document with some content.
+
+## Section 1.1
+
+More details here.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+---
+row_index: 1
+id: doc2
+category: B
+priority: 2
+---
+
+# Document 2
+
+## Overview
+
+Second document overview.
+
+### Details
+
+Nested content.