Commit 7dffdd2

Improve documentation and refactor output handling
- Added detailed documentation on how hierarchy is built in README.md.
- Renamed writer module to output.
- Refactored ParserText.to_dataframe method to output pandas DataFrame instead of list of dicts.
- Updated models.py to use the new output functions.
- Created test fixtures for mixed markdown and sample data for testing.
- Added tests for API entry points to validate processing of markdown and DataFrame.
1 parent 6c6f050 commit 7dffdd2

9 files changed

Lines changed: 408 additions & 96 deletions

README.md

Lines changed: 62 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -110,3 +110,65 @@ parsed_batch.to_tree("tree_outputs/")
 # Export parsed data to a CSV file
 df_parsed.to_csv("parsed_data.csv")
 ```
+
+## How Hierarchy is Built
+
+Headhunter recognizes different heading styles in Markdown and builds a hierarchical structure by assigning levels to each heading. The following rules govern this process:
+
+### Basic Principles
+
+- **Headings create structure**: Each heading creates a new section in the document's outline
+- **Content follows headings**: Regular text is always nested under its nearest heading above
+- **First heading starts at level 1**: The first heading in a document becomes the top level
+
+### Rules for Different Heading Types
+
+#### Hash Headings (`#`, `##`, `###`)
+
+These work as expected in standard Markdown:
+
+- More `#` symbols = deeper in the hierarchy
+- `# Title` → level 1
+- `## Subtitle` → level 2
+- `### Sub-subtitle` → level 3
+
+The level increases or decreases based on how many more or fewer `#` symbols are present compared to the previous hash heading.
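The relative rule for hash headings described above can be sketched in Python. This is a minimal illustration, not Headhunter's actual implementation: the names `HASH_RE` and `hash_levels` are hypothetical, and clamping the result at level 1 is an assumption made for the example.

```python
import re

# Hypothetical pattern: one to six '#' characters, whitespace, then the title.
HASH_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def hash_levels(lines):
    """Assign a level to each hash heading, relative to the previous one."""
    results = []
    prev_hashes = None  # number of '#' on the previous hash heading
    prev_level = 0
    for line in lines:
        match = HASH_RE.match(line)
        if match is None:
            continue  # not a hash heading; regular content is skipped here
        hashes = len(match.group(1))
        if prev_hashes is None:
            level = 1  # the first heading starts at level 1
        else:
            # Move by the difference in '#' count (assumed clamp at level 1).
            level = max(1, prev_level + (hashes - prev_hashes))
        results.append((match.group(2).strip(), level))
        prev_hashes, prev_level = hashes, level
    return results


if __name__ == "__main__":
    for title, level in hash_levels(["## Intro", "text", "### Detail", "## Next"]):
        print(level, title)
```

Under these assumptions, a document that opens with `## Intro` still starts at level 1, because only the difference in `#` count between consecutive hash headings matters.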
+
+#### Bold and Italic Headings (`**text**`, `*text*`, `***text***`)
+
+These follow a specific hierarchy from highest to lowest:
+
+1. `**Bold text**` (2 asterisks) = highest level
+2. `***Bold and italic***` (3 asterisks) = middle level
+3. `*Italic text*` (1 asterisk) = lowest level
+
+When switching between these styles, the level adjusts by just one step up or down:
+
+- Going from bold (`**`) to italic (`*`) moves one level deeper
+- Going from italic (`*`) to bold (`**`) moves one level shallower
+- Using the same style consecutively keeps the same level
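The one-step adjustment between emphasis styles can be sketched as below. This is an illustrative sketch only: `RANK`, `emphasis_marker`, and `emphasis_levels` are hypothetical names, the starting level and the clamp at level 1 are assumptions, and the real parser may detect markers differently.

```python
# Rank of each marker: lower rank means higher in the hierarchy (from the list above).
RANK = {"**": 0, "***": 1, "*": 2}


def emphasis_marker(line):
    """Return the asterisk marker wrapping the whole line, or None."""
    stripped = line.strip()
    # Check the longest marker first so '***' is not mistaken for '**' or '*'.
    for marker in ("***", "**", "*"):
        if (
            stripped.startswith(marker)
            and stripped.endswith(marker)
            and len(stripped) > 2 * len(marker)
        ):
            return marker
    return None


def emphasis_levels(lines, start_level=1):
    """Assign levels to emphasis headings, moving one step per style change."""
    results = []
    level = start_level
    prev_rank = None
    for line in lines:
        marker = emphasis_marker(line)
        if marker is None:
            continue  # not an emphasis heading
        rank = RANK[marker]
        if prev_rank is not None:
            if rank > prev_rank:
                level += 1  # lower-ranked style: one level deeper
            elif rank < prev_rank:
                level = max(1, level - 1)  # higher-ranked style: one level shallower
        results.append((line.strip().strip("*"), level))
        prev_rank = rank
    return results
```

Note that going from `**` directly to `*` changes the level by one in this sketch, not two, which matches the "just one step" wording above.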
+
+#### ALL CAPS HEADINGS
+
+When a heading with hash (`#`) or asterisk (`**`) markers is written in ALL CAPITAL LETTERS, special rules apply:
+
+- The first ALL CAPS heading sets its level based on what came before it
+- Every subsequent ALL CAPS heading uses that same level (they are treated as peers)
+
+Examples:
+
+- `# ALL CAPS HEADING` - Valid heading (hash marker with ALL CAPS text)
+- `**ALL CAPS HEADING**` - Valid heading (asterisk marker with ALL CAPS text)
+- `ALL CAPS HEADING` - Not a heading (no marker, treated as content)
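The three examples above can be checked with a small detection sketch. It assumes that "ALL CAPS" means every cased character in the heading text is uppercase. The function name `is_all_caps_heading` is hypothetical, and Headhunter's own detection may differ.

```python
def is_all_caps_heading(line):
    """Return True only for a marked heading whose text is all uppercase."""
    stripped = line.strip()
    if stripped.startswith("#"):
        # Hash marker: drop the leading '#' characters to get the text.
        text = stripped.lstrip("#").strip()
    elif stripped.startswith("**") and stripped.endswith("**") and len(stripped) > 4:
        # Asterisk marker: drop the surrounding '**'.
        text = stripped[2:-2].strip()
    else:
        return False  # no marker, so the line is treated as content
    # str.isupper() requires at least one cased character, all uppercase.
    return text.isupper()
```

The sketch covers detection only. The peer rule (every later ALL CAPS heading reuses the level of the first) would additionally require remembering that first level while walking the document.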
+
+#### Inline Headings (with colons)
+
+When a heading ends with a colon (like `**Name:** Jane Doe`), it works differently:
+
+- The heading itself goes one level deeper than the previous heading
+- The content immediately after it is always treated as the deepest level
+- After that content, we return to the normal hierarchy
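Splitting an inline heading such as `**Name:** Jane Doe` into its label and trailing content could look roughly like this. The pattern `INLINE_RE` and the helper `split_inline_heading` are hypothetical names for illustration. The sketch handles only the bold form shown in the example above and does not compute levels.

```python
import re

# Hypothetical pattern: bold text ending in a colon, then optional inline content.
INLINE_RE = re.compile(r"^\*\*(?P<label>[^*]+?):\*\*\s*(?P<content>.*)$")


def split_inline_heading(line):
    """Return (label, content) for an inline heading, or None if no match."""
    match = INLINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("label"), match.group("content")


if __name__ == "__main__":
    print(split_inline_heading("**Name:** Jane Doe"))
```

In terms of the rules above, the label would become a heading one level below the previous heading, and the content would be attached beneath it before the normal hierarchy resumes.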
+
+### Mixed Heading Styles
+
+Different heading styles can be mixed in the same document. When switching from one style to another, the new heading typically goes one level deeper than the previous one. However, the specific rules for each style (described above) still apply.

pyproject.toml

Lines changed: 3 additions & 3 deletions
@@ -10,8 +10,7 @@ license = "LGPL-2.1-only"
 readme = "README.md"
 requires-python = ">=3.12"
 dependencies = [
-    "pandas>=2.3.3",
-    "pyarrow>=22.0.0"
+    "pandas>=2.3.3"
 ]
 
 [dependency-groups]
@@ -26,7 +25,8 @@ dev = [
 docs = ["pdoc>=15.0.0"]
 notebooks = [
     "jupyter>=1.1.1",
-    "ipykernel>=6.29.5"
+    "ipykernel>=6.29.5",
+    "pyarrow>=22.0.0"
 ]
 
 [tool.pytest.ini_options]

src/headhunter/models.py

Lines changed: 18 additions & 20 deletions
@@ -137,9 +137,9 @@ def to_dict(self) -> dict[str, object]:
         Returns:
             A nested dictionary representation of the document structure.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        return writer.to_dict(self.hierarchy, self.metadata)
+        return output.to_dict(self.hierarchy, self.metadata)
 
     def to_json(self, filepath: str, indent: int = 2) -> str:
         """Exports the document to a JSON file.
@@ -151,9 +151,9 @@ def to_json(self, filepath: str, indent: int = 2) -> str:
         Returns:
             Path to the created file.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        return writer.to_json_file(self.hierarchy, filepath, self.metadata, indent)
+        return output.to_json_file(self.hierarchy, filepath, self.metadata, indent)
 
     def to_tree(self, show_line_numbers: bool = True, show_type: bool = True) -> str:
         """Generates an ASCII tree visualization of the document structure.
@@ -165,25 +165,23 @@ def to_tree(self, show_line_numbers: bool = True, show_type: bool = True) -> str
         Returns:
             ASCII tree representation as a string.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        # Build metadata heading from document metadata
-        metadata_heading = dict(self.metadata) if self.metadata else None
-        return writer.to_tree_string(
-            self.hierarchy, show_line_numbers, show_type, metadata_heading
+        return output.to_tree_string(
+            self.hierarchy, show_line_numbers, show_type, self.metadata
         )
 
-    def to_dataframe(self) -> list[dict[str, object]]:
-        """Converts the document to row dictionaries.
+    def to_dataframe(self) -> pd.DataFrame:
+        """Converts the document to a pandas DataFrame.
 
         Returns:
-            List of dictionaries representing content rows with
+            DataFrame where each row is a content token with
             hierarchical context.
         """
-        from headhunter import writer
+        from headhunter import output
 
         doc_id = str(self.metadata["id"])
-        return writer.to_dataframe_rows(self.hierarchy, doc_id, self.metadata)
+        return output.to_dataframe(self.hierarchy, doc_id, self.metadata)
 
 
 @dataclasses.dataclass(frozen=True)
@@ -246,9 +244,9 @@ def to_json(self, output_dir: str, indent: int = 2) -> list[str]:
         Returns:
             List of created file paths.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        return writer.batch_to_json_files(self.documents, output_dir, indent)
+        return output.batch_to_json_files(self.documents, output_dir, indent)
 
     def to_tree(
         self, output_dir: str, show_line_numbers: bool = True, show_type: bool = True
@@ -263,9 +261,9 @@ def to_tree(
         Returns:
             List of created file paths.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        return writer.batch_to_tree_files(
+        return output.batch_to_tree_files(
             self.documents, output_dir, show_line_numbers, show_type
         )
 
@@ -276,9 +274,9 @@ def to_dataframe(self) -> pd.DataFrame:
             DataFrame with all content rows from all documents,
             including any metadata columns specified during batch processing.
         """
-        from headhunter import writer
+        from headhunter import output
 
-        return writer.batch_to_dataframe(self.documents, self.metadata_columns)
+        return output.batch_to_dataframe(self.documents, self.metadata_columns)
 
 
 class ParsingError(Exception):
