All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Phase A-1: Image embedding pipeline. File path and data URI images are read, base64-decoded or loaded from disk, and embedded as `BinData/` entries in the HWPX ZIP; `binaryItemIDRef` references are emitted on `<hp:pic>` elements; HTTP/HTTPS URLs are preserved as external references.
- Phase A-2: List writer with bullet/numbering. `<hh:numbering>` definitions (BULLET id=1, DIGIT id=2) emitted in header.xml; `numPrIDRef` set on list-item paragraphs; nested list support via `write_list_items()` with depth-based `paraPrIDRef` (id=2 depth-0, id=3 depth-1+); 23 new list tests.
- Phase A-3: BlockQuote visual indentation. `paraPr id=1` with left margin 800 HWP units (~20mm); `quote_depth` parameter threaded through `write_block()`; 6 new blockquote tests.
- Phase A-4: Footnote OWPML structure. `<hp:fn noteId>` wrapping around footnote content blocks; 7 new footnote tests.
- Phase B-1: HWPX nested list reader. `StagedBlock` enum for list-paragraph grouping; `group_list_paragraphs` collapses flat sequences into nested `Block::List`; `paraPrIDRef`/`numPrIDRef` parsing from `<hp:p>`; 13 new reader list tests.
- Phase B-3: HWPX lenient XML error recovery. Malformed section XML parsing continues with partial results; missing section files skipped with warning; missing attributes use defaults.
- Phase C-3: Code block language preservation. Language hint stored as `<!-- hwp2md:lang:X -->` XML comment in HWPX; reader parses it back; MD→HWPX→MD roundtrip preserves language info; `-->` injection sanitized.
- Phase B-2: HWP binary list recognition. Two-tier detection: `numbering_id` from ParaShape records, then text-prefix heuristics (`●■▶•-*` bullets; `1.`/`2)`/`a.` ordered); year-like prefix rejection; `StagedBlock` grouping; 32 new HWP list tests.
- Phase B-4: Page layout parsing. `PageLayout` IR struct (width, height, landscape, margins); HWPX reader parses `<hp:secPr>` → `<hp:pagePr>` → `<hp:pageSize>`/`<hp:margin>`; writer emits `<hp:secPr>` with A4 portrait defaults; 11 new page layout tests.
- Phase C-1: Heading paraPr (id=4) with 180% line spacing; `ParaPrConfig` struct replacing bare `(id, left_margin)` parameters; 14 new paraPr tests.
- Phase D-1: Comprehensive roundtrip integration tests. 37 tests covering all block types (paragraph, H1-H6, ordered/unordered/nested list, table, code block, horizontal rule, blockquote, image, footnote, inline formatting, combined document).
- `paraPr` table expanded from 2 to 5 entries (normal, blockquote, list-depth-0, list-depth-1+, heading).
- `writer_tests_roundtrip.rs` split: golden tests → `writer_tests_golden.rs`, code language tests → `writer_tests_code_lang.rs`.
- `ir::Section` now carries `page_layout: Option<PageLayout>`.
- Headings use `paraPrIDRef="4"` (180% line spacing) outside blockquotes.
- Image filename collision: counter suffix dedup (`photo_2.png`) instead of silent drop; `unique_entry_name` bounded to 10,000 iterations.
- XML comment injection: code language `-->` sanitized via `--` collapse.
- `flush_paragraph` marked `#[cfg(test)]` instead of `#[allow(dead_code)]`.
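
The two sanitisation fixes above can be illustrated with a minimal sketch. The signatures here are assumptions made for illustration (the changelog names `unique_entry_name` but does not give its parameters), `collapse_double_hyphens` is a hypothetical helper name, and the counter is assumed to start at 2 because the example filename is `photo_2.png`.

```rust
use std::collections::HashSet;

/// Upper bound on suffix attempts, mirroring the 10,000-iteration cap.
const MAX_ATTEMPTS: u32 = 10_000;

/// Returns `name` if it is unused, otherwise `stem_N.ext` for the first
/// free N starting at 2. Returns `None` when no free name is found within
/// `MAX_ATTEMPTS` candidates. (Hypothetical signature.)
fn unique_entry_name(used: &HashSet<String>, name: &str) -> Option<String> {
    if !used.contains(name) {
        return Some(name.to_string());
    }
    // Split "photo.png" into ("photo", ".png"); no dot means no extension.
    let (stem, ext) = match name.rfind('.') {
        Some(idx) => name.split_at(idx),
        None => (name, ""),
    };
    (2..MAX_ATTEMPTS + 2)
        .map(|n| format!("{stem}_{n}{ext}"))
        .find(|candidate| !used.contains(candidate))
}

/// Collapses every run of consecutive hyphens into a single `-`, so the
/// result never contains `--` and therefore never contains `-->`.
/// (Hypothetical helper; the real sanitiser may differ.)
fn collapse_double_hyphens(lang: &str) -> String {
    let mut out = String::with_capacity(lang.len());
    let mut prev_hyphen = false;
    for ch in lang.chars() {
        if ch == '-' && prev_hyphen {
            continue; // drop the repeated hyphen
        }
        prev_hyphen = ch == '-';
        out.push(ch);
    }
    out
}
```

Returning `Option` rather than looping forever keeps a pathological archive (thousands of identically named images) from stalling the writer; the caller decides what to do when the bound is hit.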
- Phase 16: Golden file test (`golden_comprehensive_document_structure`): validates internal ZIP XML structure (section0.xml, header.xml, content.hpf, mimetype) of generated HWPX archives; OWPML schema validation re-verified with polaris DVC after inline charPr changes.
- Phase 15: `faceNameIDRef` attribute emitted on section-level inline `<hp:charPr>` when the inline carries a `font_name`, completing the font name write-to-read roundtrip; 4 new font name roundtrip tests.
- Phase 14: Section-level inline `<hp:charPr>` emission in `write_inline_charpr()`: bold, italic, underline, strikeout, superscript, subscript, and color attributes are now written inside `<hp:run>` elements for OWPML conformance; 16 new bold/italic/underline/strike/color roundtrip tests; 3 new `xml_escape_content` tests (apostrophe, all special chars, passthrough).
- Phase 13: Font name reader: `parse_face_names()` from header.xml `<hh:fontface>` entries; `faceNameIDRef`/`hangulIDRef` resolution in `apply_charpr_attrs`; `with_font_name()` builder method on `Inline`; 9 new font resolution tests; README updated with Phase 9-12 features (hyperlinks, ruby, footnote_ref, inline code, metadata).
- Phase 12: `writer_tests.rs` split into 8 topic-based test modules (`writer_tests_charpr`, `writer_tests_section`, `writer_tests_metadata`, `writer_tests_hyperlink`, `writer_tests_ruby`, `writer_tests_footnote`, `writer_tests_roundtrip`); ruby + hyperlink combination test added.
- `xml_escape_content` now covers the complete set of XML 1.0 predefined entities (`&`, `<`, `>`, `"`, `'`).
- `..Default::default()` struct update syntax eliminated across the entire codebase (`md/parser.rs`, `hwp/convert.rs`); all fields are now set explicitly.
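
For the `xml_escape_content` entry listed above, the following is a minimal sketch of escaping all five XML 1.0 predefined entities. The function name `xml_escape_sketch` is hypothetical and the crate's actual implementation may be structured differently.

```rust
/// Escapes the five XML 1.0 predefined entities in `input`.
/// Illustrative stand-in only; not the crate's own function.
fn xml_escape_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            other => out.push(other),
        }
    }
    out
}
```

Escaping quotes and apostrophes is not strictly required inside element content, but doing so means the same routine stays safe if its output is ever placed in an attribute value.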
- `trim_start_matches('#')` replaced with `strip_prefix('#')` in color attribute emission to prevent stripping multiple `#` characters.
- Font name propagation in flush paths (`flush_paragraph`, `flush_cell_paragraph`, `flush_list_item_paragraph`, `flush_footnote_paragraph`) now chains `.with_font_name()`.
- Replaced `.unwrap()` with `if let` pattern in `write_inlines` to prevent potential panic on link URL access.
- Empty text inline run guard in `writer_section.rs`: zero-length text runs are skipped, preventing emission of empty `<hp:t/>` elements.
- Dead code removed: `InlineStyle.code` field (superseded by the `CharPrKey` path introduced in Phase 8).
- Broken intra-doc link in `md/mod.rs` corrected.
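
The difference behind the `strip_prefix('#')` fix above is easy to miss, so here is a small sketch contrasting the two standard-library methods. The wrapper function names are hypothetical; only `str::strip_prefix` and `str::trim_start_matches` come from the changelog entry.

```rust
/// Removes at most one leading '#', leaving any further '#' intact.
fn strip_single_hash(color: &str) -> &str {
    color.strip_prefix('#').unwrap_or(color)
}

/// Removes every leading '#'; this is the behaviour the fix moves away from.
fn strip_all_hashes(color: &str) -> &str {
    color.trim_start_matches('#')
}
```

With a malformed input such as `##FF0000`, the first function yields `#FF0000` (still visibly malformed), while the second silently yields `FF0000`.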
- Phase 11: Crate-level `//!` documentation on `lib.rs`; `///` doc comments on all public types in `ir.rs`, `error.rs`, and all public functions in `convert.rs`; `#![warn(missing_docs)]` lint enabled; `xml_escape_content` extended with `"` → `&quot;` escaping for defense-in-depth.
- Phase 10: Ruby annotation writer (`hp:ruby`/`baseText`/`rubyText`); `footnote_ref` writer emitting `hp:noteRef`; `Inline` builder pattern with `with_formatting`/`with_link`/`with_ruby` constructors; ruby formatting propagation fix ensuring annotation runs inherit the base run's charPr.
- Phase 9: Superscript/subscript writer using the `supscript` charPr attribute; hyperlink reader and writer using `fieldBegin`/`fieldEnd` controls; writer module split into three focused submodules; reader run-start reset fix preventing stale formatting from leaking across paragraphs.
- Phase 3: charPr / paraPr / fontface reference tables in `header.xml` with IDRef linking between section paragraph runs and the header table entries.
- Phase 4: Style table (`hh:styles`) with Normal + Heading1-6, numeric `styleIDRef` and `charPrIDRef` values replacing string-form references for OWPML schema compliance.
- Phase 5: Sequential paragraph IDs (`id` attribute on `<hp:p>`), table block wrapping in `<hp:p>`/`<hp:run>`/`<hp:tbl>` hierarchy, heading-specific charPr entries with level-differentiated font heights.
- Phase 6: OWPML schema validation pass (`enable_schema=true`) with polaris DVC; fixed `breakSetting` attribute set, `align` horizontal/vertical attrs, `margin` child elements, `heading`/`border`/`autoSpacing`/`lineSpacing` required children in `paraPr`.
- Phase 7: `hh:borderFills` table with default entry (id=1), `slash`/`backSlash`/border/diagonal children; `borderFillIDRef` on every `charPr` entry; polaris_dvc rev pinning; schema validation expansion covering all writer-emitted elements.
- Phase 8: Inline code (`code: true`) mapped to distinct charPr entry with Courier New monospace font; metadata preservation (`hp:docInfo` with title and author in `content.hpf`); HWPX structural roundtrip tests; dead code audit; version bump to 0.3.0.
- `CharPrKey` struct gains a `code` field; `from_inline()` now forces `Courier New` font for inline code spans, producing a distinct charPr ID.
- `generate_content_hpf()` emits `<hp:docInfo>` when the document carries title or author metadata.
- README library usage example updated to `hwp2md = "0.3"`.
- Font registration for inline code spans: the monospace font is now registered from the resolved `CharPrKey` (which overrides to `Courier New` for code) rather than from the raw IR inline's `font_name` field, ensuring the font table entry always exists when referenced by a charPr.
- Ruby text control parsing (`ruby` ctrl_id): base text and phonetic annotation are extracted and emitted as `<ruby>` HTML in Markdown output.
- Lenient CFB fallback: corrupted or partially written HWP files are now partially recovered instead of returning a hard error. Successfully read sections are returned with a warning.
- Distributed-document decryption: HWP files with the `distributed` flag set are now decrypted with AES-128 ECB before parsing.
- EQEDIT → LaTeX converter: `HWPTAG_EQEDIT` blocks are converted to fenced `$$` display-math blocks in Markdown output.
- Image embedding: `gso` (GShapeObject) controls are resolved to their `BinData` streams and written as inline Markdown images.
- Footnote and endnote parsing: `fn` and `en` controls are extracted and rendered as Markdown footnote references with collected definitions.
- Hyperlink extraction: `hyln` controls produce `[text](url)` Markdown links with URL sanitisation (scheme allow-list: `http`, `https`, `mailto`).
- Superscript and subscript character shapes are now mapped to `<sup>`/`<sub>`.
- Heading type detection from `HWPTAG_PARA_SHAPE` outlineLv attribute.
- HWPX colspan and rowspan extraction for table cells.
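
The scheme allow-list mentioned in the hyperlink entry above could look roughly like the sketch below. The function name, the case-insensitive comparison, and the rejection of URLs with an empty remainder are all assumptions for illustration; only the three allowed schemes come from the changelog.

```rust
/// Schemes permitted in emitted Markdown links (from the changelog entry).
const ALLOWED_SCHEMES: [&str; 3] = ["http", "https", "mailto"];

/// Returns true when `url` starts with an allow-listed scheme followed by
/// a non-empty remainder. Scheme matching is ASCII case-insensitive
/// (an assumption of this sketch).
fn is_allowed_url(url: &str) -> bool {
    match url.trim().split_once(':') {
        Some((scheme, rest)) => {
            !rest.is_empty()
                && ALLOWED_SCHEMES
                    .iter()
                    .any(|&allowed| scheme.eq_ignore_ascii_case(allowed))
        }
        // No colon at all means no scheme, so the URL is rejected.
        None => false,
    }
}
```

An allow-list is the safer default here: schemes such as `javascript:` or `data:` are rejected without having to be enumerated.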
- Table parser now enforces a row-count cap (4 096) to prevent allocation DoS.
- Decompression-bomb guard: deflate output is capped at 256 MB; exceeding this limit returns `Hwp2MdError::DecompressionBomb`.
- CFB stream reads are capped at 256 MB (`MAX_CFB_STREAM`).
- GitHub Actions CI workflow (`cargo test`, `cargo clippy -- -D warnings`).
- 614 unit and integration tests (82%+ line coverage).
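
A size cap like the decompression-bomb guard above can be sketched with only the standard library. This is an illustration under stated assumptions: `read_capped` is a hypothetical name, it works on any `Read` source rather than a real deflate decoder, and it returns a plain `io::Error` where the crate returns `Hwp2MdError::DecompressionBomb`.

```rust
use std::io::{self, Read};

/// 256 MB, matching the cap stated in the changelog.
#[allow(dead_code)]
const MAX_OUTPUT: u64 = 256 * 1024 * 1024;

/// Reads `reader` to the end, failing if more than `cap` bytes come out.
fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the cap so an oversized stream is detectable
    // without ever buffering more than cap + 1 bytes.
    reader.take(cap.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > cap {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompressed output exceeds cap",
        ));
    }
    Ok(buf)
}
```

Wrapping the decoder in `take` bounds memory use regardless of what the compressed header claims about the output size.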
- 6 CRITICAL security issues resolved in the initial rewrite phase, including unbounded allocation on untrusted record sizes and integer overflow in dimension calculations.
- Infinite-loop in EQEDIT tokeniser on malformed input (recursion depth guard).
- Offset clamp in `read_utf16le_str` to prevent out-of-bounds reads.
- Heading level clamped to 1–6 (H7+ demoted to H6).
- URL deduplication: identical hyperlinks within the same paragraph are collapsed to a single link.
- `parse_records` extended-size field handling (0xFFF sentinel + follow-up u32).
- Zlib fallback path for deflate-wrapped CFB streams.
- Dead `CTRL_EQUATION`, `CTRL_HEADER`, `CTRL_FOOTER` constants removed.
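
The extended-size handling noted for `parse_records` above can be sketched as follows. The sketch assumes the usual HWP 5.0 record header layout (a little-endian 32-bit word holding a 10-bit tag ID, a 10-bit level, and a 12-bit size); `RecordHeader` and `parse_record_header` are hypothetical names, not the crate's API.

```rust
/// Decoded record header fields (hypothetical struct for illustration).
#[derive(Debug, PartialEq, Eq)]
struct RecordHeader {
    tag_id: u16,
    level: u16,
    size: u32,
}

/// A 12-bit size of 0xFFF means the real size follows as a separate u32.
const EXTENDED_SIZE: u32 = 0xFFF;

/// Reads a little-endian u32 at `offset`, or `None` if the buffer is short.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let b = buf.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses one record header and returns it with the number of header
/// bytes consumed (4 normally, 8 when the extended size is present).
fn parse_record_header(buf: &[u8]) -> Option<(RecordHeader, usize)> {
    let word = read_u32_le(buf, 0)?;
    let tag_id = (word & 0x3FF) as u16;
    let level = ((word >> 10) & 0x3FF) as u16;
    let mut size = word >> 20;
    let mut consumed = 4;
    if size == EXTENDED_SIZE {
        // Sentinel hit: the actual payload size is the next u32.
        size = read_u32_le(buf, 4)?;
        consumed = 8;
    }
    Some((RecordHeader { tag_id, level, size }, consumed))
}
```

Returning `None` on a truncated header, rather than indexing directly, keeps malformed input from causing a panic.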
- Replaced all HWP-specific third-party crate dependencies with a self-contained parser based on the `cfb` (OLE2) and `zip` (HWPX) crates.
- `control.rs` split into category modules (`table`, `image`, `hyperlink`, `ruby`, `dispatcher`, `common`) for maintainability.
- `reader.rs` split into focused submodules.
- `tracing::debug!` demoted to `tracing::trace!` for high-frequency paths (table dims, image shapes, unhandled ctrl_id, list-item hints).
- `MAX_CFB_STREAM` deduplicated: defined once as `pub(crate)` in `reader.rs` and imported in `summary.rs`.
- Error types converted to `thiserror`-derived enum (`Hwp2MdError`) with structured variants instead of stringly-typed errors.
- Markdown inline escaping hardened: characters that start a Markdown construct at line start are escaped.
- Initial project scaffolding with HWP 5.0 (CFB-based) and HWPX (ZIP/XML) parsing skeletons.
- Basic paragraph text extraction.
- CLI binary (`hwp2md`) with `--input`, `--output`, and `--style` flags.
- Bidirectional conversion: `HWP → Markdown` and `HWPX → Markdown`.
- Markdown → IR → Markdown roundtrip support.