Commit 2d28c71

Copilot and benbalter committed
Add comprehensive performance improvements documentation
Co-authored-by: benbalter <282759+benbalter@users.noreply.github.com>
1 parent 6178c59 commit 2d28c71

1 file changed

Lines changed: 189 additions & 0 deletions

PERFORMANCE_IMPROVEMENTS.md

@@ -0,0 +1,189 @@
# Performance Improvements Summary

This document summarizes the performance optimizations made to the word-to-markdown gem.

## Overview

The optimizations focus on reducing redundant DOM traversals and improving CSS selector efficiency in the document conversion process.

## Key Improvements

### 1. Combined Styled Elements Processing (48% faster)

**Before:**
```ruby
def implicit_headings
  @implicit_headings ||= begin
    headings = []
    # First traversal of every element with a style attribute
    @document.tree.css('[style]').each do |element|
      headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
    end
    headings
  end
end

def font_sizes
  @font_sizes ||= begin
    sizes = []
    # Second traversal of the same elements
    @document.tree.css('[style]').each do |element|
      sizes.push element.font_size.round(-1) unless element.font_size.nil?
    end
    sizes.uniq.sort.extend(DescriptiveStatistics)
  end
end
```

**After:**
```ruby
def implicit_headings
  process_styled_elements unless @implicit_headings
  @implicit_headings
end

def font_sizes
  process_styled_elements unless @font_sizes
  @font_sizes
end

# Populates both @implicit_headings and @font_sizes in a single traversal
def process_styled_elements
  headings = []
  sizes = []

  @document.tree.css('[style]').each do |element|
    font_size = element.font_size
    unless font_size.nil?
      sizes.push font_size.round(-1)
      headings.push element if font_size >= MIN_HEADING_SIZE
    end
  end

  @implicit_headings = headings
  @font_sizes = sizes.uniq.sort.extend(DescriptiveStatistics)
end
```

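**Impact:**
- Reduces traversals of the `[style]` elements from two to one when both `implicit_headings` and `font_sizes` are called
- Reads `element.font_size` once per element instead of once per check
- The benchmark shows about 48% less run time (0.021s vs 0.041s), roughly a 1.9x speedup
- Especially beneficial for documents with many styled elements
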
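
The single-pass pattern can be tried outside the gem with a small stand-in. This is a minimal sketch, not the gem's code: the `StyledElementsSketch` class, its `traversals` counter, and the threshold value of 20 are hypothetical, and plain numbers stand in for elements returned by `@document.tree.css('[style]')`.

```ruby
# Hypothetical stand-in for the converter. Each entry in `font_sizes` plays
# the role of one styled element's font size (nil when no size is set).
class StyledElementsSketch
  MIN_HEADING_SIZE = 20 # assumed threshold, for illustration only

  attr_reader :traversals

  def initialize(font_sizes)
    @elements = font_sizes
    @traversals = 0
  end

  def implicit_headings
    process_styled_elements unless @implicit_headings
    @implicit_headings
  end

  def font_sizes
    process_styled_elements unless @font_sizes
    @font_sizes
  end

  private

  # One loop over the elements fills both caches.
  def process_styled_elements
    @traversals += 1
    headings = []
    sizes = []

    @elements.each do |font_size|
      next if font_size.nil?

      sizes.push font_size.round(-1)
      headings.push font_size if font_size >= MIN_HEADING_SIZE
    end

    @implicit_headings = headings
    @font_sizes = sizes.uniq.sort
  end
end

sketch = StyledElementsSketch.new([12, nil, 24, 12.4, 31])
sketch.font_sizes        # => [10, 20, 30]
sketch.implicit_headings # => [24, 31]
sketch.traversals        # => 1
```

Whichever accessor is called first pays for the traversal, and the other one then reads its cached value.
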
### 2. Memoized List Item Spans

**Before:**
```ruby
def remove_unicode_bullets_from_list_items!
  # Version check and DOM query repeated in every method
  path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
  @document.tree.search(path).each do |span|
    # ...
  end
end

def remove_numbering_from_list_items!
  path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
  @document.tree.search(path).each do |span|
    # ...
  end
end
```

**After:**
```ruby
def remove_unicode_bullets_from_list_items!
  list_item_spans.each do |span|
    # ...
  end
end

def remove_numbering_from_list_items!
  list_item_spans.each do |span|
    # ...
  end
end

private

# Runs the version check and the DOM query once, then reuses the result
def list_item_spans
  @list_item_spans ||= begin
    path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
    @document.tree.css(path)
  end
end
```

**Impact:**
- Reduces version checks from 3 to 1
- Caches the DOM query result, so the tree is searched only once
- Removes the duplicated selector logic, which simplifies the code and improves maintainability

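
One general property of `||=` memoization is that the cached result reflects the data at the time of the first call. The sketch below illustrates that with plain strings; the `ListSpanCacheSketch` class, its `queries` counter, and the `reset_list_item_spans!` helper are hypothetical and not part of the gem. Whether this matters in practice depends on whether the tree gains or loses matching nodes between calls, which this document does not state.

```ruby
# Hypothetical stand-in: `nodes` replaces the parsed tree, and the `select`
# call replaces a query such as @document.tree.css('li span').
class ListSpanCacheSketch
  attr_reader :queries

  def initialize(nodes)
    @nodes = nodes
    @queries = 0
  end

  def add_node(node)
    @nodes << node
  end

  def list_item_spans
    @list_item_spans ||= begin
      @queries += 1
      @nodes.select { |node| node.start_with?('li span') }
    end
  end

  # Hypothetical helper: drop the cache after a structural change.
  def reset_list_item_spans!
    @list_item_spans = nil
  end
end

cache = ListSpanCacheSketch.new(['li span: one', 'p: intro'])
cache.list_item_spans # first call runs the query
cache.list_item_spans # second call reuses the cached result
cache.queries         # => 1

cache.add_node('li span: two')
cache.list_item_spans.size # => 1, the cache predates the new node
cache.reset_list_item_spans!
cache.list_item_spans.size # => 2
```
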
### 3. Improved CSS Selectors

**Changes:**
- `td p` → `td > p` (direct child selector)
- `li p` → `li > p` (direct child selector)
- `table tr:first td` → `table tr:first-child > td` (more specific)
- `.search()` → `.css()` (consistent API usage)

**Impact:**
- Direct child selectors (`>`) match only immediate children, so they are more specific and can be more efficient
- Consistent use of the `.css()` method improves code clarity
- More precise selectors reduce unnecessary element matching

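
The difference between a descendant selector and a direct child selector can be shown without a parser. The sketch below is an illustration only: the `Node` struct and `node` helper are hypothetical and do not use the Nokogiri API. It models a table cell that contains one paragraph directly and another nested inside a `div`.

```ruby
# Minimal tree node; a stand-in for parsed HTML elements.
Node = Struct.new(:name, :children) do
  # Every node below this one, at any depth.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

def node(name, *children)
  Node.new(name, children)
end

# Models <td><p></p><div><p></p></div></td>
cell = node('td', node('p'), node('div', node('p')))

descendant_matches = cell.descendants.count { |n| n.name == 'p' } # like `td p`
child_matches = cell.children.count { |n| n.name == 'p' }         # like `td > p`

puts "td p   matches #{descendant_matches} element(s)"
puts "td > p matches #{child_matches} element(s)"
```

The two selectors return the same elements only when every matching paragraph is an immediate child of its cell or list item, so the narrower selector is a behavioral choice as well as a performance one.
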
### 4. Configuration Updates

**Fixed `.rubocop.yml`:**
- Changed `require:` to `plugins:` for rubocop extensions
- Updated `Metrics/LineLength` to `Layout/LineLength`
- Auto-fixed style issues

## Benchmark Results

Running `script/benchmark` demonstrates the improvements:

```
                                             user     system      total        real
CSS selector (td > p):                   0.018785   0.000000   0.018785 (  0.018785)
CSS selector (td p):                     0.019225   0.000000   0.019225 (  0.019226)
Process styled elements (single pass):   0.021174   0.000000   0.021174 (  0.021174)
Process styled elements (two passes):    0.041200   0.000000   0.041200 (  0.041208)
```

**Key findings:**
- Single-pass processing takes about **48% less time** than two passes (0.021s vs 0.041s)
- Direct child selectors perform comparably to descendant selectors (0.0188s vs 0.0192s, about 2% apart)
- The savings should grow with document size, since each avoided traversal visits every styled element

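
The percentages follow directly from the `real` column. As a worked check, the snippet below recomputes them; the `percent_less_time` helper and the variable names are introduced here for illustration.

```ruby
# "real" column from the benchmark output above, in seconds.
single_pass = 0.021174
two_passes  = 0.041208
child_sel   = 0.018785
descendant  = 0.019226

# Share of the slower run's time that the faster run avoids.
def percent_less_time(faster, slower)
  ((slower - faster) / slower * 100).round(1)
end

time_saved = percent_less_time(single_pass, two_passes)
speedup = (two_passes / single_pass).round(2)
selector_gap = percent_less_time(child_sel, descendant)

puts "single pass: #{time_saved}% less time, #{speedup}x speedup"
puts "td > p vs td p: #{selector_gap}% less time"
```

With these timings the single pass saves 48.6% of the run time (a 1.95x speedup), while the two selector forms differ by 2.3%.
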
## Testing

The new test file `test/test_word_to_markdown_performance.rb` verifies that:
- Styled elements are processed only once and cached
- The list item spans selector is memoized
- Empty styled elements are handled correctly

All existing tests continue to pass, ensuring backward compatibility.

## Usage

The optimizations are transparent to users. No API changes were made, so existing code continues to work exactly as before, just faster.

To measure performance improvements in your own use case:

```bash
bundle exec ruby script/benchmark
```

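
For a comparison that is independent of the gem, a harness along the following lines can be adapted. This is a sketch under stated assumptions: it uses Ruby's standard `Benchmark` module, the `FONT_SIZES` data is synthetic, and `single_pass` and `two_passes` are hypothetical stand-ins, so it is not the content of `script/benchmark`. Timings on plain arrays will not match the figures above, because the real cost in the gem comes from DOM queries.

```ruby
require 'benchmark'

# Synthetic input: integers standing in for styled elements' font sizes.
FONT_SIZES = Array.new(50_000) { |i| 8 + (i % 30) }
MIN_HEADING_SIZE = 20 # assumed threshold, for illustration only

# Walks the input twice, once per result.
def two_passes(sizes)
  headings = sizes.select { |size| size >= MIN_HEADING_SIZE }
  rounded = sizes.map { |size| size.round(-1) }.uniq.sort
  [headings, rounded]
end

# Walks the input once and collects both results.
def single_pass(sizes)
  headings = []
  rounded = []
  sizes.each do |size|
    rounded << size.round(-1)
    headings << size if size >= MIN_HEADING_SIZE
  end
  [headings, rounded.uniq.sort]
end

Benchmark.bm(14) do |bm|
  bm.report('single pass:') { 20.times { single_pass(FONT_SIZES) } }
  bm.report('two passes:') { 20.times { two_passes(FONT_SIZES) } }
end
```

Swapping the synthetic array for a query against a parsed document gives a measurement closer to real conversion work.
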
## Future Optimization Opportunities

Potential areas for further optimization:
1. Parallel processing for independent conversion operations
2. Streaming processing for very large documents
3. Cache parsed CSS styles for reuse
4. Optimize regex patterns in string processing

## Conclusion

These optimizations reduce redundant work during conversion without changing the API or breaking existing functionality. The improvements are most noticeable with:
- Large documents with many styled elements
- Documents with extensive list structures
- Batch processing scenarios

The changes follow Ruby best practices and maintain code readability while delivering measurable performance gains.
