feat: use lxml instead of bs4 to parse hOCR data (#3960)
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x
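
As an illustration of why regular input suits `lxml`, here is a minimal sketch of extracting word boxes from hOCR with `lxml.etree`. It is not the PR's implementation: the function name `hocr_words_lxml`, the sample markup, and the returned dictionary keys are all assumptions made for this example, and it returns plain dictionaries rather than a dataframe.

```python
import re

from lxml import etree

# Hypothetical two-word hOCR fragment; real OCR output has more nesting.
SAMPLE_HOCR = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 200 100">
   <span class="ocrx_word" title="bbox 10 20 60 40; x_wconf 96">Hello</span>
   <span class="ocrx_word" title="bbox 70 20 130 40; x_wconf 91">world</span>
  </div>
 </body>
</html>"""

# hOCR stores geometry in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words_lxml(hocr: bytes) -> list:
    """Return one dict per ocrx_word span (hypothetical helper)."""
    # Strict XML parsing is enough here because the input is well-formed.
    root = etree.fromstring(hocr)
    rows = []
    # "{*}span" matches <span> in any namespace, including XHTML.
    for span in root.iter("{*}span"):
        if span.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(span.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        rows.append(
            {
                "text": "".join(span.itertext()).strip(),
                "left": x0,
                "top": y0,
                "width": x1 - x0,
                "height": y1 - y0,
            }
        )
    return rows


words = hocr_words_lxml(SAMPLE_HOCR)
```

A lenient parser such as `bs4` earns its overhead on malformed markup; for machine-generated hOCR that tolerance goes unused, which is the trade-off the bullets above describe.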
Example runtime profiling, run locally on the same `hocr` data from a one-page PDF, where `agent.hocr_to_dataframe_bs4` is the current method on main and `agent.hocr_to_dataframe` is this PR's method.
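
A comparison of this kind could be reproduced with the standard-library `timeit` module. The sketch below is an assumption about how one might measure it, not the profiling setup used for this PR; `best_time` and `compare_parsers` are hypothetical helper names, and no timings are implied.

```python
import timeit
from typing import Callable, Dict


def best_time(func: Callable[[str], object], data: str,
              number: int = 20, repeat: int = 5) -> float:
    """Return the best observed per-call time in seconds."""
    # timeit.repeat returns one total per round; the minimum is the least noisy.
    totals = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    return min(totals) / number


def compare_parsers(parsers: Dict[str, Callable[[str], object]],
                    data: str) -> Dict[str, float]:
    """Time each parser on the same input and return seconds per call."""
    return {name: best_time(func, data) for name, func in parsers.items()}
```

Assuming an `agent` object and an `hocr` string are already available, the two methods could be compared with `compare_parsers({"bs4": agent.hocr_to_dataframe_bs4, "lxml": agent.hocr_to_dataframe}, hocr)`.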

CHANGELOG.md (+3 -2)
@@ -1,9 +1,10 @@
-## 0.17.1-dev0
+## 0.17.1-dev1
 
 ### Enhancements
 
 - **Add image_url of images in html partitioner** `<img>` tags with non-data content include a new image_url metadata field with the content of the src attribute.
-
+- **Use `lxml` instead of `bs4` to parse hOCR data.** `lxml` is much faster than `bs4` given the hOCR data format is regular (garanteed because it is programatically generated)