How to extract the text from the LayoutItem objects when we set extract_layout=True in the parser? #695
Unanswered
michelle-unia-mermich asked this question in Q&A
Replies: 0 comments
To parse a research paper that spans two columns and has images and tables at different positions on the page, I have connected to your API and used `premium_mode`.

If I do not use `extract_layout=True` and parse as normal, the parsed text is not accurate because the caption text is mixed up with the actual paragraph text in the parsed document. The reading order is also not accurate. For example, if the page has (A) a text box at the bottom of the left column and (B) a text box at the top of the right column, a human reader will read (A) first and then (B), but in my attempts the parser reads (B) first and then (A).

To make sure that the reading order is correct and that the final document contains only the words of section headers and paragraph text, without caption text, I do the following:
- With `extract_layout=True`, we can get a list of `LayoutItem` objects from the page. Each `LayoutItem` object is labelled with one of several categories, such as `text`, `sectionHeader`, and `caption`.
- Keep only the `LayoutItem` objects labelled `text` or `sectionHeader`.
- Finally, order the `text` and `sectionHeader` `LayoutItem` objects according to the x,y coordinates in the `bbox` attribute of each `LayoutItem`.
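The filter-and-sort step above can be sketched as follows. This is only an illustration under stated assumptions, not LlamaParse's own data model: the `Box` and `Item` classes and the `reading_order` function are hypothetical stand-ins, the bbox is assumed to be `x, y, w, h` with `y` growing downward, and the page is assumed to have exactly two columns split at the page midline with no full-width blocks.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical bbox shape: top-left corner plus width and height.
    x: float
    y: float
    w: float
    h: float


@dataclass
class Item:
    # Hypothetical stand-in for a layout item: a category label and a bbox.
    label: str
    bbox: Box


def reading_order(items, page_width=1.0, keep=("text", "sectionHeader")):
    """Drop unwanted categories, then sort left column first, top to bottom."""
    kept = [it for it in items if it.label in keep]
    midline = page_width / 2

    def sort_key(it):
        center_x = it.bbox.x + it.bbox.w / 2
        column = 0 if center_x < midline else 1  # assumes two columns
        return (column, it.bbox.y, it.bbox.x)

    return sorted(kept, key=sort_key)
```

With this ordering, a block at the bottom of the left column comes before a block at the top of the right column, which is the (A) then (B) order described earlier. Pages with full-width titles or figures spanning both columns would need extra handling.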
The `LayoutItem` object has only a few attributes, such as `bbox` and the category label, and the only content I can get from it is an image of the region, fetched with a GET request. How do I get the text within each `LayoutItem` object that is labelled `text` or `sectionHeader`? I could pass this image through an OCR reader or through the parser again, but that seems expensive and wasteful, since the LlamaParse parser has already gone through those words once; the problem is only that I cannot associate each `LayoutItem` image with a text block in the final parsed document.

I also tried to use the `bbox` attribute of `LayoutItem` to identify the corresponding text section in the parsed document, by using the `PageItem` objects.
Each `PageItem` object has `bbox` and text value attributes, but its `bbox` does not match any `bbox` in the `LayoutItem` list. Basically, the `PageItem` objects recognised on each page are different from the `LayoutItem` objects on that page, and the recognition and classification of `PageItem` objects is nowhere near as good as that of `LayoutItem` objects. For example, a page with 19 `LayoutItem` objects has only 9 `PageItem` objects, and the text in one `PageItem` object may combine the text of several `caption` and `text` `LayoutItem` objects, with the same wrong reading order as in the original parsed document.

Is there a way to retrieve the words from each `LayoutItem` object without running another OCR pass or parsing those images a second time?
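Since the two sets of boxes never match exactly, a bbox comparison of this kind has to be overlap-based rather than equality-based. A minimal sketch, assuming both bboxes are plain `(x, y, w, h)` tuples in the same coordinate space (which may not hold for the real objects); `overlap_fraction` and `group_layout_by_page_item` are hypothetical helper names, not part of any library:

```python
def overlap_fraction(inner, outer):
    """Fraction of `inner`'s area that lies inside `outer`; boxes are (x, y, w, h)."""
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    inter_w = max(0.0, min(ix + iw, ox + ow) - max(ix, ox))
    inter_h = max(0.0, min(iy + ih, oy + oh) - max(iy, oy))
    area = iw * ih
    return (inter_w * inter_h) / area if area > 0 else 0.0


def group_layout_by_page_item(layout_boxes, page_boxes, threshold=0.8):
    """Map each page-item index to the layout-item indices it mostly contains."""
    groups = {i: [] for i in range(len(page_boxes))}
    for li, lbox in enumerate(layout_boxes):
        for pi, pbox in enumerate(page_boxes):
            if overlap_fraction(lbox, pbox) >= threshold:
                groups[pi].append(li)
                break  # assign each layout box to the first page box that contains it
    return groups
```

This only shows which layout boxes fall inside which page-level box. It does not split the merged text of a page-level item back into per-layout-item text, which is the part that is still missing.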
I would really appreciate your help! Please let me know if I need to provide any more details/documents.