Does this tool support extraction of data from complex PDF structure which contains incomplete boxes? #62

@ZarvisD

image

I am trying to create a YAML configuration for this file. Below is the YAML structure I have created. I want to extract the Full Name and the reference code. I am able to get the reference code, but not the Full Name.

# Use the pdfbox parser, since it's the same one we used to originally extract the text to build this planning document.
extractor: "pdf.pdfbox"

# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.
header:
    # ignore anything less than this many points from the top, default and per-page
  default: 690
footer:
    # ignore anything less than this many points from the bottom, default and per-page
  default: 7160

# Text segments are generally parsed in order, top to bottom, left to right.
# If two text segments have y-coordinates within this many points, consider them on the same line,
# and process the one further left first, even if it is 0.4pt lower on the page.
maxRowDistance: 4

# Define the output data record.
# The main record type we're collecting information on is RAF,
# so we'll have that be the root type for our harvested information.
rootRecordType: RAF
recordTypes:
  RAF:
    label: "RAF" # Labels are used when nested recordTypes come into play, like this document.
    valueTypes:
      # Not sure what to name a valueType? Just make something up!
      - URC
      - Name

valueTypes:
  URC:
    # In the CSV, use "Unique Reference Code" as the column header instead of "URC".
    label: "Unique Reference Code"
  Name:
    label: "Full Name"

# Now we define the finite-state machine
# Let's name the state that our machine starts off with:
initialState: "INIT"

# When each text segment is encountered, each transition for the current state is checked.
states:
  INIT:
    include: false
    transitions:
      - condition: URC
        nextState: URC

      - condition: any
        nextState: INIT

  URC:
    startRecord: true
    transitions:
      - condition: any
        nextState: Name

  Name:
    include: true
    transitions:
      - condition: Name
        nextState: Name

      - condition: any
        nextState: INIT


# Here we define the conditions:
conditions:

  # An example of comparing text with regex.
  # In this case, we're making sure that the text contains a standalone run of exactly 32 lowercase hexadecimal characters.
  URC: 'text =~ /\b[a-f0-9]{32}\b/'

  Name: 'text =~ /^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$/'

  # Need a condition that is always true? "1=1" does that for you.
  any: "1 = 1"
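
One way to narrow this down is to test the two regex conditions outside the tool. The sketch below is only a rough check in Python, and it rests on assumptions: that the tool's `=~` operator behaves like a partial (search) match, and that Python's `re` syntax agrees with the tool's regex engine for these two patterns. The sample strings are hypothetical and are not taken from the actual PDF.

```python
import re

# Patterns copied verbatim from the `conditions` block above.
URC_PATTERN = re.compile(r"\b[a-f0-9]{32}\b")
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Assumption: `=~` in the config acts like a partial (search) match."""
    return pattern.search(text) is not None


# Hypothetical text segments, shaped like what a PDF extractor might emit.
SAMPLES = [
    "d41d8cd98f00b204e9800998ecf8427e",  # 32 lowercase hex characters
    "John Smith",                        # two capitalised words
    "JOHN SMITH",                        # all caps
    "John A. Smith",                     # middle initial with a period
    "Full Name: John Smith",             # label and value in one segment
    "John",                              # single word
]

if __name__ == "__main__":
    for segment in SAMPLES:
        print(
            f"{segment!r}: "
            f"URC={matches(URC_PATTERN, segment)} "
            f"Name={matches(NAME_PATTERN, segment)}"
        )
```

If the name in the PDF arrives as an all-caps string, with an initial, fused with its label, or as a single word, the `Name` pattern as written rejects it, which could be one reason the Full Name is not captured. Comparing the raw text segments the extractor produces against these patterns is a reasonable first check.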
