I am trying to create a YAML configuration for this file. Below is the YAML structure I have created. I want to extract the Full Name and the reference code. I am able to get the reference code, but not the Full Name.
# Use the pdfbox parser, since it's the same one we used to originally extract the text to build this planning document.
extractor: "pdf.pdfbox"
# All measurements are in points. 1 point = 1/72 of an inch.
# x-coordinates are from the left edge of the page.
# y-coordinates are from the top edge of the page.
header:
  # ignore anything less than this many points from the top, default and per-page
  default: 690
footer:
  # ignore anything less than this many points from the bottom, default and per-page
  default: 7160
# Text segments are generally parsed in order, top to bottom, left to right.
# If two text segments have y-coordinates within this many points, consider them on the same line,
# and process the one further left first, even if it is 0.4pt lower on the page.
maxRowDistance: 4
# Define the output data record.
# Since the main record type we're collecting information on is our employees,
# we'll have that be the root type for our harvested information.
rootRecordType: RAF
recordTypes:
  RAF:
    label: "RAF" # Labels are used when nested recordTypes come into play, like this document.
    valueTypes:
      # Not sure what to name a valueType? Just make something up!
      - URC
      - Name
valueTypes:
  URC:
    # In the CSV, use "Unique Reference Code" as the column header instead of "URC".
    label: "Unique Reference Code"
  Name:
    label: "Full Name"
# Now we define the finite-state machine
# Let's name the state that our machine starts off with:
initialState: "INIT"
# When each text segment is encountered, each transition for the current state is checked.
states:
  INIT:
    include: false
    transitions:
      - condition: URC
        nextState: URC
      - condition: any
        nextState: INIT
  URC:
    startRecord: true
    transitions:
      - condition: any
        nextState: Name
  Name:
    include: true
    transitions:
      - condition: Name
        nextState: Name
      - condition: any
        nextState: INIT
# Here we define the conditions:
conditions:
  # An example of comparing text with regex.
  # In this case, we're making sure that the text contains a 32-character lowercase hexadecimal string.
  URC: 'text =~ /\b[a-f0-9]{32}\b/'
  Name: 'text =~ /^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$/'
  # Need a condition that is always true? "1=1" does that for you.
  any: "1 = 1"
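
To see where each text segment ends up, the state machine above can be traced outside the parser with a minimal Python sketch. This is not the parser's own code and rests on several assumptions: the `=~` operator is approximated with `re.search`, the first matching transition wins, and each segment is attributed to the state that the transition enters. The sample segments are hypothetical and not taken from the actual PDF.

```python
import re

# Conditions copied from the YAML above.
# Assumption: `=~` behaves like a regex search, approximated here with re.search.
CONDITIONS = {
    "URC": lambda text: re.search(r"\b[a-f0-9]{32}\b", text) is not None,
    "Name": lambda text: re.search(
        r"^[A-Z][a-z]+(?: [A-Z][a-z]+)* [A-Z][a-z]+$", text
    ) is not None,
    "any": lambda text: True,
}

# States copied from the YAML above as (condition, nextState) pairs, checked in order.
STATES = {
    "INIT": [("URC", "URC"), ("any", "INIT")],
    "URC": [("any", "Name")],
    "Name": [("Name", "Name"), ("any", "INIT")],
}


def trace(segments, initial_state="INIT"):
    """Return (segment, state entered) pairs under the assumptions stated above."""
    state = initial_state
    result = []
    for text in segments:
        for condition, next_state in STATES[state]:
            if CONDITIONS[condition](text):
                state = next_state
                break
        result.append((text, state))
    return result


if __name__ == "__main__":
    # Hypothetical segments: a name that precedes the code, then the code, then other text.
    sample = ["Jane Doe", "d41d8cd98f00b204e9800998ecf8427e", "Page 1"]
    for text, state in trace(sample):
        print(f"{state:5} <- {text!r}")
```

Under these assumptions, a name that appears before the reference code stays in `INIT` and is dropped, while whatever segment follows the code is placed in `Name` whether or not it matches the `Name` pattern, because the only transition out of `URC` uses `any`. The `Name` pattern itself accepts only two or more space-separated words that each start with one capital letter followed by lowercase letters, so text such as `JOHN SMITH` or `Mary-Jane Smith` would not satisfy it.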
