
Commit 49079f6

Find index of right-to-work label when parsing PDF (#89)
* Find the index of the right-to-work question instead of assuming a fixed index
* Handle cases where the information cannot be extracted, e.g.
  - corrupted PDF
  - weird text joining in PDF
* Extract static text into a variable
1 parent f06c581 commit 49079f6

1 file changed: +10 -3 lines changed

src/shortlister/model.py

@@ -262,7 +262,14 @@ def extract_info_from_text(lines: List[str]):
 
     # removes header/footer and other irrelevant info
     applicant_info = lines[1:-5]
-    right_to_work = lines[-5:-1]
+
+    right_to_work_text = "Do you have the unrestricted right to work in the UK?"
+
+    try:
+        right_to_work_index = lines.index(right_to_work_text)
+        right_to_work = lines[right_to_work_index:(right_to_work_index + 4)]
+    except ValueError:
+        right_to_work = []
 
     # filter out the field name and retain only the info to applicant
     for label in labels:
@@ -279,8 +286,8 @@ def extract_info_from_text(lines: List[str]):
     # finds where the question is and checks the next index which contains the answer to the question
     applicant_right_to_work = None
     visa_req_text = None
-    if "Do you have the unrestricted right to work in the UK?" in right_to_work:
-        i = right_to_work.index("Do you have the unrestricted right to work in the UK?")
+    if right_to_work_text in right_to_work:
+        i = right_to_work.index(right_to_work_text)
         if right_to_work[i + 1] == "No":
             j = right_to_work.index(
                 "If no, please give details of your VISA requirements"

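For readers who want to try the lookup in isolation, here is a minimal standalone sketch of the approach in this commit: find the label with `list.index`, slice the following lines, and fall back to an empty list on `ValueError`. The function names `find_right_to_work` and `read_answer` and the `window` parameter are hypothetical and do not appear in `src/shortlister/model.py`. The length check in `read_answer` is an added precaution that is not part of the commit.

```python
from typing import List, Optional

# Label text copied from the diff in this commit.
RIGHT_TO_WORK_TEXT = "Do you have the unrestricted right to work in the UK?"


def find_right_to_work(lines: List[str], window: int = 4) -> List[str]:
    """Return the label line plus the lines after it (up to `window` total).

    Returns [] when the label is not present as an exact line, for example
    when the PDF is corrupted or the label was joined with other text.
    """
    try:
        start = lines.index(RIGHT_TO_WORK_TEXT)
    except ValueError:
        return []
    # Slicing past the end of a list is safe in Python; it just returns fewer items.
    return lines[start:start + window]


def read_answer(section: List[str]) -> Optional[str]:
    """Hypothetical helper: return the line right after the label, or None."""
    # Assumption: the answer sits on the line immediately after the label.
    if len(section) < 2:
        return None
    return section[1]
```

With `page = ["Header", "Name", "Jane Doe", RIGHT_TO_WORK_TEXT, "Yes", "Footer"]`, `read_answer(find_right_to_work(page))` yields `"Yes"`, and a page without the label yields `None` instead of raising.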
