scipdf.parse_pdf_to_dict() unable to capture author correctly #32

@tyoon10

Processing an academic paper PDF with scipdf.parse_pdf_to_dict(pdf_path) returns, under the 'authors' key, every name that appears in the paper: not only the actual authors but also names found in the body text and in the references.

Actual:
{
"title": title,
"authors": "William T Shaw; M Abramowitz; I A Stegun; R A Bagnold; O E Barndorff-Nielsen; E Eberlein; E ; U Keller; K Fergusson; E Platen; Warren Gilchrist; G W Hill; A W Davis; D B Madan; E Seneta; Wikipedia; On; W T Shaw; W T Shaw; I R C Buckley; G Steinbrecher; W T Shaw; Quantile Mechanics; Y Xiong",
"pub_date": "2009-02-27",
... }

Correct:
{
"title": title,
"authors": "William T Shaw",
"pub_date": "2009-02-27",
... }
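
One possible workaround, sketched under assumptions: if the parser builds its result from GROBID-style TEI XML (the issue does not show the intermediate XML, so this is a guess), the paper's own authors sit under teiHeader, while cited authors sit in the reference list in the body. Restricting the search to teiHeader would then give the "Correct" output above. The function name header_authors and the sample document are hypothetical and are not part of scipdf.

```python
import xml.etree.ElementTree as ET
from typing import List

# TEI namespace used by GROBID-style output (assumed input format).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
TEI_PREFIX = "{http://www.tei-c.org/ns/1.0}"


def header_authors(tei_xml: str) -> List[str]:
    """Return author names found in teiHeader only (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        return []
    names = []
    # Searching below the header skips names in the body and the references.
    for pers in header.findall(".//tei:author/tei:persName", TEI_NS):
        parts = [
            child.text.strip()
            for child in pers
            if child.tag in (TEI_PREFIX + "forename", TEI_PREFIX + "surname")
            and child.text
        ]
        if parts:
            names.append(" ".join(parts))
    return names


# Made-up minimal document: one header author, one cited author.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author><persName>
      <forename type="first">William</forename>
      <forename type="middle">T</forename>
      <surname>Shaw</surname>
    </persName></author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
  <text><back><div><listBibl><biblStruct><analytic>
    <author><persName>
      <forename type="first">M</forename>
      <surname>Abramowitz</surname>
    </persName></author>
  </analytic></biblStruct></listBibl></div></back></text>
</TEI>"""

# Join with "; " to match the format of the "authors" value shown above.
print("; ".join(header_authors(SAMPLE_TEI)))
```

On this sample the cited author is left out because the search never leaves teiHeader. Whether the same restriction can be applied inside scipdf depends on how it walks the XML, which I have not verified.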
