Missing title_* #14

@cverluise

Around 10% of the npl_publn records in the beta version have none of title_j, title_m, or title_main_a. Most of the time, part of these elements has been wrongly parsed into title_main_m.

How to reproduce the behaviour

SELECT
  *
FROM (
  SELECT
    *
  FROM
    `npl-parsing.patcit.beta`
  WHERE
    title_j IS NULL
    AND title_m IS NULL
    AND title_main_a IS NULL) AS parsing
JOIN (
  SELECT
    npl_publn_id AS id,
    npl_biblio
  FROM
    `usptobias.patstat.tls214`) AS tls214
ON
  tls214.id = parsing.npl_publn_id

Ideas/ solution

These citations seem to share a common pattern: they are already highly structured (e.g., NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, 1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961).
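
To make the pattern concrete, here is a minimal Python sketch that pulls the fields out of a citation shaped like the example above with a single regular expression. It is only an illustration of how regular the structure is, not the Grobid-based approach discussed in this issue. The pattern, the function name `parse_structured_citation`, and the output keys are all hypothetical, and the pattern was written against this one example, so other citations may need adjustments.

```python
import re

# Hypothetical pattern, written against the single example quoted above:
# AUTHORS: 'TITLE', JOURNAL, ..., vol. N, no. N, ... (YYYY-MM-DD), pages A - B
CITATION_RE = re.compile(
    r"^(?P<authors>[^:]+):\s*"                  # everything before the first colon
    r"'(?P<title>.+?)',\s*"                     # quoted title (may contain commas)
    r"(?P<journal>[^,]+),"                      # first comma-separated field after the title
    r".*?\bvol\.\s*(?P<volume>\d+),\s*"         # skip publisher/place, then the volume
    r"no\.\s*(?P<issue>\d+),"                   # issue number
    r".*?\((?P<date>\d{4}-\d{2}-\d{2})\),\s*"   # ISO date in parentheses
    r"pages\s*(?P<first_page>\d+)\s*-\s*(?P<last_page>\d+)"
)


def parse_structured_citation(text):
    """Return a dict of extracted fields, or None if the text does not fit the pattern."""
    match = CITATION_RE.search(text)
    return match.groupdict() if match else None


example = (
    "NIELSEN F ET AL: 'HERSTELLUNG STAUBARMER, FREIFLIESSENDER PRODUKTE', "
    "CHEMIETECHNIK, HUTHIG, HEIDELBERG, DE, vol. 22, no. 10, "
    "1 October 1993 (1993-10-01), pages 48 - 49, XP000415410, ISSN: 0340-9961"
)
fields = parse_structured_citation(example)
```

Which of these keys would correspond to title_main_a, title_j, or title_m is left open here, since that mapping is an assumption about the output schema rather than something stated in this issue.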

At this stage, training the Grobid model on these examples seems to be the best option. The citations affected by this issue would then be processed again.
