gauravvij/gliner-spacy-comparison-neo

Enron NER Comparative Evaluation: GLiNER vs spaCy

This repository contains a comparative analysis of GLiNER (a zero-shot NER model) and spaCy (a traditional statistical NER framework) on the Enron Email Dataset.

Objective

Evaluate the performance, scalability, and flexibility of NER systems across 7 business-critical entity types:

  • PERSON, ORG, EMAIL, DATE, MONEY, PRODUCT, ERROR_CODE

Key Findings

  • GLiNER demonstrates superior flexibility for "dynamic" entities like ERROR_CODE and PRODUCT without requiring retraining.
  • spaCy offers significantly higher throughput (docs/sec) and lower latency, making it ideal for stable, high-volume production pipelines.
  • GLiNER is the recommended choice for evolving schemas where new entity types are frequently added.
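The flexibility difference described above can be sketched roughly as follows. This is an illustration, not the repository's benchmark code: the model names (`urchade/gliner_medium-v2.1`, `en_core_web_sm`), the 0.5 threshold, and the helper names `gliner_entities` and `spacy_entities` are all assumptions. Both libraries are imported lazily so the snippet loads even when only one of them is installed.

```python
# The seven entity types evaluated in this repository.
LABELS = ("PERSON", "ORG", "EMAIL", "DATE", "MONEY", "PRODUCT", "ERROR_CODE")


def gliner_entities(text, labels=LABELS, model_name="urchade/gliner_medium-v2.1"):
    """Zero-shot extraction: labels are passed at inference time.

    model_name and the threshold are assumptions, not the repo's settings.
    """
    from gliner import GLiNER  # requires: pip install gliner

    model = GLiNER.from_pretrained(model_name)
    preds = model.predict_entities(text, list(labels), threshold=0.5)
    # Each prediction is a dict carrying character offsets and a label.
    return [(p["start"], p["end"], p["label"]) for p in preds]


def spacy_entities(text, model_name="en_core_web_sm"):
    """Pretrained pipeline with a fixed label set.

    The stock English pipelines cover PERSON, ORG, DATE, MONEY and PRODUCT,
    but not EMAIL or ERROR_CODE; those would need rules or retraining.
    """
    import spacy  # requires: pip install spacy, plus the model download

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
```

Adding a new entity type on the GLiNER side means adding a string to `LABELS`; on the spaCy side it means extending the pipeline itself.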

Deliverables

  • Enron_NER_Comparative_Evaluation_Report.pdf: Detailed performance analysis, F1 scores, and throughput metrics.
  • processed_ner_dataset.json: Extracted entities with exact character-level span offsets.
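To make "exact character-level span offsets" concrete, here is a minimal sketch of what one record might look like and how its offsets could be checked. The field names (`text`, `entities`, `label`, `start`, `end`) and the sample sentence are hypothetical; the actual schema of processed_ner_dataset.json may differ.

```python
import json

# Hypothetical record layout; offsets are half-open [start, end) character indices.
record = {
    "text": "Please wire $500 to Ken Lay by Friday.",
    "entities": [
        {"label": "MONEY", "start": 12, "end": 16, "text": "$500"},
        {"label": "PERSON", "start": 20, "end": 27, "text": "Ken Lay"},
        {"label": "DATE", "start": 31, "end": 37, "text": "Friday"},
    ],
}


def misaligned_spans(rec):
    """Return entities whose offsets do not slice out their own surface text."""
    return [
        ent
        for ent in rec["entities"]
        if rec["text"][ent["start"]:ent["end"]] != ent["text"]
    ]


# Round-trip through JSON to mimic reading the file from disk.
loaded = json.loads(json.dumps(record))
print(misaligned_spans(loaded))
```

A check like this is cheap to run over the whole file and catches off-by-one offsets before they distort span-level scores.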

Components

  • preprocess_data.py: Cleaning and sampling the Enron corpus.
  • generate_silver_standard.py: Creating a validation set for F1 scoring.
  • benchmark_models.py: Running GLiNER and spaCy inference.
  • generate_report.py: Metric calculation and PDF generation.
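As a rough illustration of the throughput measurement a benchmarking step like this needs, the sketch below times an arbitrary prediction callable over a list of documents. The function name `docs_per_second` and the stand-in predictor are hypothetical and not taken from benchmark_models.py.

```python
import time
from typing import Callable, Iterable


def docs_per_second(predict: Callable[[str], object], docs: Iterable[str]) -> float:
    """Run predict on every document once and return documents per second."""
    batch = list(docs)
    start = time.perf_counter()
    for doc in batch:
        predict(doc)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast runs.
    return len(batch) / elapsed if elapsed > 0 else float("inf")


# Stand-in predictor; a real run would pass a GLiNER or spaCy wrapper instead.
rate = docs_per_second(lambda d: d.split(), ["Meeting moved to Monday."] * 1000)
print(f"{rate:.0f} docs/sec")
```

Model loading is deliberately left outside the timed loop, so the number reflects inference only; a fair comparison would also warm up each model and fix the batch size.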

Performance Summary

Detailed metrics, including span-level F1 and throughput (docs/sec), are available in Enron_NER_Comparative_Evaluation_Report.pdf.
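For readers unfamiliar with span-level F1, the following sketch shows one common way to compute it, assuming exact matching: a predicted span counts as correct only if its start offset, end offset, and label all match a gold span. Whether the report uses exact or partial matching is not stated here, and `span_prf` is a hypothetical helper.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def span_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over sets of labeled spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [(20, 27, "PERSON"), (12, 16, "MONEY")]
pred = [(20, 27, "PERSON"), (12, 16, "DATE")]  # right span, wrong label
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under exact matching, the mislabeled span counts as both a false positive and a false negative, which is why a single labeling error halves both precision and recall in this example.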
