This repository contains a comparative analysis of GLiNER (a zero-shot NER model) and spaCy (a traditional statistical NER framework) on the Enron Email Dataset.
The objective is to evaluate the performance, scalability, and flexibility of the two NER systems across 7 business-critical entity types:
`PERSON`, `ORG`, `EMAIL`, `DATE`, `MONEY`, `PRODUCT`, `ERROR_CODE`
- GLiNER demonstrates superior flexibility for "dynamic" entities like `ERROR_CODE` and `PRODUCT` without requiring retraining.
- spaCy offers significantly higher throughput (docs/sec) and lower latency, making it ideal for stable, high-volume production pipelines.
- GLiNER is the recommended choice for evolving schemas where new entity types are frequently added.
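
The throughput and latency comparison above can be reproduced in spirit with a small timing harness. The following is a minimal sketch, not the code in `benchmark_models.py`: `measure_throughput` and `dummy_predict` are hypothetical names, and the stand-in predictor would be replaced by a function that wraps a loaded GLiNER or spaCy model.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def measure_throughput(predict: Callable[[str], object], docs: Iterable[str]) -> dict:
    """Time a per-document predict function; report docs/sec and mean latency."""
    latencies: List[float] = []
    start = time.perf_counter()
    for doc in docs:
        t0 = time.perf_counter()
        predict(doc)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    n = len(latencies)
    return {
        "docs": n,
        "docs_per_sec": n / elapsed if elapsed > 0 else float("inf"),
        "mean_latency_ms": 1000 * mean(latencies) if latencies else 0.0,
    }


# Stand-in predictor for illustration only (assumption): a real run would
# call a GLiNER or spaCy model here instead of splitting on whitespace.
def dummy_predict(text: str) -> list:
    return text.split()


stats = measure_throughput(
    dummy_predict,
    ["Please wire $500 to the vendor.", "Login fails with error E1234."],
)
print(stats)
```

Timing each document individually gives latency, while the wall-clock total gives docs/sec. Batched inference would need a different harness, since per-document latency is no longer directly observable.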
- `Enron_NER_Comparative_Evaluation_Report.pdf`: Detailed performance analysis, F1 scores, and throughput metrics.
- `processed_ner_dataset.json`: Extracted entities with exact character-level span offsets.
- `preprocess_data.py`: Cleaning and sampling the Enron corpus.
- `generate_silver_standard.py`: Creating a validation set for F1 scoring.
- `benchmark_models.py`: Running GLiNER and spaCy inference.
- `generate_report.py`: Metric calculation and PDF generation.
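
Because the extracted entities carry character-level span offsets, a quick sanity check is to confirm that each offset pair reproduces the entity's surface text. The sketch below assumes a hypothetical record layout (`text`, plus `entities` with `start`, `end`, `label`, and `text` keys); the actual schema of `processed_ner_dataset.json` may differ.

```python
from typing import Dict, List


def misaligned_spans(record: Dict) -> List[Dict]:
    """Return entities whose [start, end) offsets do not reproduce their text."""
    text = record["text"]
    return [
        ent for ent in record["entities"]
        if text[ent["start"]:ent["end"]] != ent["text"]
    ]


# Hypothetical record for illustration; field names are assumptions.
record = {
    "text": "Jane Doe emailed Acme on Monday.",
    "entities": [
        {"start": 0, "end": 8, "label": "PERSON", "text": "Jane Doe"},
        {"start": 17, "end": 21, "label": "ORG", "text": "Acme"},
        # Deliberately off by one so the check has something to flag.
        {"start": 25, "end": 30, "label": "DATE", "text": "Monday"},
    ],
}

bad = misaligned_spans(record)
print([ent["label"] for ent in bad])
```

End offsets are treated as exclusive here, matching Python slicing. If the dataset stores inclusive end offsets, the slice would need `ent["end"] + 1` instead.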
Detailed metrics, including span-level F1 and docs/sec, are available in the Evaluation Report.
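
For readers unfamiliar with span-level F1, here is a minimal sketch of one common definition: a prediction counts as correct only if its start offset, end offset, and label all match a gold span exactly. This strict matching rule is an assumption; the report may use a different one (for example, partial overlap), and `span_f1` is a hypothetical helper, not a function from this repository.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def span_f1(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Strict span-level precision, recall, and F1 over exact (start, end, label) matches."""
    tp = len(gold & pred)  # true positives: spans present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative spans only; these are not results from the evaluation.
gold = {(0, 8, "PERSON"), (17, 21, "ORG"), (25, 31, "DATE")}
pred = {(0, 8, "PERSON"), (17, 21, "PRODUCT"), (25, 31, "DATE"), (40, 45, "MONEY")}

scores = span_f1(gold, pred)
print(scores)
```

In this toy example two of four predictions match, and two of three gold spans are recovered, so a span with the right boundaries but the wrong label counts against both precision and recall.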