This module is responsible for processing raw webpage data (HTML content) to extract meaningful information using Natural Language Processing (NLP).
- Parses HTML: Extracts the main title, content, and publication date from the raw HTML.
- Extracts Keywords: Identifies relevant keywords from the article's title and content (using SpacyKeywordExtractor).
- Classifies Content: Determines the topic or category of the article (using ZeroShotClassifier).
- Formats Output: Structures the extracted information (title, snippet, keywords, topics, dates) into a standardized ProcessingResult object.
The primary function is process_webpage, which takes a WebpageData object (containing the URL, HTML content, etc.) as input and returns a ProcessingResult object, or None if essential information (such as the article date) cannot be extracted.
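
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the field names on WebpageData and ProcessingResult, the `article:published_time` meta tag, the 200-character snippet length, and the two callables standing in for SpacyKeywordExtractor and ZeroShotClassifier are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from html.parser import HTMLParser
from typing import Callable, List, Optional


@dataclass
class WebpageData:
    """Assumed input shape; the source only says it carries a URL and HTML."""
    url: str
    html: str


@dataclass
class ProcessingResult:
    """Assumed output shape, based on the fields listed in the description."""
    url: str
    title: str
    snippet: str
    keywords: List[str]
    topics: List[str]
    published: date


class _ArticleParser(HTMLParser):
    """Collects <title> text, <p> text, and a publication date from a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.paragraphs: List[str] = []
        self.published: Optional[str] = None
        self._current: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Assumption: the date lives in an Open Graph style meta tag.
        if tag == "meta" and attr.get("property") == "article:published_time":
            self.published = attr.get("content")
        elif tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def process_webpage(
    page: WebpageData,
    extract_keywords: Callable[[str], List[str]],  # stand-in for SpacyKeywordExtractor
    classify: Callable[[str], List[str]],          # stand-in for ZeroShotClassifier
) -> Optional[ProcessingResult]:
    parser = _ArticleParser()
    parser.feed(page.html)
    parser.close()

    # Without a usable publication date there is no result.
    if not parser.published:
        return None
    try:
        published = date.fromisoformat(parser.published[:10])
    except ValueError:
        return None

    title = parser.title.strip()
    content = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    text = f"{title}. {content}"
    return ProcessingResult(
        url=page.url,
        title=title,
        snippet=content[:200],
        keywords=extract_keywords(text),
        topics=classify(text),
        published=published,
    )
```

Passing the keyword extractor and classifier in as plain callables keeps the sketch runnable without spaCy or a zero-shot model installed; in the real module these roles are filled by SpacyKeywordExtractor and ZeroShotClassifier, whose interfaces are not shown here.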