Fix/multi sdg #19

Merged
lpi-tn merged 10 commits into main from Fix/multi-sdg on Apr 28, 2025

lpi-tn (Collaborator) commented on Apr 25, 2025

This pull request introduces significant enhancements to the SDG (Sustainable Development Goals) classification logic, including updates to the n_classify_slice function, modifications to tests to reflect these changes, and improvements in batch generation functionality. The most notable updates include introducing a forced SDG mechanism, improving the classification logic with probability sorting, and adding support for corpus filtering in batch generation.

Enhancements to SDG Classification Logic:

  • Updated n_classify_slice to include a forced_sdg parameter, allowing the classification to prioritize specific SDGs. The function now uses predict_proba for probability-based classification and sorts SDGs by their scores, applying a threshold of 0.5 for classification.
  • Modified the main function in document_classifier.py to pass forced_sdg values when external SDGs are provided, ensuring alignment with the new classification logic.
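The selection logic described above can be sketched roughly as follows. This is not the code from sdgs_classifiers.py: the function name `classify_slice_sketch`, the stand-in `FixedScoreClassifier`, the one-binary-classifier-per-SDG layout, and the reading of forced_sdg as a list that restricts the candidate SDGs are all assumptions made for illustration. Only the use of predict_proba, the sort by score, and the 0.5 threshold come from the description.

```python
from typing import Dict, List, Optional, Protocol, Sequence


class ProbabilisticClassifier(Protocol):
    """Anything exposing a scikit-learn style predict_proba method."""

    def predict_proba(
        self, X: Sequence[Sequence[float]]
    ) -> Sequence[Sequence[float]]: ...


class FixedScoreClassifier:
    """Hypothetical stand-in that always returns the same positive-class score."""

    def __init__(self, positive_proba: float) -> None:
        self.positive_proba = positive_proba

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        # One [P(negative), P(positive)] row per input sample.
        return [[1.0 - self.positive_proba, self.positive_proba] for _ in X]


def classify_slice_sketch(
    embedding: Sequence[float],
    classifiers: Dict[int, ProbabilisticClassifier],
    forced_sdg: Optional[List[int]] = None,
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the best-scoring SDG number, or None if nothing passes the threshold."""
    candidates = classifiers
    if forced_sdg is not None:
        # Assumption: forced SDGs restrict which classifiers are consulted.
        candidates = {
            sdg: clf for sdg, clf in classifiers.items() if sdg in forced_sdg
        }

    # Score every candidate SDG with the positive-class probability.
    scores = [
        (sdg, clf.predict_proba([embedding])[0][1])
        for sdg, clf in candidates.items()
    ]
    # Highest probability first.
    scores.sort(key=lambda pair: pair[1], reverse=True)

    # Assumption: the comparison is strict; the PR only states "a threshold of 0.5".
    if scores and scores[0][1] > threshold:
        return scores[0][0]
    return None
```

With classifiers scoring 0.9 for SDG 1, 0.7 for SDG 4, and 0.2 for SDG 13, this sketch picks SDG 1 when nothing is forced, SDG 4 when `forced_sdg=[4]`, and returns `None` when `forced_sdg=[13]`, which mirrors the "correct SDG assignment or non-classification" outcomes mentioned for the tests.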

Test Updates:

  • Refactored test_main_externally_classified in test_document_classifier.py to use the updated n_classify_slice logic and removed redundant slice_test_id initialization.
  • Added new test cases in test_sdgs_classifiers.py to validate the behavior of n_classify_slice with and without forced_sdg, ensuring correct SDG assignment or non-classification.

Batch Generation Improvements:

  • Added a new corpus_name parameter to generate_to_classify_batch.py, enabling filtering of documents by corpus during batch generation.
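As a rough illustration of what an optional corpus filter can look like, here is a minimal in-memory sketch. The real script most likely queries a database, which this PR description does not show; `PendingDocument`, `select_batch`, `batch_size`, and the corpus names used below are hypothetical, and only the corpus_name parameter itself comes from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PendingDocument:
    """Hypothetical record for a document waiting to be classified."""

    doc_id: str
    corpus: str


def select_batch(
    documents: Iterable[PendingDocument],
    corpus_name: Optional[str] = None,
    batch_size: int = 100,
) -> List[PendingDocument]:
    """Collect up to batch_size documents, optionally limited to one corpus."""
    selected: List[PendingDocument] = []
    for doc in documents:
        # When corpus_name is None, every corpus is accepted.
        if corpus_name is not None and doc.corpus != corpus_name:
            continue
        selected.append(doc)
        if len(selected) >= batch_size:
            break
    return selected
```

Leaving `corpus_name` optional keeps the previous behavior (all corpora) as the default, so existing callers would not need to change.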

Miscellaneous:

  • Imported Pipeline from sklearn.pipeline to ensure type safety for the loaded classifier model.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR enhances the SDG classification logic by introducing a forced SDG mechanism, updating the n_classify_slice function to use predict_proba and probability sorting, and incorporating corpus-based filtering in batch generation.

  • Enhanced n_classify_slice to accept a forced SDG list and sort predictions by probability.
  • Updated tests for n_classify_slice to cover both forced and non-forced classification scenarios.
  • Added a corpus_name parameter for filtering documents in batch generation.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| welearn_datastack/nodes_workflow/DocumentClassifier/generate_to_classify_batch.py | Added the corpus_name parameter to allow for corpus filtering during batch generation. |
| welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py | Updated n_classify_slice invocation to include the forced_sdg parameter and revised logic. |
| welearn_datastack/modules/sdgs_classifiers.py | Modified n_classify_slice to use predict_proba and probability sorting with forced SDG. |
| tests/document_classifier/test_sdgs_classifiers.py | Adjusted tests to validate n_classify_slice behavior with forced SDG scenarios. |
| tests/document_classifier/test_document_classifier.py | Removed redundant slice initialization and updated mock return values for consistency. |

Comments suppressed due to low confidence (1)

welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py:112

  • Consider renaming 'externaly_classified_flag' to 'externally_classified_flag' to improve clarity and consistency.

    forced_sdg=(
        s.document.details[key_external_sdg]
        if externaly_classified_flag
        else None
    ),

Co-authored-by: Sandra Guerreiro <sandragjacinto@gmail.com>
lpi-tn merged commit 965d492 into main on Apr 28, 2025
7 checks passed
lpi-tn deleted the Fix/multi-sdg branch on April 28, 2025 08:20