Fix/multi sdg #19
Merged
Pull Request Overview
This PR enhances the SDG classification logic by introducing a forced SDG mechanism, updating the n_classify_slice function to use predict_proba and probability sorting, and incorporating corpus-based filtering in batch generation.
- Enhanced n_classify_slice to accept a forced SDG list and sort predictions by probability.
- Updated tests for n_classify_slice to cover both forced and non-forced classification scenarios.
- Added a corpus_name parameter for filtering documents in batch generation.
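The corpus filtering mentioned in the last bullet can be illustrated with a small in-memory sketch. This is not the code from `generate_to_classify_batch.py`: the `Doc` class, the `select_batch` function, and the `batch_size` parameter are hypothetical names used only for illustration, and the real implementation presumably filters at the database level.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document."""
    doc_id: str
    corpus: str


def select_batch(
    docs: Sequence[Doc],
    corpus_name: Optional[str] = None,
    batch_size: int = 2,
) -> List[Doc]:
    """Return up to batch_size documents, optionally restricted to one corpus."""
    # With no corpus_name, every document is a candidate (assumed default).
    candidates = [
        d for d in docs if corpus_name is None or d.corpus == corpus_name
    ]
    return candidates[:batch_size]
```

Passing `corpus_name=None` keeps the unfiltered behavior in this sketch, so existing callers would not need to change.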
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| welearn_datastack/nodes_workflow/DocumentClassifier/generate_to_classify_batch.py | Added the corpus_name parameter to allow for corpus filtering during batch generation. |
| welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py | Updated n_classify_slice invocation to include the forced_sdg parameter and revised logic. |
| welearn_datastack/modules/sdgs_classifiers.py | Modified n_classify_slice to use predict_proba and probability sorting with forced SDG. |
| tests/document_classifier/test_sdgs_classifiers.py | Adjusted tests to validate n_classify_slice behavior with forced SDG scenarios. |
| tests/document_classifier/test_document_classifier.py | Removed redundant slice initialization and updated mock return values for consistency. |
Comments suppressed due to low confidence (1)
welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py:112
- Consider renaming 'externaly_classified_flag' to 'externally_classified_flag' to improve clarity and consistency.
    forced_sdg=(
        s.document.details[key_external_sdg]
        if externaly_classified_flag
        else None
    ),
sandragjacinto approved these changes on Apr 25, 2025
Co-authored-by: Sandra Guerreiro <sandragjacinto@gmail.com>
This pull request introduces significant enhancements to the SDG (Sustainable Development Goals) classification logic, including updates to the `n_classify_slice` function, modifications to tests to reflect these changes, and improvements in batch generation functionality. The most notable updates include introducing a forced SDG mechanism, improving the classification logic with probability sorting, and adding support for corpus filtering in batch generation.

Enhancements to SDG Classification Logic:
- `n_classify_slice` now includes a `forced_sdg` parameter, allowing the classification to prioritize specific SDGs. The function now uses `predict_proba` for probability-based classification and sorts SDGs by their scores, applying a threshold of 0.5 for classification.
- The `main` function in `document_classifier.py` now passes `forced_sdg` values when external SDGs are provided, ensuring alignment with the new classification logic.

Test Updates:
- `test_main_externally_classified` in `test_document_classifier.py` now uses the updated `n_classify_slice` logic, and the redundant `slice_test_id` initialization has been removed.
- `test_sdgs_classifiers.py` now validates the behavior of `n_classify_slice` with and without `forced_sdg`, ensuring correct SDG assignment or non-classification.

Batch Generation Improvements:
- A `corpus_name` parameter in `generate_to_classify_batch.py` enables filtering of documents by corpus during batch generation.

Miscellaneous:
- `Pipeline` from `sklearn.pipeline` is imported to ensure type safety for the loaded classifier model.
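The probability sorting, 0.5 threshold, and forced SDG behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual `n_classify_slice` code: the `pick_sdg` function and its signature are hypothetical, the input is assumed to be one row of `predict_proba` output paired with the classifier's class labels, and two points are assumptions the description does not confirm: that a forced SDG list restricts the candidates, and that the threshold is inclusive and still applies when SDGs are forced.

```python
from typing import Optional, Sequence

THRESHOLD = 0.5  # threshold value stated in the PR description


def pick_sdg(
    probas: Sequence[float],
    classes: Sequence[int],
    forced_sdg: Optional[Sequence[int]] = None,
    threshold: float = THRESHOLD,
) -> Optional[int]:
    """Return the most probable SDG label, or None if nothing qualifies."""
    # Pair each SDG label with its probability, most probable first.
    ranked = sorted(zip(classes, probas), key=lambda pair: pair[1], reverse=True)

    # Assumption: a forced SDG list restricts the candidates to those labels.
    if forced_sdg:
        allowed = set(forced_sdg)
        ranked = [(sdg, p) for sdg, p in ranked if sdg in allowed]

    if not ranked:
        return None

    best_sdg, best_p = ranked[0]
    # Assumption: inclusive threshold, applied with or without forced SDGs.
    return best_sdg if best_p >= threshold else None
```

For example, `pick_sdg([0.1, 0.7, 0.6], [1, 2, 3], forced_sdg=[1, 3])` selects SDG 3 in this sketch, because SDG 2 is excluded by the forced list even though it has the highest probability.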