-
Notifications
You must be signed in to change notification settings - Fork 2.7k
feat: Cache Labels for Taxonomy #7383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
👷 Deploy request for heartex-docs pending review.Visit the deploys page to approve it
|
👷 Deploy request for label-studio-docs-new-theme pending review.Visit the deploys page to approve it
|
✅ Deploy Preview for label-studio-storybook canceled.
|
In discussions with Max, it was determined that displaying just the leaf node of a selected taxonomy was not always desirable, as important information about taxonomic hierarchy paths can be lost. I've made an update such that cache_labels will match the formatting schema of the given Taxonomy tag's labeling interface Now Defaults are respected, so And all defaults, ie a basic Defaults are hardcoded and come from Label Interface Taxonomy Params . If the taxonomy control tag is a Custom control tag, naturally it will not have any Labeling Interface configuration to specify its formatting. In this case, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
label_studio/data_manager/actions/cache_labels.py:73
- [nitpick] Consider renaming 'showFullPath' to 'show_full_path' and 'pathSeparator' to 'path_separator' for consistency with Python naming conventions.
showFullPath = 'true'
# if from_name is not a custom_control tag, then we can try to fetch taxonomy formatting params | ||
showFullPath = label_interface.get_control(region['from_name']).attr.get('showFullPath', 'false') | ||
pathSeparator = label_interface.get_control(region['from_name']).attr.get('pathSeparator', '/') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this label_interface.* operations are pretty heavy, because they work with XML parsing each time. I think, we need to cache them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like this draft:
# Add a cache dictionary outside the loop to store results for each 'from_name'
control_cache = {}
if label_interface is not None:
# Check if we already cached this control
if region['from_name'] not in control_cache:
# Cache the result of find_tags and get_control for this 'from_name'
if label_interface.find_tags('control', match_fn=lambda tag: tag.name == region['from_name']):
control = label_interface.get_control(region['from_name'])
control_cache[region['from_name']] = {
'showFullPath': control.attr.get('showFullPath', 'false'),
'pathSeparator': control.attr.get('pathSeparator', '/')
}
else:
# Cache a default value if no control is found
control_cache[region['from_name']] = {
'showFullPath': 'false',
'pathSeparator': '/'
}
...
# Use the cached values
showFullPath = control_cache[region['from_name']]['showFullPath']
pathSeparator = control_cache[region['from_name']]['pathSeparator']
if showFullPath == 'false':
for elems in region['value'][key]:
labels.append(elems[-1]) # Just the leaf node of a taxonomy selection
else:
for elems in region['value'][key]:
labels.append(pathSeparator.join(elems)) # The full delimited taxonomy path
new commit: I've pre-loaded the label interface tags prior to the tasks and annotations loops. The project's label interface control tags are easily accessible via dict now, should be orders of magnitude more efficient. |
@@ -18,7 +19,9 @@ def cache_labels_job(project, queryset, **kwargs): | |||
source_class = Annotation if source == 'annotations' else Prediction | |||
control_tag = request_data.get('custom_control_tag') or request_data.get('control_tag') | |||
with_counters = request_data.get('with_counters', 'Yes').lower() == 'yes' | |||
|
|||
label_interface = LabelInterface(project.label_config) | |||
label_interface_tags = {tag.name:tag for tag in label_interface.find_tags('control')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's better to create on the fly after the line 80:
https://github.com/HumanSignal/label-studio/pull/7383/files#diff-edcc7be940299b140a6cfd30643998168b462e92102a5418d5f570d8babf0a53R80
else:
from_name = region['from_name']
label_interface_tags[from_name] = label_interface.get_control(from_name)
because, as I wrote earilier:
Better to call it for every annotation and for every task on the fly and cache it on the fly as well. Your label config can be changed and a part of annotations will stay out of label config. So, the only way to know which from_names you need to process is just to go through all the annotations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand. The project's label interface configuration could change DURING the label_caching process, granted, but wouldn't the label caching have to be restarted anyways to capture the changes? If it caches on the fly, then different parts of a cached column could have different formatting. That seems worse than a whole-ly out of date formatting. Can you explain? The way I have it right now, all control tags from the label interface configs, incl. everything that a from_name would point to, is cached, once, at the start of the cache_label operation.
If you're concerned about efficiency, then moving it out of the extract_labels( )
makes the most sense. The efficiency gained from moving it to where you propose is only relevant for when the "ALL" cache label tag is used. For any specific tag, it's only called once, because the potentially large for region in annotation.result:
loop is mitigated by the region['from_name'] == control_tag
clause. Moving it to where I've done means the expensive XML parsing only happens once and not 1 or N (ALL) times per Task per Annotation.
The way I have it you get taxonomy cache_labels columns that reflect the the Label Interface UI with minimal additional calculation overhead.
It's possible I'm missing something. Why is it important to re-parse the label interface config for every annotation? I'm happy to change my commits if a compelling use case is presented.
See issue #7382
This fix allows for the value(s) of Taxonomy annotations to be displayed by the experimental Cache Labels feature.
Prior to this PR, Cache Labels would create an all-blank column when applied to a Control Tag for a Taxonomy annotation.
With this PR, values selected in a Taxonomy annotation are correctly displayed.
The issue arose from Taxonomy annotation values being returned as a list, as opposed to a str. For Taxonomy type annotations, a list is required for each selected element so as to capture the hierarchical path taken for any given value in a taxonomy tree. In this PR, consideration was taken so as to only show the selected value, and not capture parts of the path leading up to the selected value. One or more selected values, as well as leaf and intermediate/branch values are supported but this fix.
This is a very simple, very small, low risk PR. It is also my first PR with Label Studio, yay!
My thanks to @makseq who pointed me towards the
cache_labels.py
file lines most likely responsible for the blank-taxonomy-cache-labels bug.