-
Notifications
You must be signed in to change notification settings - Fork 2.7k
feat: Cache Labels for Taxonomy #7383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
20cb972
2d72746
2cbc80d
0c3a20f
5610b10
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
from core.permissions import AllPermissions | ||
from core.redis import start_job_async_or_sync | ||
from tasks.models import Annotation, Prediction, Task | ||
from label_studio_sdk.label_interface import LabelInterface | ||
|
||
logger = logging.getLogger(__name__) | ||
all_permissions = AllPermissions() | ||
|
@@ -18,7 +19,9 @@ def cache_labels_job(project, queryset, **kwargs): | |
source_class = Annotation if source == 'annotations' else Prediction | ||
control_tag = request_data.get('custom_control_tag') or request_data.get('control_tag') | ||
with_counters = request_data.get('with_counters', 'Yes').lower() == 'yes' | ||
|
||
label_interface = LabelInterface(project.label_config) | ||
label_interface_tags = {tag.name:tag for tag in label_interface.find_tags('control')} | ||
|
||
if source == 'annotations': | ||
column_name = 'cache' | ||
else: | ||
|
@@ -38,7 +41,7 @@ def cache_labels_job(project, queryset, **kwargs): | |
task_labels = [] | ||
annotations = source_class.objects.filter(task=task).only('result') | ||
for annotation in annotations: | ||
labels = extract_labels(annotation, control_tag) | ||
labels = extract_labels(annotation, control_tag, label_interface_tags) | ||
task_labels.extend(labels) | ||
|
||
# cache labels in separate data column | ||
|
@@ -57,20 +60,36 @@ def cache_labels_job(project, queryset, **kwargs): | |
return {'response_code': 200, 'detail': f'Updated {len(tasks)} tasks'} | ||
|
||
|
||
def extract_labels(annotation, control_tag): | ||
def extract_labels(annotation, control_tag, label_interface_tags=None): | ||
labels = [] | ||
for region in annotation.result: | ||
# find regions with specific control tag name or just all regions if control tag is None | ||
if (control_tag is None or region['from_name'] == control_tag) and 'value' in region: | ||
# scan value for a field with list of strings, | ||
# as bonus it will work with textareas too | ||
# scan value for a field with list of strings (eg choices, textareas) | ||
# or taxonomy (list of string-lists) | ||
for key in region['value']: | ||
if ( | ||
isinstance(region['value'][key], list) | ||
and region['value'][key] | ||
and isinstance(region['value'][key][0], str) | ||
): | ||
labels.extend(region['value'][key]) | ||
if region['value'][key] and isinstance(region['value'][key], list): | ||
|
||
if key == 'taxonomy': | ||
showFullPath = 'true' | ||
pathSeparator = '/' | ||
if label_interface_tags is not None and region['from_name'] in label_interface_tags: | ||
# if from_name is not a custom_control tag, then we can try to fetch taxonomy formatting params | ||
label_interface_tag = label_interface_tags[region['from_name']] | ||
showFullPath = label_interface_tag.attr.get('showFullPath', 'false') | ||
pathSeparator = label_interface_tag.attr.get('pathSeparator', '/') | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this label_interface.* operations are pretty heavy, because they work with XML parsing each time. I think, we need to cache them There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. something like this draft:
|
||
if showFullPath == 'false': | ||
for elems in region['value'][key]: | ||
labels.append( elems[-1] ) # just the leaf node of a taxonomy selection | ||
else: | ||
for elems in region['value'][key]: | ||
labels.append( pathSeparator.join(elems) ) # the full delimited taxonomy path | ||
|
||
# other control tag types like Choices & TextAreas | ||
elif isinstance(region['value'][key][0], str): | ||
labels.extend(region['value'][key]) | ||
|
||
break | ||
return labels | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's better to create on the fly after the line 80:
https://github.com/HumanSignal/label-studio/pull/7383/files#diff-edcc7be940299b140a6cfd30643998168b462e92102a5418d5f570d8babf0a53R80
because, as I wrote earilier:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand. The project's label interface configuration could change DURING the label_caching process, granted, but wouldn't the label caching have to be restarted anyways to capture the changes? If it caches on the fly, then different parts of a cached column could have different formatting. That seems worse than a whole-ly out of date formatting. Can you explain? The way I have it right now, all control tags from the label interface configs, incl. everything that a from_name would point to, is cached, once, at the start of the cache_label operation.
If you're concerned about efficiency, then moving it out of the
extract_labels( )
makes the most sense. The efficiency gained from moving it to where you propose is only relevant for when the "ALL" cache label tag is used. For any specific tag, it's only called once, because the potentially largefor region in annotation.result:
loop is mitigated by theregion['from_name'] == control_tag
clause. Moving it to where I've done means the expensive XML parsing only happens once and not 1 or N (ALL) times per Task per Annotation.The way I have it you get taxonomy cache_labels columns that reflect the the Label Interface UI with minimal additional calculation overhead.
It's possible I'm missing something. Why is it important to re-parse the label interface config for every annotation? I'm happy to change my commits if a compelling use case is presented.