Skip to content

feat: Cache Labels for Taxonomy #7383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 30 additions & 11 deletions label_studio/data_manager/actions/cache_labels.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from core.permissions import AllPermissions
from core.redis import start_job_async_or_sync
from tasks.models import Annotation, Prediction, Task
from label_studio_sdk.label_interface import LabelInterface

logger = logging.getLogger(__name__)
all_permissions = AllPermissions()
Expand All @@ -18,7 +19,9 @@ def cache_labels_job(project, queryset, **kwargs):
source_class = Annotation if source == 'annotations' else Prediction
control_tag = request_data.get('custom_control_tag') or request_data.get('control_tag')
with_counters = request_data.get('with_counters', 'Yes').lower() == 'yes'

label_interface = LabelInterface(project.label_config)
label_interface_tags = {tag.name:tag for tag in label_interface.find_tags('control')}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's better to create on the fly after the line 80:
https://github.com/HumanSignal/label-studio/pull/7383/files#diff-edcc7be940299b140a6cfd30643998168b462e92102a5418d5f570d8babf0a53R80

else: 
    from_name = region['from_name']
    label_interface_tags[from_name] = label_interface.get_control(from_name)

because, as I wrote earilier:

Better to call it for every annotation and for every task on the fly and cache it on the fly as well. Your label config can be changed and a part of annotations will stay out of label config. So, the only way to know which from_names you need to process is just to go through all the annotations

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand. The project's label interface configuration could change DURING the label_caching process, granted, but wouldn't the label caching have to be restarted anyways to capture the changes? If it caches on the fly, then different parts of a cached column could have different formatting. That seems worse than a whole-ly out of date formatting. Can you explain? The way I have it right now, all control tags from the label interface configs, incl. everything that a from_name would point to, is cached, once, at the start of the cache_label operation.

If you're concerned about efficiency, then moving it out of the extract_labels( ) makes the most sense. The efficiency gained from moving it to where you propose is only relevant for when the "ALL" cache label tag is used. For any specific tag, it's only called once, because the potentially large for region in annotation.result: loop is mitigated by the region['from_name'] == control_tag clause. Moving it to where I've done means the expensive XML parsing only happens once and not 1 or N (ALL) times per Task per Annotation.

The way I have it you get taxonomy cache_labels columns that reflect the the Label Interface UI with minimal additional calculation overhead.

It's possible I'm missing something. Why is it important to re-parse the label interface config for every annotation? I'm happy to change my commits if a compelling use case is presented.


if source == 'annotations':
column_name = 'cache'
else:
Expand All @@ -38,7 +41,7 @@ def cache_labels_job(project, queryset, **kwargs):
task_labels = []
annotations = source_class.objects.filter(task=task).only('result')
for annotation in annotations:
labels = extract_labels(annotation, control_tag)
labels = extract_labels(annotation, control_tag, label_interface_tags)
task_labels.extend(labels)

# cache labels in separate data column
Expand All @@ -57,20 +60,36 @@ def cache_labels_job(project, queryset, **kwargs):
return {'response_code': 200, 'detail': f'Updated {len(tasks)} tasks'}


def extract_labels(annotation, control_tag):
def extract_labels(annotation, control_tag, label_interface_tags=None):
labels = []
for region in annotation.result:
# find regions with specific control tag name or just all regions if control tag is None
if (control_tag is None or region['from_name'] == control_tag) and 'value' in region:
# scan value for a field with list of strings,
# as bonus it will work with textareas too
# scan value for a field with list of strings (eg choices, textareas)
# or taxonomy (list of string-lists)
for key in region['value']:
if (
isinstance(region['value'][key], list)
and region['value'][key]
and isinstance(region['value'][key][0], str)
):
labels.extend(region['value'][key])
if region['value'][key] and isinstance(region['value'][key], list):

if key == 'taxonomy':
showFullPath = 'true'
pathSeparator = '/'
if label_interface_tags is not None and region['from_name'] in label_interface_tags:
# if from_name is not a custom_control tag, then we can try to fetch taxonomy formatting params
label_interface_tag = label_interface_tags[region['from_name']]
showFullPath = label_interface_tag.attr.get('showFullPath', 'false')
pathSeparator = label_interface_tag.attr.get('pathSeparator', '/')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this label_interface.* operations are pretty heavy, because they work with XML parsing each time. I think, we need to cache them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like this draft:

# Add a cache dictionary outside the loop to store results for each 'from_name'
control_cache = {}

if label_interface is not None:
    # Check if we already cached this control
    if region['from_name'] not in control_cache:
        # Cache the result of find_tags and get_control for this 'from_name'
        if label_interface.find_tags('control', match_fn=lambda tag: tag.name == region['from_name']):
            control = label_interface.get_control(region['from_name'])
            control_cache[region['from_name']] = {
                'showFullPath': control.attr.get('showFullPath', 'false'),
                'pathSeparator': control.attr.get('pathSeparator', '/')
            }
        else:
            # Cache a default value if no control is found
            control_cache[region['from_name']] = {
                'showFullPath': 'false',
                'pathSeparator': '/'
            }
    ...
    # Use the cached values
    showFullPath = control_cache[region['from_name']]['showFullPath']
    pathSeparator = control_cache[region['from_name']]['pathSeparator']

    if showFullPath == 'false':
        for elems in region['value'][key]:
            labels.append(elems[-1])  # Just the leaf node of a taxonomy selection
    else:
        for elems in region['value'][key]:
            labels.append(pathSeparator.join(elems))  # The full delimited taxonomy path

if showFullPath == 'false':
for elems in region['value'][key]:
labels.append( elems[-1] ) # just the leaf node of a taxonomy selection
else:
for elems in region['value'][key]:
labels.append( pathSeparator.join(elems) ) # the full delimited taxonomy path

# other control tag types like Choices & TextAreas
elif isinstance(region['value'][key][0], str):
labels.extend(region['value'][key])

break
return labels

Expand Down
Loading