
Conversation


@PGijsbers PGijsbers commented Nov 18, 2024

Change

This PR encompasses two changes:

To accommodate the latter change, there is also a script, `migrate_hf.py`, which checks the Hugging Face datasets already present in the database and (1) removes them if their identifier is no longer valid, or (2) updates their identifiers otherwise.

How to Test

Setting up an old database and verifying the wrong behavior

A bit of a doozy with so much going on. First, let's create a version of an "old" style database:

docker compose --profile "*" down
git checkout develop
./scripts/clean.sh
docker compose --env-file=.env --env-file=override.env --profile "huggingface-datasets" up -d

That will start populating the develop database with huggingface entries, old-style.
Keep track of progress with

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset where name in ('fka/awesome-chatgpt-prompts', 'PleIAs/common_corpus');"
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"

and make sure some (>10) datasets are uploaded, and the first query shows both the fka/awesome-chatgpt-prompts and PleIAs/common_corpus datasets, since we will use them later.
Also note that no AIoD triggers are created:

docker exec -it sqlserver mysql -uroot -pok --database=information_schema -e 'select TRIGGER_NAME from TRIGGERS;'

Let's stop harvesting (assuming you have enough datasets):

docker container stop huggingface-dataset-connector

To really make sure no triggers fire, let's delete a dataset. Because the trigger listens for hard deletes, and datasets are deleted "softly" by default, we briefly need to disable this behavior: change the expression in this `if` statement to always be `True`.
Then visit localhost, log in with the test user, and delete a dataset (pick any identifier >=10 to avoid deleting one of the two datasets we need later). The dataset will be deleted successfully, but its relations remain:

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"

Undo the change you made to the resource connector:

git checkout .

and shut down the services:

docker compose --profile "*" down

Verifying new behavior

Start the services again on the new branch (this may take a little longer):

git checkout fix/hf-id
docker compose --env-file=.env --env-file=override.env up -d 

Verify that no data has changed, but that delete triggers have been added:

docker exec -it sqlserver mysql -uroot -pok --database=information_schema -e 'select TRIGGER_NAME from TRIGGERS;'
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"  

(Since the database cannot retroactively apply the trigger, there will remain one fewer dataset than ai_asset entries.)

Now we modify some data to simulate the situation where a HF user has changed their account name (see also #385):

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "update dataset set platform_resource_identifier='NaiveDev/coco-2017-instance', name='NaiveDev/coco-2017-instance' where platform_resource_identifier='fka/awesome-chatgpt-prompts';
update dataset set platform_resource_identifier='aha-org/coco-2017-instance', name='aha-org/coco-2017-instance' where platform_resource_identifier='PleIAs/common_corpus';"

The user aha-org changed their name to NaiveDev, so we now modify two dataset entries to simulate an old and a new version of the indexed data. This is the behavior the HF connector would exhibit if it encountered a renamed dataset, since uniqueness is only enforced on the user/dataset name.
Let's run our migration script:

docker run -it -v $(pwd):/all --network='AIOD-rest-api_default' --entrypoint=/bin/bash aiod_metadata_catalogue -c 'python /all/scripts/migrate_hf.py'

and let's verify that our HF entries are updated accordingly:

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset where name in ('aha-org/coco-2017-instance', 'NaiveDev/coco-2017-instance');"
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset; select COUNT(*) from dataset;"

We see that the aha-org dataset was removed, along with its linked resources, and that the NaiveDev dataset was correctly assigned its new identifier. Moreover, the ai_resource and ai_asset entries related to the aha-org dataset are also deleted through their triggers. Poke around in the database some more to check everything is OK, if needed.
Now let's spin up the HF connector:

docker compose --env-file=.env --env-file=override.env --profile "huggingface-datasets" up -d

and after a while verify that new entries are associated with the correct platform_resource_identifier (i.e., a hex string).
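A quick way to distinguish a new-style identifier from an old-style `user/dataset` name is sketched below. The exact format is an assumption: `_id` values observed on the hub look like 24-character lowercase hexadecimal strings (MongoDB ObjectId style), but this is not a documented guarantee:

```python
import re

# Heuristic check: new-style HF `_id`s observed in practice are
# 24-character lowercase hex strings (MongoDB ObjectId style), while
# old-style identifiers look like "user/dataset". The regex encodes an
# assumption based on observed values, not a documented guarantee.
HEX_ID = re.compile(r"^[0-9a-f]{24}$")


def is_new_style(identifier: str) -> bool:
    return bool(HEX_ID.fullmatch(identifier))


print(is_new_style("621ffdd236468d709f182a80"))   # True
print(is_new_style("fka/awesome-chatgpt-prompts"))  # False
```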

Checklist

  • Tests have been added or updated to reflect the changes, or their absence is explicitly explained.
  • Documentation has been added or updated to reflect the changes, or their absence is explicitly explained.
  • A self-review has been conducted checking:
    • No unintended changes have been committed.
    • The changes in isolation seem reasonable.
    • Anything that may be odd or unintuitive is provided with a GitHub comment explaining it (but consider if this should not be a code comment or in the documentation instead).
  • All CI checks pass before pinging a reviewer, or provide an explanation if they do not.

Related Issues

The `id` of a dataset can change when a user changes their name or
the dataset's name. The `_id` is persistent, which means that we
can keep track of whether or not the dataset has already been
indexed. If any part of the name changes, the old `name` simply
redirects. In the worst case (a bare `_id` becoming invalid), we could
always reindex and match based on `_id`.
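The "reindex and match based on `_id`" fallback mentioned above can be sketched as follows. The names and data shapes here are illustrative, not the connector's actual code; the point is that pairing entries on the persistent `_id` lets renames update in place while vanished datasets drop out:

```python
# Sketch of "re-index and match based on `_id`": given previously
# indexed entries and freshly fetched results, pair them on the
# persistent `_id` so renames update in place and vanished datasets
# are dropped. Names and structure are illustrative only.

def reconcile(indexed: dict[str, str], fetched: dict[str, str]) -> dict[str, str]:
    """Both arguments map `_id` -> current `user/dataset` name."""
    updated = {}
    for _id in indexed:
        if _id in fetched:
            updated[_id] = fetched[_id]  # keep the entry, possibly renamed
        # else: dataset disappeared from the platform; drop it
    return updated


old = {"621ffdd236468d709f182a80": "aha-org/coco-2017-instance"}
new = {"621ffdd236468d709f182a80": "NaiveDev/coco-2017-instance"}
print(reconcile(old, new))
# → {'621ffdd236468d709f182a80': 'NaiveDev/coco-2017-instance'}
```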
@PGijsbers PGijsbers requested a review from Taniya-Das November 18, 2024 16:05

@Taniya-Das Taniya-Das left a comment


I have tested the PR thoroughly and it's working as expected. The code changes look good to me as well.

@PGijsbers

For posterity: no response in the HF forum w.r.t. the usage of `_id` (https://discuss.huggingface.co/t/are-dataset-id-safe-to-use/122309).

@PGijsbers PGijsbers merged commit 24407ee into develop Nov 20, 2024
1 check passed
@PGijsbers PGijsbers deleted the fix/hf-id branch November 20, 2024 08:21
