
Conversation


@PGijsbers PGijsbers commented Nov 18, 2024

Change

This PR encompasses two changes:

To accommodate the latter change, there is also a script, `migrate_hf.py`, which checks the Hugging Face datasets already present in the database and (1) removes them if their identifier is no longer valid, or (2) updates their identifiers otherwise.

How to Test

Setting up an old database and verifying the wrong behavior

A bit of a doozy with so much going on. First, let's create a version of an "old" style database:

docker compose --profile "*" down
git checkout develop
./scripts/clean.sh
docker compose --env-file=.env --env-file=override.env --profile "huggingface-datasets" up -d

That will start populating the develop database with huggingface entries, old-style.
Keep track of progress with

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset where name in ('fka/awesome-chatgpt-prompts', 'PleIAs/common_corpus');"
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"

and make sure some (>10) datasets are uploaded, and the first query shows both the fka/awesome-chatgpt-prompts and PleIAs/common_corpus datasets, since we will use them later.
Also note that no AIoD triggers are created:

docker exec -it sqlserver mysql -uroot -pok --database=information_schema -e 'select TRIGGER_NAME from TRIGGERS;'

Let's stop harvesting (assuming you have enough datasets):

docker container stop huggingface-dataset-connector

To really make sure no triggers fire, let's delete a dataset. Because the trigger listens for hard deletes, and datasets are deleted "softly" by default, we briefly need to disable this behavior: change the expression in this `if` statement to always be `True`.
Then visit localhost, log in with the test user, and delete a dataset (pick any identifier >=10 to avoid deleting one of the two datasets we need later). The dataset will be deleted successfully, but its relations remain:

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"

Undo the change you made to the resource connector:

git checkout .

and shut down the services:

docker compose --profile "*" down

Verifying new behavior

Start the services again on the new branch (this may take a little longer):

git checkout fix/hf-id
docker compose --env-file=.env --env-file=override.env up -d 

Verify that no data has changed, but that delete triggers have been added:

docker exec -it sqlserver mysql -uroot -pok --database=information_schema -e 'select TRIGGER_NAME from TRIGGERS;'
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select COUNT(*) from dataset; select COUNT(*) from ai_asset; select count(*) from ai_resource;"  

(Since the database cannot retroactively apply the trigger, there will remain one fewer dataset than ai_asset entries.)

Now we modify some data to simulate the situation where a HF user has changed their account name (see also #385):

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "update dataset set platform_resource_identifier='NaiveDev/coco-2017-instance', name='NaiveDev/coco-2017-instance' where platform_resource_identifier='fka/awesome-chatgpt-prompts';
update dataset set platform_resource_identifier='aha-org/coco-2017-instance', name='aha-org/coco-2017-instance' where platform_resource_identifier='PleIAs/common_corpus';"

The user aha-org changed their name to NaiveDev, so we now modify two dataset entries to simulate an old and a new version of the indexed data. This is the behavior the HF connector would exhibit if it encountered a renamed dataset, since uniqueness is only enforced on the user/dataset name.
Let's run our migration script:

docker run -it -v $(pwd):/all --network='AIOD-rest-api_default' --entrypoint=/bin/bash aiod_metadata_catalogue -c 'python /all/scripts/migrate_hf.py'

and let's verify that our HF entries are updated accordingly:

docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset where name in ('aha-org/coco-2017-instance', 'NaiveDev/coco-2017-instance');"
docker exec -it sqlserver mysql -uroot -pok --database=aiod -e "select platform_resource_identifier, name from dataset; select COUNT(*) from dataset;"

We see that the aha-org dataset was removed, along with its linked resources, and that the NaiveDev dataset was correctly assigned its new identifier. Moreover, the ai_resource and ai_asset entries related to the aha-org dataset are also deleted through their triggers. Poke around in the database some more to check everything is OK, if needed.
Now let's spin up the HF connector:

docker compose --env-file=.env --env-file=override.env --profile "huggingface-datasets" up -d

and after a while verify that new entries are associated with the correct platform_resource_identifier (i.e., a hex string).
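A quick way to distinguish a new-style identifier from an old-style `user/dataset` name is sketched below. The exact format is an assumption: `_id` values observed on the hub look like 24-character lowercase hexadecimal strings (MongoDB ObjectId style), but this is not a documented guarantee:

```python
import re

# Heuristic check: new-style HF `_id`s observed in practice are
# 24-character lowercase hex strings (MongoDB ObjectId style), while
# old-style identifiers look like "user/dataset". The regex encodes an
# assumption based on observed values, not a documented guarantee.
HEX_ID = re.compile(r"^[0-9a-f]{24}$")


def is_new_style(identifier: str) -> bool:
    return bool(HEX_ID.fullmatch(identifier))


print(is_new_style("621ffdd236468d709f182a80"))   # True
print(is_new_style("fka/awesome-chatgpt-prompts"))  # False
```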

Checklist

  • Tests have been added or updated to reflect the changes, or their absence is explicitly explained.
  • Documentation has been added or updated to reflect the changes, or their absence is explicitly explained.
  • A self-review has been conducted checking:
    • No unintended changes have been committed.
    • The changes in isolation seem reasonable.
    • Anything that may be odd or unintuitive is provided with a GitHub comment explaining it (but consider if this should not be a code comment or in the documentation instead).
  • All CI checks pass before pinging a reviewer, or provide an explanation if they do not.

Related Issues

The `id` of a dataset can change when a user changes their name or
the dataset's name. The `_id` is persistent, which means that we
can keep track of whether or not the dataset has already been
indexed. If any part of the name changes, the old `name` simply
redirects. In the worst case (a bare `_id` becoming invalid), we could
always reindex and match based on `_id`.
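The "reindex and match based on `_id`" fallback mentioned above can be sketched as follows. The names and data shapes here are illustrative, not the connector's actual code; the point is that pairing entries on the persistent `_id` lets renames update in place while vanished datasets drop out:

```python
# Sketch of "re-index and match based on `_id`": given previously
# indexed entries and freshly fetched results, pair them on the
# persistent `_id` so renames update in place and vanished datasets
# are dropped. Names and structure are illustrative only.

def reconcile(indexed: dict[str, str], fetched: dict[str, str]) -> dict[str, str]:
    """Both arguments map `_id` -> current `user/dataset` name."""
    updated = {}
    for _id in indexed:
        if _id in fetched:
            updated[_id] = fetched[_id]  # keep the entry, possibly renamed
        # else: dataset disappeared from the platform; drop it
    return updated


old = {"621ffdd236468d709f182a80": "aha-org/coco-2017-instance"}
new = {"621ffdd236468d709f182a80": "NaiveDev/coco-2017-instance"}
print(reconcile(old, new))
# → {'621ffdd236468d709f182a80': 'NaiveDev/coco-2017-instance'}
```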
@PGijsbers PGijsbers requested a review from Taniya-Das November 18, 2024 16:05

@Taniya-Das Taniya-Das left a comment


I have tested the PR thoroughly and it's working as expected. The code changes look good to me as well.

@PGijsbers

For posterity: no response in the HF forum w.r.t. the usage of `_id` (https://discuss.huggingface.co/t/are-dataset-id-safe-to-use/122309).

@PGijsbers PGijsbers merged commit 24407ee into develop Nov 20, 2024
1 check passed
@PGijsbers PGijsbers deleted the fix/hf-id branch November 20, 2024 08:21
