Fix/hf #392
The `id` of a dataset (of the form `user/dataset-name`) can change when a user renames their account or the dataset itself. The `_id`, in contrast, is persistent, which means we can use it to keep track of whether or not a dataset has already been indexed. If any part of the name changes, the old `name` simply redirects. In the worst case (a bare `_id` becoming invalid), we could always reindex and match based on `_id`.
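To illustrate the idea (a hypothetical sketch, not the connector's actual code): if the index is keyed on the persistent `_id`, a rename only changes the stored `id`, while a disappeared `_id` marks the entry for removal:

```python
def sync_index(indexed: dict[str, str], live: dict[str, str]) -> dict[str, str]:
    """Reconcile an index of {_id: id} with the live {_id: id} listing.

    Datasets whose `_id` disappeared are dropped; renamed datasets
    (same `_id`, new `id`) are updated in place.
    """
    return {hid: name for hid, name in live.items() if hid in indexed}

# A user renames "aha-org/data" to "NaiveDev/data"; the `_id` is stable.
indexed = {"656f1bf": "aha-org/data", "abc0001": "gone/removed"}
live = {"656f1bf": "NaiveDev/data"}
print(sync_index(indexed, live))  # {'656f1bf': 'NaiveDev/data'}
```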
This was referenced on Nov 19, 2024.

Taniya-Das (Collaborator) approved these changes on Nov 19, 2024 and left a comment:
I have tested the PR thoroughly and it's working as expected. The code changes look good to me as well.
The author (Contributor) added:

For posterity: no response in the HF forum regarding the usage of `_id` (https://discuss.huggingface.co/t/are-dataset-id-safe-to-use/122309).
Taniya-Das reviewed these changes on Nov 19, 2024.
Change
This PR encompasses two changes:
To accommodate the last change, there is also a script `migrate_hf.py`, which checks Hugging Face datasets already present in the database and (1) removes them if their identifier is no longer valid, and (2) updates their identifiers otherwise.

How to Test
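The migration decision can be sketched as follows (a hypothetical outline with made-up names; the real `migrate_hf.py` is the authority). The live lookup is injected so the logic stays testable without network access:

```python
from typing import Callable, Optional, Tuple

def migrate(stored: dict[str, str],
            lookup: Callable[[str], Optional[Tuple[str, str]]]) -> tuple[list[str], dict[str, str]]:
    """Decide what to do with each Hugging Face dataset already in the database.

    `stored` maps database ids to stored HF names; `lookup(name)` returns the
    live (`_id`, `id`) pair, or None if the identifier is no longer valid.
    Returns (database ids to remove, {db_id: new_name} updates).
    """
    to_remove: list[str] = []
    to_update: dict[str, str] = {}
    for db_id, name in stored.items():
        live = lookup(name)
        if live is None:
            to_remove.append(db_id)      # 1. identifier no longer valid: remove
        elif live[1] != name:
            to_update[db_id] = live[1]   # 2. renamed upstream: update identifier
    return to_remove, to_update
```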
Setting up an old database and verifying the wrong behavior
A bit of a doozy with so much going on. First, let's create a version of an "old" style database:
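The original command block was not preserved here. A plausible shape, assuming the project's Docker Compose setup and that the pre-fix code is on the main development branch (all names are guesses):

```shell
git checkout develop    # pre-fix code, i.e. the "old" behavior
docker compose up -d    # start the services (plus whatever flag/profile enables the HF connector)
```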
That will start populating the `develop` database with Hugging Face entries, old-style. Keep track of progress with
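a progress check along these lines (table and column names are assumptions about the catalogue's MySQL schema; `platform_resource_identifier` is referenced later in this PR):

```sql
-- Hypothetical schema: count harvested HF datasets, then list the first few.
SELECT COUNT(*) FROM dataset WHERE platform = 'huggingface';
SELECT identifier, name, platform_resource_identifier
  FROM dataset WHERE platform = 'huggingface' LIMIT 15;
```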
and make sure some (>10) datasets are uploaded, and that the first query shows both the `fka/awesome-chatgpt-prompts` and `PleIAs/common_corpus` datasets, since we will use them later. Also note that no AIoD triggers are created:
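Assuming a MySQL backend, something like:

```sql
SHOW TRIGGERS;  -- expect an empty result on the old-style database
```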
Let's stop harvesting (assuming you have enough datasets):
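For example (the service name is a guess):

```shell
docker compose stop huggingface-dataset-connector
```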
To really make sure no triggers fire, let's delete a dataset. Because the trigger listens for hard deletes, and datasets are deleted "softly" by default, we briefly need to disable this behavior. Change the expression in this `if` statement to always be `True`. Then visit localhost, log in with the test user, and delete a dataset (pick any identifier >= 10 to avoid deleting one of the two datasets we need later). The dataset will be deleted successfully, but its relations remain:
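The leftovers can be inspected with something like (hypothetical table names):

```sql
-- The dataset row is gone, but its related AI asset / AI resource rows remain.
SELECT COUNT(*) FROM dataset;
SELECT COUNT(*) FROM ai_asset;
```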
Undo the change you made to the resource connector:
and shut down the services:
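For example:

```shell
git checkout -- .     # assuming the if-statement edit is your only local change
docker compose down
```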
Verifying new behavior
Start the services again on the new branch (this may take a little longer):
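Something along these lines (the branch name is a guess based on the PR title):

```shell
git checkout fix/hf
docker compose up -d --build
```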
verify no data has changed, but delete triggers have been added:
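For example:

```sql
SHOW TRIGGERS;                -- delete triggers should now be present
SELECT COUNT(*) FROM dataset; -- hypothetical check that row counts are unchanged
```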
(Since the database cannot retroactively apply the trigger, one fewer dataset than AI asset will remain.)
Now we modify some data to simulate the situation where an HF user has changed their account name (see also #385):
The user `aha-org` changed their name to `NaiveDev`, so we now modify two dataset entries to simulate an old and a new version of the indexed data. This is the behavior the HF connector would exhibit if it encountered a renamed dataset, since uniqueness is only enforced on `user/dataset`. Let's run our migration script:
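A plausible invocation (the actual path and required environment may differ):

```shell
python migrate_hf.py
```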
and let's verify that our HF entries are updated accordingly:
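For example (hypothetical schema):

```sql
SELECT identifier, name, platform_resource_identifier
  FROM dataset
 WHERE name LIKE 'aha-org/%' OR name LIKE 'NaiveDev/%';
```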
We see the `aha-org` dataset was removed, along with its linked resources, and the `NaiveDev` dataset has its new id assigned correctly. Moreover, the AI resource and AI asset entries related to the `aha-org` dataset are also deleted through their triggers. Poke around in the database some more to check everything is OK, if needed. Now let's spin up the HF connector:
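For example (service name is a guess):

```shell
docker compose up -d huggingface-dataset-connector
```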
and after a while verify that new entries have the correct `platform_resource_identifier` associated (i.e. a hex string).

Checklist
Related Issues