Create Cache class for exact, fuzzy, and semantic deduplication #384
Open
sarahyurick wants to merge 34 commits into NVIDIA-NeMo:main from sarahyurick:global_cache_dir
Changes from 30 commits
Commits
769e2ea  add global cache variable and use it for exact dedup (sarahyurick)
b77139c  global cache for semdedup (sarahyurick)
337cec8  run black and modify pytest (sarahyurick)
6d55d8c  update image notebook (sarahyurick)
622912b  Merge branch 'main' into global_cache_dir (sarahyurick)
4cb26d5  save fuzzy dedup progress (sarahyurick)
b001622  save progress (sarahyurick)
0c14626  update remaining docs (sarahyurick)
7486459  run black (sarahyurick)
053f312  Merge branch 'main' into global_cache_dir (sarahyurick)
1b1ba30  Merge branch 'main' into global_cache_dir (sarahyurick)
4b12651  Merge branch 'main' into global_cache_dir (sarahyurick)
4160471  Merge branch 'main' into global_cache_dir (sarahyurick)
8a22ace  Merge branch 'main' into global_cache_dir (sarahyurick)
5e9bef1  Merge branch 'main' into global_cache_dir (sarahyurick)
d823a0b  Merge remote-tracking branch 'upstream/main' into global_cache_dir (sarahyurick)
0890fb0  re-add get_cache_directory changes (sarahyurick)
8fd79fb  create Cache singleton class (sarahyurick)
0d7b969  update exact_dedup (sarahyurick)
2c1a435  add semdedup functionality with Cache (sarahyurick)
f0ff2ce  add semdedup_example script (sarahyurick)
a379893  Cache singleton option for fuzzy dedup (sarahyurick)
67f609c  run black (sarahyurick)
8693177  fix tutorials (sarahyurick)
c296cc7  Merge branch 'main' into global_cache_dir (sarahyurick)
510347c  Merge branch 'main' into global_cache_dir (sarahyurick)
0635ebf  run black (sarahyurick)
a229857  import assert_eq (sarahyurick)
30ec409  fix semdedup test (sarahyurick)
1a63468  Merge branch 'main' into global_cache_dir (sarahyurick)
2075588  Merge branch 'main' into global_cache_dir (sarahyurick)
a6c5de3  remove repeating param (sarahyurick)
b805ce9  Merge remote-tracking branch 'upstream/main' into global_cache_dir (sarahyurick)
2ee3547  fix semdedup test (sarahyurick)
@@ -0,0 +1,48 @@
# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_curator.utils.file_utils import expand_outdir_and_mkdir


class Cache:
    _instance = None
    _cache_dir = None

    def __new__(cls, cache_dir=None):
        if cls._instance is None:
            cls._instance = super(Cache, cls).__new__(cls)
            if cache_dir is not None:
                cls._cache_dir = expand_outdir_and_mkdir(cache_dir)
            else:
                cls._cache_dir = None
        elif cache_dir is not None and cls._cache_dir is None:
            cls._cache_dir = expand_outdir_and_mkdir(cache_dir)
        return cls._instance

    @classmethod
    def get_cache_directory(cls) -> str:
        """
        Retrieve the cache directory.
        """
        return cls._cache_dir

    @classmethod
    def delete_cache_instance(cls):
        """
        Reset the Cache singleton.
        """
        if cls._cache_dir is not None:
            cls._cache_dir = None

        cls._instance = None
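
The singleton behavior in this diff can be illustrated with a small standalone sketch. It is not part of the PR: the class body mirrors the diff, but `expand_outdir_and_mkdir_stub` is a hypothetical stand-in for `nemo_curator.utils.file_utils.expand_outdir_and_mkdir` (assumed here to expand the path, create the directory, and return it), and `demo` is an illustrative helper. The point it shows is that the first non-None `cache_dir` wins: a later `Cache(cache_dir=...)` call returns the same instance and leaves the directory unchanged until `delete_cache_instance()` is called.

```python
import os


def expand_outdir_and_mkdir_stub(outdir):
    # Hypothetical stand-in for the real helper (assumption: it expands
    # the path, creates the directory, and returns the expanded path).
    outdir = os.path.abspath(os.path.expanduser(outdir))
    os.makedirs(outdir, exist_ok=True)
    return outdir


class Cache:
    # Same logic as the class in the diff, using the stub above.
    _instance = None
    _cache_dir = None

    def __new__(cls, cache_dir=None):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            if cache_dir is not None:
                cls._cache_dir = expand_outdir_and_mkdir_stub(cache_dir)
            else:
                cls._cache_dir = None
        elif cache_dir is not None and cls._cache_dir is None:
            # A directory may still be set later if none was given at first.
            cls._cache_dir = expand_outdir_and_mkdir_stub(cache_dir)
        return cls._instance

    @classmethod
    def get_cache_directory(cls):
        return cls._cache_dir

    @classmethod
    def delete_cache_instance(cls):
        cls._cache_dir = None
        cls._instance = None


def demo(base_dir):
    # Illustrative helper: exercises the singleton and reports what happened.
    first = os.path.join(base_dir, "first")
    second = os.path.join(base_dir, "second")

    Cache.delete_cache_instance()      # start from a clean state
    a = Cache(cache_dir=first)         # creates the singleton and the directory
    b = Cache(cache_dir=second)        # same instance; "second" is ignored
    same_instance = a is b
    kept_dir = Cache.get_cache_directory()
    second_created = os.path.isdir(second)

    Cache.delete_cache_instance()      # reset so another directory can be set
    after_reset = Cache.get_cache_directory()
    return same_instance, kept_dir, second_created, after_reset
```

One consequence worth noting for reviewers: because a second `cache_dir` is silently ignored, callers that need a different directory have to call `delete_cache_instance()` first. Also, `get_cache_directory()` can return `None` when no directory was ever set, even though the diff annotates it as `-> str`.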