Commit 4e48d88

chore: removed all gretel mentions when appropriate (#219)
# Summary

All "gretel" mentions in `src/` have been removed or replaced with product-appropriate alternatives ("nss" for Nemo Safe Synthesizer). External dataset URLs referencing `gretelai/gretel-blueprints` are kept as-is since they point to a public data source.

## Code Identifiers Renamed

| File | Old | New |
|---|---|---|
| `records/fragment.py` | `gretel_id` (attribute + params) | `record_id` |
| `records/fragment.py` | `gretel_fragment_ts` | `fragment_ts` |
| `records/fragment.py` | `gretel_fragment_epoch` | `fragment_epoch` |
| `records/fragment.py` | `gretel_fragment_datetime` | `fragment_datetime` |
| `records/json_record.py` | `remove_gretel_array_markers()` | `remove_array_markers()` |
| `records/base.py` | `"_gretelarray_"` | `"_nssarray_"` |
| `actions/utils.py` | `"__gretel__idx"` | `"__nss__idx"` |
| `actions/utils.py` | `"__gretel_reject_reason"` | `"__nss_reject_reason"` |

## Constants Renamed (`pii_replacer/ner/const.py`)

| Old Key / Value | New Key / Value |
|---|---|
| `GRETEL_ID` / `_gretel_id` | `NSS_ID` / `_nss_id` |
| `GRETEL_TS` / `_gretel_ts` | `NSS_TS` / `_nss_ts` |
| `GRETEL_SUB` / `_gretel_subscriber` | `NSS_SUB` / `_nss_subscriber` |
| `ARRAY_POS` / `_gretelarray_` | `ARRAY_POS` / `_nssarray_` |

## Environment Variables Renamed (`pii_replacer/ner/models.py`)

| Old | New |
|---|---|
| `GRETEL_OPT_BUCKET` | `NSS_OPT_BUCKET` |
| `gretel-opt-dev-use2` (default bucket) | `nss-opt-dev-use2` |
| `GRETEL_OPT_CACHE_DIR` | `NSS_OPT_CACHE_DIR` |

## Comments / Documentation Cleaned Up

| File | Change |
|---|---|
| `src/.../record_utils.py` | "gretel client" → "nss client" |
| `src/.../pii_replacer/ner/models.py` | Removed "gretel" from docstrings about visibility and cache management |
| `src/.../pii_replacer/ner/regexes/age.py` | Replaced internal GitHub issue URL with descriptive comment |
| `src/.../pii_replacer/ner/regexes/sex_gender.py` | Replaced internal GitHub issue URL with descriptive comment |
| `src/.../artifacts/analyzers/field_features.py` | Removed internal GitHub repo URL |
| `src/.../records/fragment.py` | Error message: "different gretel records" → "different records" |
| `tests/evaluation/conftest.py` | "gretel core arfifact classifier" → "core artifact classifier" |

## Kept As-Is

- External dataset URLs referencing `gretelai/gretel-blueprints` (third-party public data source)
- Historical tag references in test comments (`gretel-2025-02-25 tag`)

## Migration Notes

- **Environment variables**: Anyone previously setting `GRETEL_OPT_BUCKET` or `GRETEL_OPT_CACHE_DIR` must update to `NSS_OPT_BUCKET` / `NSS_OPT_CACHE_DIR`.

## Pre-Review Checklist

Ensure that the following pass:

- [x] `make format && make check` or via prek validation.
- [x] `make test` passes locally
- [x] `make test-e2e` passes locally
- [ ] `make test-ci-container` passes locally (recommended)

## Pre-Merge Checklist

- [ ] New or updated tests for any fix or new behavior
- [ ] Updated documentation for new features and behaviors, including docstrings for API docs.

## Other Notes

- Closes #218

---------

Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: seayang-nv <seayang@nvidia.com>
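The environment-variable rename in the Migration Notes is a hard cut: after this commit the code reads only the `NSS_*` names. For deployments that cannot update every environment at once, a caller-side bridge along the following lines may help. This is a minimal sketch, not part of the commit; the helper `getenv_with_legacy` is hypothetical.

```python
import os
import warnings


def getenv_with_legacy(new_name: str, legacy_name: str, default: str) -> str:
    """Hypothetical helper: prefer the new variable, fall back to the legacy one."""
    value = os.environ.get(new_name)
    if value is not None:
        return value
    legacy_value = os.environ.get(legacy_name)
    if legacy_value is not None:
        # Warn so that stale configuration is noticed and cleaned up.
        warnings.warn(
            f"{legacy_name} is deprecated; set {new_name} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy_value
    return default


# Resolve the bucket the way a deployment wrapper might, then export the new name.
bucket = getenv_with_legacy("NSS_OPT_BUCKET", "GRETEL_OPT_BUCKET", "nss-opt-dev-use2")
```

A wrapper script could then set `NSS_OPT_BUCKET` from `bucket` before importing the package, which keeps the legacy name out of the library itself.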
1 parent 386004f commit 4e48d88

10 files changed

Lines changed: 40 additions & 42 deletions

src/nemo_safe_synthesizer/artifacts/analyzers/field_features.py

Lines changed: 0 additions & 1 deletion
@@ -140,7 +140,6 @@ def describe_field(field_name: str, data: Series) -> FieldFeatures:
 return features

 # See
-# - https://github.com/Gretellabs/text_research/blob/main/column_detector.py
 # - https://jeffreymorgan.io/articles/identifying-categorical-data/
 diff = non_na_count - unique_count
 diff_percent = diff / non_na_count
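
The last two context lines of this hunk measure what share of the non-null values are repeats. A minimal sketch of that calculation follows, assuming a plain list as input and a hypothetical `repeat_ratio` helper; the real `describe_field` operates on a pandas `Series`, and the thresholds it applies afterwards are not shown in this diff.

```python
def repeat_ratio(values: list) -> float:
    """Hypothetical helper: fraction of non-null values that repeat an earlier value."""
    non_na = [v for v in values if v is not None]
    non_na_count = len(non_na)
    if non_na_count == 0:
        # Avoid dividing by zero for an all-null column.
        return 0.0
    unique_count = len(set(non_na))
    diff = non_na_count - unique_count
    diff_percent = diff / non_na_count
    return diff_percent


# A column with few distinct values yields a high ratio, hinting at categorical data.
ratio = repeat_ratio(["foo", "bar", "foo", "foo", None])
```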

src/nemo_safe_synthesizer/data_processing/record_utils.py

Lines changed: 1 addition & 1 deletion
@@ -387,7 +387,7 @@ def normalize_dataframe(dataframe: pd.DataFrame) -> pd.DataFrame:
 """
 # HACK: Handle NaN/None/NA values with mixed types by
 # normalizing through pandas csv io format, which will match
-# the format in reports generated via the gretel client.
+# the format in reports generated via the nss client.
 try:
 # try without trying to resolve utf-8 issues first
 return pd.read_csv(StringIO(dataframe.to_csv(index=False, quoting=QUOTE_NONNUMERIC)))

src/nemo_safe_synthesizer/data_processing/records/base.py

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@
 BOOL = "boolean"
 NUMBER = "number"
 NULL = "null"
-ARRAY_POS = "_gretelarray_"
+ARRAY_POS = "_nssarray_"
 NESTING_DELIM = "*#N#*"
 DELIM = "."

src/nemo_safe_synthesizer/data_processing/records/fragment.py

Lines changed: 22 additions & 23 deletions
@@ -40,8 +40,7 @@ class Metadata:
 field_name -> fragment_name -> metadata_type -> [metadata_items]
 """

-gretel_id: str
-"""Unique identifier for the source record."""
+record_id: str

 fields: dict
 """Nested dict of per-field, per-fragment metadata."""
@@ -65,24 +64,24 @@ class MetadataFragment:
 ``Metadata`` object per record.

 Args:
-gretel_id: Unique identifier for the source record.
-gretel_fragment_ts: ISO-8601 timestamp string.
-gretel_fragment_epoch: Unix epoch of the fragment creation.
+record_id: Unique identifier for the source record.
+fragment_ts: ISO-8601 timestamp string.
+fragment_epoch: Unix epoch of the fragment creation.
 fragment_name: Identifier for this annotation pass (e.g., ``"ner"``).
 """

-gretel_id: str
-gretel_fragment_ts: str
-gretel_fragment_epoch: float
+record_id: str
+fragment_ts: str
+fragment_epoch: float
 fragment_name: str

 def __post_init__(self):
 self.fields = defaultdict(lambda: defaultdict(list))

 @property
-def gretel_fragment_datetime(self) -> datetime:
+def fragment_datetime(self) -> datetime:
 """Fragment creation time as a ``datetime`` object."""
-return datetime.fromtimestamp(self.gretel_fragment_epoch)
+return datetime.fromtimestamp(self.fragment_epoch)

 def add_field_data(self, field_name: str, metadata_type: str, field_data: dict | list):
 """Append metadata entries for a field.
@@ -121,30 +120,30 @@ def merge_fragments(*fragments, ts: str | None = None) -> Metadata:
 Raises:
 MetadataError: If the fragments have different ``gretel_id`` values.
 """
-if len(set([fragment.gretel_id for fragment in fragments])) != 1:
-raise MetadataError("cannot merge fragments from different gretel records")
+if len(set([fragment.record_id for fragment in fragments])) != 1:
+raise MetadataError("cannot merge fragments from different records")
 else:
-gretel_id = fragments[0].gretel_id
+record_id = fragments[0].record_id

 # todo(dn): there might be a better way to build up this object
 merged_fragment = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
-ts = ts or min([f.gretel_fragment_datetime for f in fragments]).isoformat() + "Z"
+ts = ts or min([f.fragment_datetime for f in fragments]).isoformat() + "Z"
 fragment: MetadataFragment
 for fragment in fragments:
 for field_name, field_data in fragment.fields.items():
 for meta_type, meta_data in field_data.items():
 merged_fragment[field_name][fragment.fragment_name][meta_type].extend(meta_data)
-return Metadata(gretel_id=gretel_id, fields=merged_fragment, received_at=ts, entities={})
+return Metadata(record_id=record_id, fields=merged_fragment, received_at=ts, entities={})


-def fragment_for_record(gretel_id: str, fragment_name: str) -> MetadataFragment:
+def fragment_for_record(record_id: str, fragment_name: str) -> MetadataFragment:
 """Create a new ``MetadataFragment`` timestamped to the current time."""
 epoch = time.time()
 ts = datetime.fromtimestamp(epoch).isoformat() + "Z"
 return MetadataFragment(
-gretel_id=gretel_id,
-gretel_fragment_epoch=epoch,
-gretel_fragment_ts=ts,
+record_id=record_id,
+fragment_epoch=epoch,
+fragment_ts=ts,
 fragment_name=fragment_name,
 )

@@ -216,7 +215,7 @@ def predictions_to_dict(
 def fragment_from_ner_predictions(
 fragment_name: str,
 predictions: list[NERPrediction],
-gretel_id: str,
+record_id: str,
 ) -> tuple[MetadataFragment, dict]:
 """Build a ``MetadataFragment`` and entity map from NER predictions.
@@ -230,9 +229,9 @@ def fragment_from_ner_predictions(
 """
 epoch = time.time()
 fragment = MetadataFragment(
-gretel_id=gretel_id,
-gretel_fragment_ts=datetime.fromtimestamp(epoch).isoformat() + "Z",
-gretel_fragment_epoch=epoch,
+record_id=record_id,
+fragment_ts=datetime.fromtimestamp(epoch).isoformat() + "Z",
+fragment_epoch=epoch,
 fragment_name=fragment_name,
 )
 preds_by_field, ent_map = predictions_to_dict(predictions)
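
Callers that construct fragments by keyword need to switch to the new argument names. The following sketch shows the post-rename call shape using a simplified stand-in for `MetadataFragment`, reduced to the fields visible in the hunks above. The `@dataclass` declaration is assumed from the `__post_init__` hook, and the real class in `records/fragment.py` has more behavior than is reproduced here.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MetadataFragment:
    """Simplified stand-in for the class in records/fragment.py (assumption)."""

    record_id: str
    fragment_ts: str
    fragment_epoch: float
    fragment_name: str
    fields: dict = field(init=False)

    def __post_init__(self):
        # field_name -> metadata_type -> [metadata_items]
        self.fields = defaultdict(lambda: defaultdict(list))

    @property
    def fragment_datetime(self) -> datetime:
        """Fragment creation time as a ``datetime`` object."""
        return datetime.fromtimestamp(self.fragment_epoch)


def fragment_for_record(record_id: str, fragment_name: str) -> MetadataFragment:
    """Create a fragment timestamped to the current time, mirroring the diff."""
    epoch = time.time()
    ts = datetime.fromtimestamp(epoch).isoformat() + "Z"
    return MetadataFragment(
        record_id=record_id,
        fragment_epoch=epoch,
        fragment_ts=ts,
        fragment_name=fragment_name,
    )


# "rec-001" is an illustrative identifier, not taken from the repository.
fragment = fragment_for_record("rec-001", "ner")
print(fragment.record_id, fragment.fragment_datetime.isoformat())
```

Passing the old `gretel_id=` keyword to this constructor raises a `TypeError`, which is the failure mode to expect from call sites that were not updated.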

src/nemo_safe_synthesizer/data_processing/records/json_record.py

Lines changed: 2 additions & 2 deletions
@@ -66,7 +66,7 @@ def unpack_level(parent_key, parent_val):
 return raw


-def remove_gretel_array_markers(data: str) -> tuple[str, int, base.ValuePath]:
+def remove_array_markers(data: str) -> tuple[str, int, base.ValuePath]:
 """Strip array-position markers from a composite key and build a ``ValuePath``.

 Returns:
@@ -91,7 +91,7 @@ def convert_flat_dict_to_kv_pairs(data: dict) -> list[base.KVPair]:
 out = []
 for k, v in data.items():
 k = str(k)
-new_key, array_count, value_path = remove_gretel_array_markers(k)
+new_key, array_count, value_path = remove_array_markers(k)
 flat = base.KVPair(new_key, v, base.get_type_as_string(v), array_count, value_path)
 out.append(flat)
 return out

src/nemo_safe_synthesizer/pii_replacer/ner/const.py

Lines changed: 4 additions & 4 deletions
@@ -14,9 +14,9 @@ def __getattr__(self, key):

 const = ConstDict(
 {
-"GRETEL_ID": "_gretel_id",
-"GRETEL_TS": "_gretel_ts",
-"GRETEL_SUB": "_gretel_subscriber",
+"NSS_ID": "_nss_id",
+"NSS_TS": "_nss_ts",
+"NSS_SUB": "_nss_subscriber",
 "TYPE": "type",
 "STR_VALUE": "string_value",
 "INT_VALUE": "int_value",
@@ -48,7 +48,7 @@ def __getattr__(self, key):
 "TRUE": "true",
 "FALSE": "false",
 "NULL": "null",
-"ARRAY_POS": "_gretelarray_",
+"ARRAY_POS": "_nssarray_",
 "BOOL": "boolean",
 "DELIM": ".",
 "ENTITY": "entity",

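Because both the keys and the values in `const` changed, code that reads `const.GRETEL_ID` and data that was serialized with the `_gretel_id` field name are both affected. The body of `ConstDict` lies outside these hunks, so the class below is an assumed stand-in (a dict with attribute access), included only to illustrate the lookup behavior after the rename.

```python
class ConstDict(dict):
    """Assumed stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as exc:
            # Surface missing constants as AttributeError, as attribute access normally does.
            raise AttributeError(key) from exc


const = ConstDict(
    {
        "NSS_ID": "_nss_id",
        "NSS_TS": "_nss_ts",
        "NSS_SUB": "_nss_subscriber",
        "ARRAY_POS": "_nssarray_",
    }
)

# Records built after the rename carry the new field names (values are illustrative).
record = {const.NSS_ID: "rec-001", const.NSS_TS: "2025-01-01T00:00:00Z"}
```

With this stand-in, `const.GRETEL_ID` raises `AttributeError`, so stale references fail loudly at the point of use.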
src/nemo_safe_synthesizer/pii_replacer/ner/models.py

Lines changed: 7 additions & 7 deletions
@@ -15,11 +15,11 @@

 logger = get_logger(__name__)

-DEFAULT_BUCKET = os.getenv("GRETEL_OPT_BUCKET", "gretel-opt-dev-use2")
+DEFAULT_BUCKET = os.getenv("NSS_OPT_BUCKET", "nss-opt-dev-use2")
 """Default bucket environment variable. If it's not found, use dev."""

-DEFAULT_CACHE_DIR = os.getenv("GRETEL_OPT_CACHE_DIR", ".optcache")
-"""``StorageConfig`` default cache directory. Searches the GRETEL_OPT_CACHE_DIR environment
+DEFAULT_CACHE_DIR = os.getenv("NSS_OPT_CACHE_DIR", ".optcache")
+"""``StorageConfig`` default cache directory. Searches the NSS_OPT_CACHE_DIR environment
 for a cache directory. By default it will fallback to `.optcache`. If this is set to
 ``disabled`` files wont be cached to disk.
 """
@@ -32,10 +32,10 @@ class Visibility(Enum):
 """Packages that are open to the public with a public-read ACL"""

 PRIVATE = "priv"
-"""Packages available to customers via "paywall" behind gretel api key"""
+"""Packages available to customers via "paywall" behind an api key"""

 INTERNAL = "int"
-"""Only available from internal gretel infrastructure."""
+"""Only available from internal infrastructure."""


 @dataclass
@@ -95,7 +95,7 @@ class StorageConfig:
 def from_system(cls) -> StorageConfig:
 """Return a default ``StorageConfig`` based on a system's environment variables.

-By convention, it looks for the environment variable ``GRETEL_OPT_BUCKET`` as
+By convention, it looks for the environment variable ``NSS_OPT_BUCKET`` as
 the bucket location. The default settings from this function are appropriate
 for development without additional configuration.
 """
@@ -113,7 +113,7 @@ def get_cache_manager(storage_config: StorageConfig = None) -> CacheManager:


 class CacheManager:
-"""Handles downloading model files from the gretel "opt" package repo.
+"""Handles downloading model files from the "opt" package repo.

 This class will also optionally cache these files to disk. This is useful for
 environments with local persistent state such as a local development laptop.

src/nemo_safe_synthesizer/pii_replacer/ner/regexes/age.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 from ..entity import Entity
 from ..regex import Pattern, RegexPredictor

-# https://github.com/Gretellabs/monogretel/issues/190
+# Relevant issue: detection should support descriptive age terms
 HEADERS = ["age", "ages"]


src/nemo_safe_synthesizer/pii_replacer/ner/regexes/sex_gender.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 create_exact_field_matcher,
 )

-# https://github.com/Gretellabs/monogretel/issues/190
+# Relevant issue: detection should support sex/gender header patterns
 SEX_HEADERS = [
 create_exact_field_matcher("sex"),
 create_exact_field_matcher("sexo"),

tests/evaluation/conftest.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ def make_df(seed: int, n: int = 100):
 {
 "num": [random.random() for _ in range(n)],
 "num_Int64": [random.randint(1, 100) for _ in range(n)],
-# Categorical columns according to gretel core arfifact classifier
+# Categorical columns according to core artifact classifier
 "num_cat": [random.randint(1, 4) for _ in range(n)],
 "num_cat_Int64": [random.randint(1, 4) for _ in range(n)],
 "small_cat": [random.choice(["foo", "bar", "baz", "biff", "barf"]) for _ in range(n)],
