Description
When running the example notebook sdgx_example_ctgan.ipynb, I ran into an error at the fit step.
Reproduce
Follow the example notebook at https://github.com/hitsz-ids/synthetic-data-generator/blob/main/example/sdgx_example_ctgan.ipynb
All cells run fine until the synthesizer.fit() step, which raises the TypeError shown under "Error message".
Expected behavior
Context
- Operating System and version: macOS 14.7.1
- Which version are you using: 0.2.4
Error message
{
"name": "TypeError",
"message": "Could not convert string '2015-06-012010-10-012016-08-012013-05-012017-04-012016-08-012015-07-012016-07-012012-08-012017-02-012016-11-012015-04-012015-03-012015-08-012017-04-012015-08-012015-11-012016-10-012016-11-012016-01-012016-02-012015-03-012014-06-012017-08-012014-05-012015-08-012011-05-012016-09-012012-10-012015-01-012016-06-012015-08-012013-03-012016-03-012018-06-012017-11-012018-03-012011-10-012016-07-012014-07-012016-04-012013-05-012016-02-012015-03-012014-09-012015-09-012015-04-012016-01-012013-12-012014-10-012017-05-012016-06-012016-07-012016-01-012017-08-012016-03-012018-09-012015-11-012015-03-012015-02-012017-08-012016-07-012016-01-01......' to numeric",
"stack": "---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[13], line 2
1 # Fit the model
----> 2 synthesizer.fit()
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/synthesizer.py:327, in Synthesizer.fit(self, metadata, inspector_max_chunk, metadata_include_inspectors, metadata_exclude_inspectors, inspector_init_kwargs, model_fit_kwargs)
325 try:
326 logger.info("Model fit Started...")
--> 327 self.model.fit(metadata, processed_dataloader, **(model_fit_kwargs or {}))
328 logger.info("Model fit... Finished")
329 finally:
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/ml/single_table/ctgan.py:220, in CTGANSynthesizerModel.fit(self, metadata, dataloader, epochs, *args, **kwargs)
218 if epochs is not None:
219 self._epochs = epochs
--> 220 self._pre_fit(dataloader, discrete_columns, metadata)
221 if self.fit_data_empty:
222 logger.info("CTGAN fit finished because of empty df detected.")
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/ml/single_table/ctgan.py:242, in CTGANSynthesizerModel._pre_fit(self, dataloader, discrete_columns, metadata)
240 self._transformer = DataTransformer(metadata=metadata)
241 logger.info("Fitting model's transformer...")
--> 242 self._transformer.fit(dataloader, discrete_columns)
243 logger.info("Transforming data...")
244 self._ndarry_loader = self._transformer.transform(dataloader)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/optimize/sdv_ctgan/data_transformer.py:184, in DataTransformer.fit(self, data_loader, discrete_columns)
182 else:
183 logger.debug(f"Fitting continuous column {column_name}...")
--> 184 column_transform_info = self._fit_continuous(data_loader[[column_name]])
186 self.output_info_list.append(column_transform_info.output_info)
187 self.output_dimensions += column_transform_info.output_dimensions
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/optimize/sdv_ctgan/data_transformer.py:98, in DataTransformer._fit_continuous(self, data)
96 column_name = data.columns[0]
97 gm = ClusterBasedNormalizer(model_missing_values=True, max_clusters=min(len(data), 10))
---> 98 gm.fit(data, column_name)
99 num_components = sum(gm.valid_component_indicator)
101 return ColumnTransformInfo(
102 column_name=column_name,
103 column_type="continuous",
(...)
106 output_dimensions=1 + num_components,
107 )
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/base.py:241, in BaseTransformer.fit(self, data, column)
238 self._store_columns(column, data)
240 columns_data = self._get_columns_data(data, self.columns)
--> 241 self._fit(columns_data)
243 self._build_output_columns(data)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/numerical.py:479, in ClusterBasedNormalizer._fit(self, data)
466 """Fit the transformer to the data.
467
468 Args:
469 data (pandas.Series):
470 Data to fit to.
471 """
472 self._bgm_transformer = BayesianGaussianMixture(
473 n_components=self.max_clusters,
474 weight_concentration_prior_type="dirichlet_process",
475 weight_concentration_prior=0.001,
476 n_init=1,
477 )
--> 479 super()._fit(data)
480 data = super()._transform(data)
481 if data.ndim > 1:
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/numerical.py:176, in FloatFormatter._fit(self, data)
171 self._rounding_digits = self._learn_rounding_digits(data)
173 self.null_transformer = NullTransformer(
174 self.missing_value_replacement, self.model_missing_values
175 )
--> 176 self.null_transformer.fit(data)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/null.py:79, in NullTransformer.fit(self, data)
76 null_values = data.isna().to_numpy()
77 self.nulls = null_values.any()
---> 79 self._missing_value_replacement = self._get_missing_value_replacement(data)
80 if not self.nulls and self._model_missing_values:
81 self._model_missing_values = False
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/sdgx/models/components/sdv_rdt/transformers/null.py:60, in NullTransformer._get_missing_value_replacement(self, data)
57 return None
59 if self._missing_value_replacement == "mean":
---> 60 return data.mean()
62 if self._missing_value_replacement == "mode":
63 return data.mode(dropna=True)[0]
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/series.py:6549, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
6541 @doc(make_doc("mean", ndim=1))
6542 def mean(
6543 self,
(...)
6547 **kwargs,
6548 ):
-> 6549 return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/generic.py:12420, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
12413 def mean(
12414 self,
12415 axis: Axis | None = 0,
(...)
12418 **kwargs,
12419 ) -> Series | float:
12420 return self._stat_function(
12421 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
12422 )
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/generic.py:12377, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
12373 nv.validate_func(name, (), kwargs)
12375 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
12377 return self._reduce(
12378 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
12379 )
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/series.py:6457, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
6452 # GH#47500 - change to TypeError to match other methods
6453 raise TypeError(
6454 f"Series.{name} does not allow {kwd_name}={numeric_only} "
6455 "with non-numeric dtypes."
6456 )
-> 6457 return op(delegate, skipna=skipna, **kwds)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
723 count = cast(np.ndarray, count)
File ~/.pyenv/versions/3.10.15/envs/adhoc/lib/python3.10/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
1698 elif not (is_float(x) or is_integer(x) or is_complex(x)):
1699 if isinstance(x, str):
1700 # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701 raise TypeError(f"Could not convert string '{x}' to numeric")
1702 try:
1703 x = float(x)
TypeError: Could not convert string '2015-06-012010-10-012016-08-012013-05-012017-04-012016-08-012015-07-012016-07-012012-08-012017-02-012016-11-012015-04-012015-03-012015-08-012017-04-012015-08-012015-11-012016-10-012016-11-012016-01-012016-02-012015-03-012014-06-012017-08-012014-05-012015-08-012011-05-012016-09-012012-10-012015-01-012016-06-012015-08-012013-03-012016-03-012018-06-012017-11-012018-03-012011-10-012016-07-012014-07-012016-04-012013-05-012016-02-012015-03-012014-09-012015-09-012015-04-012016-01-012013-12-012014-10-012017-05-012016-06-012016-07-012016-01-012017-08-012016-03-012018-09-012015-11-012015-03-012015-02-012017-08-012016-07-012016-01-012015-07-012016-03-012014-07-012013-02-012014-06-012014-06-012014-10-012015-11-012015-01-012015-08-012015......' to numeric"
}
Configuration
Additional context
The string in the error message is too long to fit in a GitHub issue, so I shortened the date string a bit.
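Judging from the traceback, a column of date strings seems to be treated as continuous, so NullTransformer ends up calling data.mean() on string values. Below is a minimal sketch that appears to reproduce the same pandas failure outside sdgx. The sample values and the column name `date_col` are made up for illustration, and I am assuming the real column holds plain strings rather than a datetime dtype.

```python
import pandas as pd

# Hypothetical sample values; the real column in the notebook's dataset is much longer.
dates = pd.Series(["2015-06-01", "2010-10-01", "2016-08-01"], name="date_col")


def mean_raises_type_error(series: pd.Series) -> bool:
    """Return True if Series.mean() raises TypeError for this series."""
    try:
        series.mean()
    except TypeError:
        return True
    return False


# Date strings cannot be averaged, which matches the error in the traceback.
print(mean_raises_type_error(dates))

# After parsing to a datetime dtype, pandas can compute a mean.
parsed = pd.to_datetime(dates)
print(mean_raises_type_error(parsed))
```

This only shows the pandas side of the problem. Whether the right fix is in the metadata inspection or in the transformer is something I have not checked.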