Description
In stepping through fit for NeuralNetworkRegressor, using the data at the top of the test file regressors.jl, I am getting some unexpected behaviour.
Here is a minimal version of that data giving the same behaviour:
using MLJBase, MLJFlux, Tables
X = (;
    Column2 = categorical(repeat(['a', 'b', 'c'], 10)),
    Column3 = categorical(repeat(["b", "c", "d"], 10), ordered = true),
)
y = rand(Float32, 30)
schema(X)
# ┌─────────┬──────────────────┬──────────────────────────────────┐
# │ names   │ scitypes         │ types                            │
# ├─────────┼──────────────────┼──────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{Char, UInt32}   │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{String, UInt32} │
# └─────────┴──────────────────┴──────────────────────────────────┘
And the model:
model = NeuralNetworkRegressor()
Okay, now the following lines are copied from fit, as given in "src/mlj_model_interface.jl" on the dev branch:
# Get input properties
shape = MLJFlux.shape(model, X, y)
cat_inds = MLJFlux.get_cat_inds(X)
pure_continuous_input = isempty(cat_inds)
# Decide whether to enable entity embeddings (e.g., ImageClassifier won't)
enable_entity_embs = MLJFlux.is_embedding_enabled(model) && !pure_continuous_input
# Prepare entity embeddings inputs and encode X if entity embeddings enabled
featnames = []
if enable_entity_embs
    X = MLJFlux.convert_to_table(X)
    featnames = Tables.schema(X).names
end
# entityprops is (index = cat_inds[i], levels = num_levels[i], newdim = newdims[i])
# for each categorical feature
default_embedding_dims = enable_entity_embs ? model.embedding_dims : Dict{Symbol, Real}()
entityprops, entityemb_output_dim =
    MLJFlux.prepare_entityembs(X, featnames, cat_inds, default_embedding_dims)
X, ordinal_mappings = MLJFlux.ordinal_encoder_fit_transform(X; featinds = cat_inds)
At this point I expect every column of X to have Continuous element scitype, with no categoricals remaining. However:
schema(X)
# ┌─────────┬──────────────────┬─────────────────────────────────────────┐
# │ names   │ scitypes         │ types                                   │
# ├─────────┼──────────────────┼─────────────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{AbstractFloat, UInt32} │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{AbstractFloat, UInt32} │
# └─────────┴──────────────────┴─────────────────────────────────────────┘
The raw element type is Float32, but the columns are getting wrapped as categorical vectors:
typeof(X.Column2)
CategoricalVector{AbstractFloat, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}} (alias for CategoricalArray{AbstractFloat, 1, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}})
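For comparison, here is a minimal sketch of the distinction I mean, separate from the MLJFlux code above. It assumes CategoricalArrays is available in the environment, and the vector v is a hypothetical stand-in for an encoded column, not the actual output of ordinal_encoder_fit_transform. It shows floats wrapped as categorical values, and the plain Float32 vector I would have expected instead:

```julia
using CategoricalArrays  # assumed installed; provides `categorical` and `unwrap`

# Hypothetical stand-in for an encoded column: float codes wrapped as categoricals
v = categorical(Float32[1, 2, 3, 1, 2, 3])

# Each element is a CategoricalValue rather than a bare float
wrapped = eltype(v) <: CategoricalValue

# `unwrap` returns the raw value stored in each CategoricalValue,
# so broadcasting it yields an ordinary vector of floats
raw = unwrap.(v)
```

Here wrapped should be true, while raw should be an ordinary Vector{Float32} with no categorical wrapper, which is the kind of column I expected to see after the encoding step.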