Description
System Information (please complete the following information):
- OS & Version: Windows 11
- ML.NET Version: ML.NET 1.6.0
- .NET Version: .NET 6.0
Describe the bug
A dataset may include long column names. In my case, they are about 150 characters long and contain only letters (a-z, A-Z), digits (0-9), and the dash character. Column inference reports them correctly. However, once AutoML training starts, these columns are not used for training. If the dataset contains only long-named columns, an error about a missing "Features" column is thrown.
To Reproduce
Steps to reproduce the behavior:
- Create a dataset with long column names (numeric columns in my case)
- Column inference reports them correctly:
ColumnInferenceResults columnInference = mlContext.Auto().InferColumns(TrainDataPath, LabelColumnName, groupColumns: false);
- Train:
experimentResult = experiment.Execute(TrainDataView, ValidationDataView, columnInformation, null, progressHandler);
- Observe the exception about the missing "Features" column.
- Rename the columns to shorter names, either manually or in a loop, and confirm that training now works. This can also be used as a workaround for now.
var copyPipeline = mlContext.Transforms.CopyColumns("col" + i, col.Name);
OriginalTrainDataView = copyPipeline.Fit(OriginalTrainDataView).Transform(OriginalTrainDataView);
Note: I have tree-based algorithms enabled.
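For anyone who needs the renaming workaround from the last step, here is a minimal sketch of how the loop could look. `ColumnShortener`, `ShortenColumnNames`, and the `shortToOriginal` map are hypothetical names introduced for illustration and are not part of ML.NET. The sketch assumes that no existing column is already named `col0`, `col1`, and so on, and it renames every non-label column because the actual length limit is not known.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

public static class ColumnShortener
{
    // Hypothetical helper (not an ML.NET API): copies every non-label column
    // to a short name ("col0", "col1", ...) and drops the long-named originals.
    public static IDataView ShortenColumnNames(
        MLContext mlContext,
        IDataView data,
        string labelColumnName,
        out Dictionary<string, string> shortToOriginal)
    {
        // Keeps track of which short name stands for which original column.
        shortToOriginal = new Dictionary<string, string>();
        var originals = new List<string>();
        IEstimator<ITransformer> pipeline = null;

        var i = 0;
        foreach (var col in data.Schema)
        {
            // Leave hidden columns and the label column untouched.
            if (col.IsHidden || col.Name == labelColumnName)
                continue;

            var shortName = "col" + i++;
            shortToOriginal[shortName] = col.Name;
            originals.Add(col.Name);

            // CopyColumns takes (outputColumnName, inputColumnName).
            var copy = mlContext.Transforms.CopyColumns(shortName, col.Name);
            if (pipeline == null)
                pipeline = copy;
            else
                pipeline = pipeline.Append(copy);
        }

        // Nothing to rename.
        if (pipeline == null)
            return data;

        // Remove the long-named originals so only the short copies remain.
        pipeline = pipeline.Append(mlContext.Transforms.DropColumns(originals.ToArray()));
        return pipeline.Fit(data).Transform(data);
    }
}
```

The returned data view would then be passed to `experiment.Execute` in place of the original one. Any column names listed in `columnInformation` presumably need to be replaced with the new short names as well, and `shortToOriginal` can be used to map results back to the original column names.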
Expected behavior
Columns with long names should be used for training normally.
If that is not possible, an exception should be thrown. As it stands, a user might think all data is being used for training while some columns are actually ignored.
It is possible that the Verbose log level would report this, but it is disabled by default in AutoML. I did not run a separate test with verbose output.
Additional data
There are many reasons why a dataset could include long column names. For example, a column name may include the name, ID, and settings of a measurement device.
If possible, I'd like to know the current column name length limit, even if this gets fixed. That would help identify which fields were ignored in earlier models.