Long Column names are unexpectedly dropped in training #6045

Open
@torronen

System Information (please complete the following information):

  • OS & Version: Windows 11
  • ML.NET Version: ML.NET 1.6.0
  • .NET Version: .NET 6.0

Describe the bug
A dataset may include long column names. In my case they are about 150 characters long and contain the characters a-z, A-Z, 0-9 and the dash. Column inference reports them correctly. However, after training is started with AutoML, these columns are not used for training. If the dataset contains only long column names, an exception about a missing "Features" column is thrown.

To Reproduce
Steps to reproduce the behavior:

  1. Create dataset with long column names (numeric in my case)
  2. Column inference reports them correctly:
    ColumnInferenceResults columnInference = mlContext.Auto().InferColumns(TrainDataPath, LabelColumnName, groupColumns: false);
  3. Train:
    experimentResult = experiment.Execute(TrainDataView, ValidationDataView, columnInformation, null, progressHandler);
  4. Observe exception about missing Features.
  5. Rename the columns to shorter names, either manually or in a loop, to confirm that training now works. This can also be used as a workaround for now:
    var copyPipeline = mlContext.Transforms.CopyColumns("col" + i, col.Name);
    OriginalTrainDataView = copyPipeline.Fit(OriginalTrainDataView).Transform(OriginalTrainDataView);
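
The renaming loop from step 5 could look roughly like the sketch below. This is only an illustration of the workaround, not a confirmed fix. `ColumnRenamer`, `ShortenColumnNames`, and the `maxLength` threshold are hypothetical names introduced here; the actual length limit (if there is one) is not known. The sketch also assumes that no existing column is already named `col0`, `col1`, and so on.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class ColumnRenamer
{
    // Hypothetical helper, not part of ML.NET.
    // maxLength is an assumed threshold; the real limit is unknown.
    public static IDataView ShortenColumnNames(
        MLContext mlContext,
        IDataView data,
        int maxLength,
        out Dictionary<string, string> shortToOriginal)
    {
        shortToOriginal = new Dictionary<string, string>();

        // Snapshot the long names first, because `data` is reassigned in the loop.
        List<string> longNames = data.Schema
            .Where(c => !c.IsHidden && c.Name.Length > maxLength)
            .Select(c => c.Name)
            .ToList();

        if (longNames.Count == 0)
        {
            return data;
        }

        int i = 0;
        foreach (string name in longNames)
        {
            // Assumes "col" + i does not collide with an existing column name.
            string shortName = "col" + i++;
            shortToOriginal[shortName] = name;

            // Copy the long-named column to a short-named one.
            data = mlContext.Transforms.CopyColumns(shortName, name)
                .Fit(data)
                .Transform(data);
        }

        // Drop the original long-named columns so only the short copies remain.
        return mlContext.Transforms.DropColumns(longNames.ToArray())
            .Fit(data)
            .Transform(data);
    }
}
```

If the label column or any column listed in the ColumnInformation has a long name, the label name and the ColumnInformation passed to Execute would need to be updated to the new short names as well. The returned dictionary keeps the mapping back to the original names.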

Note: I have tree-based algorithms enabled.

Expected behavior
Columns with long names should be used in training normally.
If that is not possible, an exception should be thrown. Currently the user might think all data is being used for training when in fact some columns are ignored.

It is possible that Verbose-level logging would give information about this, but it is disabled by default in AutoML. I did not run a separate test with verbose output.

Additional data
There may be many reasons why a dataset could include long column names. For example, the name, ID and settings of a measurement device may be included in the column name.

If possible, I'd like to know what the current column name length limit is, even if this gets fixed. That would help determine which fields have been ignored in earlier models.

Labels: AutoML.NET, Awaiting User Input, P2