How can I override the feature semantic for multi-dimensional features?

I am trying to train a random forest (RF) regression model on my dataset. My dataframe looks like this (only the first 5 rows are shown):
```
              target                                          feat_1  \
county  year                                                                 
County_1 2000   0.047879  [0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0, ...   
        2001  -0.112184  [0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1...   
        2002   0.060659  [0, 0, 0, 0, 3, 24, 33, 32, 42, 58, 0, 0, 0, 2...   
        2003   0.098047  [0, 0, 1, 5, 13, 22, 40, 38, 29, 42, 0, 0, 0, ...   
        2004  -0.053559  [0, 1, 0, 2, 6, 8, 14, 33, 34, 64, 0, 0, 1, 1,...   

                                                      feat_2  
county  year                                                     
County_1 2000  [1.8121698113207556, 0.938584905660378, -0.568...  
        2001  [2.6941509433962274, 3.888301886792455, 2.8169...  
        2002  [-3.4043396226415084, -3.458113207547169, -3.5...  
        2003  [-1.9566037735849044, -2.3393396226415084, -2....  
        2004  [-3.2046226415094323, -3.502075471698112, -2.9...  
```

When I run the regression using only `feat_1`, the code works fine:
```py
import ydf

# df_train is the dataframe shown above.
FEATURES_IN = ['feat_1']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
```
This works, and I can later call `model.describe()` to inspect the model.

However, when I also include `feat_2` in the regression,
```py
FEATURES_IN = ['feat_1', 'feat_2']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
```
it raises the following error:
```
ValueError: Cannot import column 'feat_2' with semantic=Semantic.CATEGORICAL_SET as it contains floating point values.
Note: If the column is a label, make sure the correct task is selected. For example, you cannot train a classification model (task=ydf.Task.CLASSIFICATION) with floating point labels.
```
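The message suggests that the list-valued column was inferred as `Semantic.CATEGORICAL_SET`. As a quick check of what the dataframe actually holds, the sketch below summarizes one such column using pandas only. The helper name `summarize_list_column` and the toy values are made up for illustration and are not part of YDF.

```py
import pandas as pd


def summarize_list_column(series: pd.Series) -> dict:
    """Report the pandas dtype, per-row lengths, and element types of a list-valued column."""
    lengths = series.map(len)
    element_types = {type(x).__name__ for row in series for x in row}
    return {
        "dtype": str(series.dtype),
        "min_len": int(lengths.min()),
        "max_len": int(lengths.max()),
        "element_types": sorted(element_types),
    }


# Toy stand-in for df_train (illustrative values only).
toy = pd.DataFrame({
    "target": [0.05, -0.11],
    "feat_2": [[1.8, 0.9, -0.5], [2.7, 3.9, 2.8]],
})
summary = summarize_list_column(toy["feat_2"])
print(summary)
```

If `min_len` and `max_len` differ, the rows are ragged and the column cannot be treated as a fixed-size vector without padding or truncation.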

I am not sure how to override the feature semantic for multi-dimensional features, and I could not find this in the documentation. I tried specifying the semantic explicitly:
```py
FEATURES_IN = [
    ydf.Feature("feat_1", ydf.Semantic.NUMERICAL),
    ydf.Feature("feat_2", ydf.Semantic.NUMERICAL),
]
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
```
but this does not work either:
```
ValueError: Cannot convert NUMERICAL column 'feat_1' of type numpy's array of 'object' and with content=array([array([  0,   2,  10,   9,  12,  10,   9,  20,  35,  51,   0,   0,   0,
                0,   0,   3,   7,  13,  12,  36,   0,   0,   0,   4,   0,   6,
               10,  11,  36,  63,   0,   0,   1,   0,   3,   8,  27,  34,  60,
               93,   0,   0,   0,   0,   0,   3,   8,   9,  25,  18,   0,   0,
                0,   0,   0,   4,   2,  13,  11,  15,   0,   3,  30, 179, 159,
              102,  87,  85,  60,  68])                                       ,
       array([ 0,  1,  0,  2,  1,  2,  4,  9, 18, 34,  0,  1,  0,  1,  1,  5,  8,
              30, 44, 67,  0,  0,  0,  1,  0,  2, 13, 26, 33, 63,  0,  0,  0,  0,
               0,  0,  3, 13, 21, 27,  0,  0,  0,  0,  1,  1,  4,  6, 11, 25,  0,
               0,  1,  2,  4,  1,  9, 20, 40, 41,  1,  1,  2,  6,  4, 10, 19, 25,
              18, 47])                                                  
[The error message prints the whole array, which is way too long. I omit the contents in the middle here.]
      array([  136,  2545, 11700, 18486, 15007, 10840,  7356,  5265,  5448,
               5890,    84,   140,   119,   156,   260,   646,  1778,  2549,
               2890,  2992,     0,     1,     3,     8,    17,    20,    72,
                 91,   151,   179,     0,     0,     1,     4,     5,    12,
                 16,    18,    68,    98,     2,     1,     0,     0,     7,
                 16,    15,    21,    46,    94,     0,     0,     2,    11,
                 12,    23,    16,    34,    67,   117,     4,    10,    25,
                 34,    31,    30,    43,    66,    97,   180])             ],
      dtype=object) to np.float32 values.
Note: If the column is a label, make sure the training task is compatible. For example, you cannot train a regression model (task=ydf.Task.REGRESSION) on a string column.
```

I am not sure how to proceed. Is there a way to declare list-valued columns such as `feat_1` and `feat_2` as multi-dimensional numerical features?
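One possible workaround, sketched under the assumption that every row of a given column holds a vector of the same length, is to expand each list-valued column into one scalar float column per element before training. This uses only pandas and NumPy. The helper `expand_vector_columns` and the `<column>_<index>` naming scheme are hypothetical choices for illustration, not YDF conventions, and this may not be the approach YDF intends for multi-dimensional features.

```py
import numpy as np
import pandas as pd


def expand_vector_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Replace each list-valued column with one float32 column per vector element.

    Assumes all rows of a given column have the same length; np.stack raises
    a ValueError otherwise.
    """
    parts = [df.drop(columns=columns)]
    for col in columns:
        # Stack the per-row sequences into a 2-D array: (n_rows, vector_length).
        matrix = np.stack([np.asarray(v, dtype=np.float32) for v in df[col]])
        names = [f"{col}_{i}" for i in range(matrix.shape[1])]
        parts.append(pd.DataFrame(matrix, index=df.index, columns=names))
    return pd.concat(parts, axis=1)


# Toy stand-in for df_train (illustrative values only).
toy = pd.DataFrame({
    "target": [0.05, -0.11],
    "feat_1": [[0, 2, 10], [0, 1, 0]],
    "feat_2": [[1.8, 0.9, -0.5], [2.7, 3.9, 2.8]],
})
flat = expand_vector_columns(toy, ["feat_1", "feat_2"])
print(flat.columns.tolist())
```

The expanded columns are plain scalar floats, so the names in `flat.columns` (minus `target`) could then be passed as `features` to the learner in place of the original list-valued columns. I have not verified how this compares with any native multi-dimensional support.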

How can I override the feature semantic for multi-dimensional features? #145
