How can I override the feature semantic for multi-dimensional features? #145

Open
@NashShuai

Description

I am trying to train a random forest (RF) regression model on my dataset. The first 5 rows of my dataframe look like this:

              target                                          feat_1  \
county  year                                                                 
County_1 2000   0.047879  [0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0, ...   
        2001  -0.112184  [0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1...   
        2002   0.060659  [0, 0, 0, 0, 3, 24, 33, 32, 42, 58, 0, 0, 0, 2...   
        2003   0.098047  [0, 0, 1, 5, 13, 22, 40, 38, 29, 42, 0, 0, 0, ...   
        2004  -0.053559  [0, 1, 0, 2, 6, 8, 14, 33, 34, 64, 0, 0, 1, 1,...   

                                                      feat_2  
county  year                                                     
County_1 2000  [1.8121698113207556, 0.938584905660378, -0.568...  
        2001  [2.6941509433962274, 3.888301886792455, 2.8169...  
        2002  [-3.4043396226415084, -3.458113207547169, -3.5...  
        2003  [-1.9566037735849044, -2.3393396226415084, -2....  
        2004  [-3.2046226415094323, -3.502075471698112, -2.9...  

When I run the regression using only feat_1, the code works fine.

FEATURES_IN = ['feat_1']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

[This works. I can call model.describe() afterwards to inspect the model.]

However, when I also include feat_2 in the regression,

FEATURES_IN = ['feat_1', 'feat_2']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

it raises the following error:

ValueError: Cannot import column 'feat_2' with semantic=Semantic.CATEGORICAL_SET as it contains floating point values.
Note: If the column is a label, make sure the correct task is selected. For example, you cannot train a classification model (task=ydf.Task.CLASSIFICATION) with floating point labels.

I am not sure how to override the feature semantic for multi-dimensional features, and I could not find this in the documentation. I tried setting the semantic explicitly:

FEATURES_IN = [
    ydf.Feature("feat_1", ydf.Semantic.NUMERICAL),
    ydf.Feature("feat_2", ydf.Semantic.NUMERICAL),
]
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)

but that does not work either:

ValueError: Cannot convert NUMERICAL column 'feat_1' of type numpy's array of 'object' and with content=array([array([  0,   2,  10,   9,  12,  10,   9,  20,  35,  51,   0,   0,   0,
                0,   0,   3,   7,  13,  12,  36,   0,   0,   0,   4,   0,   6,
               10,  11,  36,  63,   0,   0,   1,   0,   3,   8,  27,  34,  60,
               93,   0,   0,   0,   0,   0,   3,   8,   9,  25,  18,   0,   0,
                0,   0,   0,   4,   2,  13,  11,  15,   0,   3,  30, 179, 159,
              102,  87,  85,  60,  68])                                       ,
       array([ 0,  1,  0,  2,  1,  2,  4,  9, 18, 34,  0,  1,  0,  1,  1,  5,  8,
              30, 44, 67,  0,  0,  0,  1,  0,  2, 13, 26, 33, 63,  0,  0,  0,  0,
               0,  0,  3, 13, 21, 27,  0,  0,  0,  0,  1,  1,  4,  6, 11, 25,  0,
               0,  1,  2,  4,  1,  9, 20, 40, 41,  1,  1,  2,  6,  4, 10, 19, 25,
              18, 47])                                                  
[The error message prints the entire array, which is very long; the middle part is omitted here.]
      array([  136,  2545, 11700, 18486, 15007, 10840,  7356,  5265,  5448,
               5890,    84,   140,   119,   156,   260,   646,  1778,  2549,
               2890,  2992,     0,     1,     3,     8,    17,    20,    72,
                 91,   151,   179,     0,     0,     1,     4,     5,    12,
                 16,    18,    68,    98,     2,     1,     0,     0,     7,
                 16,    15,    21,    46,    94,     0,     0,     2,    11,
                 12,    23,    16,    34,    67,   117,     4,    10,    25,
                 34,    31,    30,    43,    66,    97,   180])             ],
      dtype=object) to np.float32 values.
Note: If the column is a label, make sure the training task is compatible. For example, you cannot train a regression model (task=ydf.Task.REGRESSION) on a string column.

I am not sure how to handle this. What is the right way to pass multi-dimensional numerical features to the learner?
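
One possible workaround, offered only as a sketch and not as the library's recommended approach: since the second error says an object column of arrays cannot be converted to np.float32, each array-valued column could be expanded into one scalar column per element before training. The helper name `expand_array_columns`, the toy frame `df_toy`, and the `feat_1_0`, `feat_1_1`, ... naming scheme are hypothetical. The sketch assumes every row holds an array of the same length within a given column. The values below are random placeholders, not the real data.

```python
import numpy as np
import pandas as pd


def expand_array_columns(df, columns):
    """Replace each array-valued column with one float32 column per element.

    Hypothetical helper: assumes all rows of a column hold arrays of equal length.
    """
    parts = [df.drop(columns=list(columns))]
    for name in columns:
        # Stack the per-row arrays into a regular 2-D array (n_rows, n_values).
        values = np.stack(df[name].tolist()).astype(np.float32)
        expanded = pd.DataFrame(
            values,
            index=df.index,
            columns=[f"{name}_{i}" for i in range(values.shape[1])],
        )
        parts.append(expanded)
    return pd.concat(parts, axis=1)


# Toy frame with the same layout as in the issue (placeholder values only).
index = pd.MultiIndex.from_product(
    [["County_1"], [2000, 2001, 2002]], names=["county", "year"]
)
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(
    {
        "target": rng.normal(size=3),
        "feat_1": [rng.integers(0, 100, size=4) for _ in range(3)],
        "feat_2": [rng.normal(size=4) for _ in range(3)],
    },
    index=index,
)

df_flat = expand_array_columns(df_toy, ["feat_1", "feat_2"])

# Every column except the label is now a plain numerical feature.
feature_names = [c for c in df_flat.columns if c != "target"]
```

The flattened frame and `feature_names` could then be passed to the learner in the same way as in the first snippet (`features=feature_names`, `learner.train(df_flat)`). Whether YDF offers a more direct way to ingest such columns is not established here.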
