I am trying to train a random forest (RF) regression model on my dataset. My dataframe looks like this (only the first 5 rows are shown):
target feat_1 \
county year
County_1 2000 0.047879 [0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0, ...
2001 -0.112184 [0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1...
2002 0.060659 [0, 0, 0, 0, 3, 24, 33, 32, 42, 58, 0, 0, 0, 2...
2003 0.098047 [0, 0, 1, 5, 13, 22, 40, 38, 29, 42, 0, 0, 0, ...
2004 -0.053559 [0, 1, 0, 2, 6, 8, 14, 33, 34, 64, 0, 0, 1, 1,...
feat_2
county year
County_1 2000 [1.8121698113207556, 0.938584905660378, -0.568...
2001 [2.6941509433962274, 3.888301886792455, 2.8169...
2002 [-3.4043396226415084, -3.458113207547169, -3.5...
2003 [-1.9566037735849044, -2.3393396226415084, -2....
2004 [-3.2046226415094323, -3.502075471698112, -2.9...
When I run the regression using only feat_1, the code works perfectly fine.
FEATURES_IN = ['feat_1']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
[This works well. I can call model.describe() afterwards to inspect the model.]
However, when I include feat_2 in the regression,
FEATURES_IN = ['feat_1', 'feat_2']
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
it raises this error:
ValueError: Cannot import column 'feat_2' with semantic=Semantic.CATEGORICAL_SET as it contains floating point values.
Note: If the column is a label, make sure the correct task is selected. For example, you cannot train a classification model (task=ydf.Task.CLASSIFICATION) with floating point labels.
I am not sure how to override the feature semantic for multi-dimensional features, and I could not find this in the documentation. I tried setting the semantic explicitly:
FEATURES_IN = [
ydf.Feature("feat_1", ydf.Semantic.NUMERICAL),
ydf.Feature("feat_2", ydf.Semantic.NUMERICAL),
]
learner = ydf.RandomForestLearner(label='target', features=FEATURES_IN, task=ydf.Task.REGRESSION)
model = learner.train(df_train)
but that does not work either:
ValueError: Cannot convert NUMERICAL column 'feat_1' of type numpy's array of 'object' and with content=array([array([ 0, 2, 10, 9, 12, 10, 9, 20, 35, 51, 0, 0, 0,
0, 0, 3, 7, 13, 12, 36, 0, 0, 0, 4, 0, 6,
10, 11, 36, 63, 0, 0, 1, 0, 3, 8, 27, 34, 60,
93, 0, 0, 0, 0, 0, 3, 8, 9, 25, 18, 0, 0,
0, 0, 0, 4, 2, 13, 11, 15, 0, 3, 30, 179, 159,
102, 87, 85, 60, 68]) ,
array([ 0, 1, 0, 2, 1, 2, 4, 9, 18, 34, 0, 1, 0, 1, 1, 5, 8,
30, 44, 67, 0, 0, 0, 1, 0, 2, 13, 26, 33, 63, 0, 0, 0, 0,
0, 0, 3, 13, 21, 27, 0, 0, 0, 0, 1, 1, 4, 6, 11, 25, 0,
0, 1, 2, 4, 1, 9, 20, 40, 41, 1, 1, 2, 6, 4, 10, 19, 25,
18, 47])
[The error message prints the whole array, which is far too long to paste, so I have omitted the middle part here.]
array([ 136, 2545, 11700, 18486, 15007, 10840, 7356, 5265, 5448,
5890, 84, 140, 119, 156, 260, 646, 1778, 2549,
2890, 2992, 0, 1, 3, 8, 17, 20, 72,
91, 151, 179, 0, 0, 1, 4, 5, 12,
16, 18, 68, 98, 2, 1, 0, 0, 7,
16, 15, 21, 46, 94, 0, 0, 2, 11,
12, 23, 16, 34, 67, 117, 4, 10, 25,
34, 31, 30, 43, 66, 97, 180]) ],
dtype=object) to np.float32 values.
Note: If the column is a label, make sure the training task is compatible. For example, you cannot train a regression model (task=ydf.Task.REGRESSION) on a string column.
I am not sure how to pass array-valued columns like feat_1 and feat_2 to the learner.
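One possible workaround, as a minimal sketch rather than a confirmed ydf recipe: expand each array-valued column into one scalar column per element, so that every column the learner sees holds plain numbers. The helper name expand_array_columns, the toy dataframe, and the generated column names (feat_1_0, feat_1_1, ...) are made up for illustration. The sketch assumes every row of a given column holds an array of the same length. The ydf call at the end is commented out and simply reuses the learner arguments from above without the features list.

```python
import numpy as np
import pandas as pd


def expand_array_columns(df, columns):
    """Replace each array-valued column with one scalar column per element.

    Hypothetical helper, not part of ydf. Assumes all rows of a column
    hold arrays of the same length.
    """
    parts = [df.drop(columns=list(columns))]
    for name in columns:
        # np.stack raises ValueError if the rows differ in length.
        values = np.stack([np.asarray(v, dtype=np.float32) for v in df[name]])
        names = [f"{name}_{i}" for i in range(values.shape[1])]
        parts.append(pd.DataFrame(values, index=df.index, columns=names))
    return pd.concat(parts, axis=1)


# Tiny made-up frame with the same layout as the dataframe in this issue.
index = pd.MultiIndex.from_tuples(
    [("County_1", 2000), ("County_1", 2001)], names=["county", "year"]
)
toy = pd.DataFrame(
    {
        "target": [0.05, -0.11],
        "feat_1": [np.array([0, 2, 10]), np.array([0, 1, 0])],
        "feat_2": [np.array([1.8, 0.9, -0.5]), np.array([2.7, 3.9, 2.8])],
    },
    index=index,
)

flat = expand_array_columns(toy, ["feat_1", "feat_2"])
print(flat.columns.tolist())

# With ydf installed, the flattened frame could then be passed to the learner
# (untested here; leaving out `features` is assumed to use every non-label column):
# learner = ydf.RandomForestLearner(label="target", task=ydf.Task.REGRESSION)
# model = learner.train(flat)
```

The trade-off is that the model then treats each element as an independent numerical feature, which only makes sense if position i means the same thing in every row.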