Description
In the svm_binary_svc_probability() function, a random shuffle is applied to the training data before it is used in the 5-fold cross-validation. The shuffle is implemented by the following code:
for(i=0;i<prob->l;i++) perm[i]=i;
for(i=0;i<prob->l;i++)
{
    int j = i+rand()%(prob->l-i);
    swap(perm[i],perm[j]);
}
The rand() function returns a random integer between 0 and RAND_MAX. The C standard only guarantees RAND_MAX >= 32767, and on many platforms it is exactly 32767 (on my PC, Windows, x64-based processor, RAND_MAX is this value). So whenever prob->l - i is larger than RAND_MAX, the code above can only swap perm[i] with positions between i and i + RAND_MAX; elements further away are never reachable. Note that the training data svm_problem *prob passed to svm_binary_svc_probability() has already been sorted by label (+1, -1 for binary classification), so the first block of prob->y[i] holds the +1 examples. If the number of training examples with label +1 exceeds RAND_MAX, the first validation fold of the 5-fold cross-validation will likely consist almost entirely of +1 examples, which produces strange estimates for probA and probB.
I therefore suggest using the random-number routine from William H. Press et al., Numerical Recipes in C, which returns a random float between 0 and 1. A separate question: why doesn't svm_binary_svc_probability() use a stratified shuffle, as svm_cross_validation() does?