KCMeans
(syntax changed as of version 1.2)
KCMeans stands for "Kernel Probability Density Estimator c-means". It is a clustering algorithm based on a kernel density estimator. For more information, please see the following reference:
A Novel Neural Network Technique for Analysis and Classification of EM Single-Particle Images
A. Pascual-Montano, L. E. Donate, M. Valle, M. Bárcena, R. D. Pascual-Marqui, J. M. Carazo
Journal of Structural Biology, Vol. 133, No. 2/3, Feb 2001, pp. 233-245
$ classify_kcmeans ...
Parameters
- `` The input data file (raw file). It should be a text file in which each row is a data item and each column is a variable:

      3 1000
      12 34 54
      -12 45 76
      ...
      32 45 76

  The first line gives the dimension of the vectors (in this case 3) and the number of vectors (in this case 1000). Vector components (variables) are separated by spaces. Additionally, the last column can be used as a label for the vector. Example:

      3 1000
      12 34 54 labelA
      -12 45 76 labelB
      ...
      32 45 76 labelN
- `` The output cluster centers. This parameter will set the base name for the generated output files. KCMeans produces several files with different information and all of them will use this name but with different extensions. The generated files will be:
  - `basename.cod`: Resulting code vectors. The generated code vectors follow the same format as the input data, except that some extra information is stored in the first line of the file. Example:

        3 rect 5 1 gaussian
        11 31 52 labelA
        -10 43 71 labelB
        ...
        29 39 71 labelN
    The first line begins with the dimension of the vectors (in this case 3). The rest of the information on that line has no meaning for this algorithm, but it is required for compatibility with Kohonen's SOM_PAK package (http://www.cis.hut.fi/research/som_lvq_pak.shtml).
  - `basename.inf`: Information file about the parameters used and the resulting kernel width and quantization error. It will look like this:

        Kernel Probability Density Estimator Clustering Algorithm Kernel c-Means
        Input data file : test.dat
        Code vectors output file : test.cod
        Whole codebook output file : test.cbk
        Algorithm information output file : test.inf
        Number of feature vectors: 93
        Number of variables: 3
        Number of clusters = 5
        Input data not normalized
        Gaussian Kernel function
        Total number of iterations = 200
        Stopping criteria (eps) = 1e-07
        Final Sigma = 0.272525
        Quantization error : 0.3627
  - `basename.his`: Number of input vectors assigned to each code vector, like a histogram of the resulting code vectors. The file contains two columns: the first is the number of the code vector and the second is the number of input vectors assigned to it.
  - `basename.err`: Average quantization error for each code vector. The file contains two columns: the first is the number of the code vector and the second is its average quantization error.
  - `basename.vs`: Assignment of the original input vectors to each code vector (contains the classification information).
- `` The input code vectors file. This parameter is optional; it is useful when the code vectors are to be initialized with a set of predefined values, typically when the algorithm is run several times and the output of one run is used as the input to the next.
- `` Save, for each code vector, a file listing the indexes of the input vectors that were assigned to it. Example: if 5 clusters are used, then 5 files named `[basename].[Codevector Index]` (`[basename.0]`, `[basename.1]`, etc.) will be generated.
- `` Number of clusters
- `` Gaussian Kernel Function (default)
- `` t-Student Kernel Function
- `` If tStudent is used, this parameter provides the degrees of freedom (default = 3)

Since KCMeans is based on the Kernel Probability Density Estimator, the choice of kernel function is very important. This version of the program supports two kernels: Gaussian and tStudent. If tStudent is used, the degrees of freedom should also be supplied (3 seems to be a reasonable value and is the default).

- `` Number of iterations used in the algorithm. The default is 200.
- `` Stopping criterion. The algorithm stops when the sigma value (kernel width) varies by no more than `eps` between iterations or when the maximum number of iterations is reached. The default value is 1e-7.
- `` If this parameter is given (by default it is not), the input data will be normalized.
- `` Information level that is given as output while running:
  - `` No information (default)
  - `` Progress bar with the elapsed time and estimated time to finish
  - `` Sigma changes between iterations
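
The input data file format described in the parameter list above can be produced from a script. The following is a minimal Python sketch and is not part of KCMeans; the helper name `write_kcmeans_input` and the file name `example.dat` are made up for illustration. It assumes only what is stated above: a header line with the dimension and the number of vectors, then one space-separated vector per line with an optional label in the last column.

```python
from pathlib import Path


def write_kcmeans_input(path, vectors, labels=None):
    """Write vectors in the raw text format described above (hypothetical helper).

    First line: "<dimension> <number of vectors>".
    Following lines: space-separated components, optionally followed by a label.
    """
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    if labels is not None and len(labels) != len(vectors):
        raise ValueError("need exactly one label per vector")

    lines = [f"{dim} {len(vectors)}"]
    for i, vec in enumerate(vectors):
        fields = [str(x) for x in vec]
        if labels is not None:
            # Labels must not contain spaces, since spaces separate the columns.
            fields.append(str(labels[i]))
        lines.append(" ".join(fields))
    Path(path).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    data = [[12, 34, 54], [-12, 45, 76], [32, 45, 76]]
    write_kcmeans_input("example.dat", data, labels=["labelA", "labelB", "labelN"])
    print(Path("example.dat").read_text(), end="")
```

The header is computed from the data rather than typed by hand, so the declared dimension and vector count always match the rows that follow.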
Example 1: Use Kernel c-means to cluster a set of data stored in the "test.dat" file into 5 clusters
$ classify_kcmeans -i test.dat -o test -c 5 -verb 1 -saveclusters
This run uses the following parameters; those not given on the command line take their default values:
Input data file : test.dat
Output file name : test
Number of clusters = 5
Gaussian Kernel function
Total number of iterations = 200
Stopping criteria (eps) = 1e-07
verbosity level = 1
Do not normalize input data
This run generates 5 cluster centers (-c 5). A Gaussian kernel function is used for the kernel density estimator (-gaussian, the default). The algorithm stops when the sigma value (kernel width) varies by no more than 1e-7 between iterations (-eps 1e-7, the default) or when 200 iterations have been performed. Because the -saveclusters parameter is used, the list of input data assigned to each cluster is stored in the files test.0 to test.4. A progress bar and the elapsed/estimated time are shown on the console (-verb 1).
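
The stopping rule (change in sigma no larger than eps, or the iteration limit reached) can be sketched in a few lines of Python. This only illustrates the loop control and is not the KCMeans implementation: `run_until_converged` and `update_sigma` are hypothetical names, and the toy update used below stands in for the real sigma update, which is not reproduced here.

```python
def run_until_converged(update_sigma, sigma0, eps=1e-7, max_iter=200):
    """Iterate until sigma changes by no more than eps or max_iter steps are done.

    update_sigma is a caller-supplied placeholder for one iteration of the
    algorithm; it is NOT the KCMeans update rule.
    Returns (final sigma, iterations performed, whether eps was reached).
    """
    sigma = sigma0
    for iteration in range(1, max_iter + 1):
        new_sigma = update_sigma(sigma)
        converged = abs(new_sigma - sigma) <= eps
        sigma = new_sigma
        if converged:
            return sigma, iteration, True
    return sigma, max_iter, False


# Toy update (illustration only): halves the distance to 0.25 at every step.
sigma, steps, converged = run_until_converged(
    lambda s: 0.25 + 0.5 * (s - 0.25), sigma0=1.0
)
print(f"sigma={sigma:.6f} after {steps} iterations, converged={converged}")
```

The defaults mirror the documented ones (200 iterations, eps of 1e-7); whichever condition is met first ends the run.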
The following files will be generated:
- `test.cod`: The final code vector file, in the format described above.
- `test.inf`: Information file about the parameters used and the resulting kernel width and quantization error.
- `test.his`: Number of input vectors assigned to each code vector, like a histogram.
- `test.err`: Average quantization error for each code vector. Useful for assessing the value of the cost function and the kernel width.
- `test.0` to `test.4`: Each file lists the input data vectors assigned to one code vector (cluster).
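
The output files are plain text and can be read back for further analysis. The sketch below is a hedged Python example, not part of KCMeans: `parse_codebook` and `parse_histogram` are hypothetical helpers, the histogram sample values are invented, and the code assumes whitespace-separated columns with integer counts, as the descriptions above suggest.

```python
def parse_codebook(text):
    """Parse .cod content: a header line, then one code vector per line.

    Only the first header field (the dimension) is used; the remaining
    header fields exist for SOM_PAK compatibility and are ignored here.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    dim = int(lines[0].split()[0])
    vectors, labels = [], []
    for line in lines[1:]:
        fields = line.split()
        vectors.append([float(x) for x in fields[:dim]])
        # An optional label may follow the vector components.
        labels.append(fields[dim] if len(fields) > dim else None)
    return dim, vectors, labels


def parse_histogram(text):
    """Parse .his content: code vector number and number of assigned vectors."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        counts[int(fields[0])] = int(fields[1])
    return counts


# Sample contents: the .cod rows come from the example above,
# the .his numbers are invented for illustration.
cod_text = "3 rect 5 1 gaussian\n11 31 52 labelA\n-10 43 71 labelB\n29 39 71 labelN\n"
his_text = "0 40\n1 30\n2 23\n"

dim, vectors, labels = parse_codebook(cod_text)
counts = parse_histogram(his_text)
print(dim, vectors[0], labels[0])
print("largest cluster:", max(counts, key=counts.get))
```

In practice the text would come from reading `test.cod` and `test.his` with `open(...).read()`.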
--Main.AlfredoSolano - 24 Jan 2007