
KCMeans

Adrian Quintana edited this page Dec 11, 2017 · 1 revision

classify_kcmeans

(syntax changed as of version 1.2)

Purpose

KCMeans stands for "Kernel Probability Density Estimator c-means". It is a clustering algorithm based on a kernel density estimator. For more information, please see the following reference:


A Novel Neural Network Technique for Analysis and Classification of EM Single-Particle Images
A. Pascual-Montano, L. E. Donate, M. Valle, M. Bárcena, R. D. Pascual-Marqui, J. M. Carazo 
Journal of Structural Biology, Vol. 133, No. 2/3, Feb 2001, pp. 233-245


Usage


$ classify_kcmeans ...


Parameters

  • `` The input data file (raw file). It should be a text file in which each row is a data item and each column is a variable:
      3 1000
      12 34 54
      -12 45 76
      ...
      32 45 76

The first line indicates the dimension of the vectors (in this case 3) and the number of vectors (in this case 1000). Please note that vector components (variables) are separated by spaces. Additionally, an extra last column can be used as a label for the vector. Example:

      3 1000
      12 34 54     labelA
      -12 45 76    labelB
      ...
      32 45 76     labelN
  • `` The output cluster centers. This parameter will set the base name for the generated output files. KCMeans produces several files with different information and all of them will use this name but with different extensions. The generated files will be:
    • basename.cod Resulting code vectors. The generated code vectors follow the same format as the input data, except that some extra information is also stored in the first line of the file. Example:
      3 rect 5 1 gaussian
      11 31 52     labelA
      -10 43 71    labelB
      ...
      29 39 71     labelN

The first line begins with the dimension of the vectors (in this case 3). The rest of the information on that line has no meaning for this algorithm, but it is needed for compatibility with Kohonen's SOM_PAK package (http://www.cis.hut.fi/research/som_lvq_pak.shtml).

    • basename.inf Information file about the parameters used and the resulting kernel width and quantization error. It will look like this:
      Kernel Probability Density Estimator Clustering Algorithm
      Kernel c-Means
      Input data file : test.dat
      Code vectors output file : test.cod
      Whole codebook output file : test.cbk
      Algorithm information output file : test.inf
      Number of feature vectors: 93
      Number of variables: 3
      Number of clusters = 5
      Input data not normalized
      Gaussian Kernel function
      Total number of iterations = 200
      Stopping criteria (eps) = 1e-07
      Final Sigma = 0.272525
      Quantization error : 0.3627
    • basename.his Information about the number of input vectors assigned to each code vector. It is like a histogram of the resulting code vectors. The file contains two columns: the first column is the number of the code vector and the second column is the number of input vectors assigned to it.
    • basename.err Average quantization error for each code vector. The file contains two columns: the first column is the number of the code vector and the second column is the average quantization error for that code vector.
    • basename.vs Assignment of the original input vectors to each code vector (contains the classification information).
  • `` The input code vectors file. This parameter is optional. It is useful when the code vectors are to be initialized with a set of predefined values, typically when the algorithm is run several times and the output of one run is used as the input to the next.
  • `` Save a file for each code vector with a list of the indexes of the input vectors that were assigned to it. Example: if 5 clusters are used, then 5 files named [basename].[Codevector Index] (`[basename.0]`, `[basename.1]`, etc.) will be generated.
  • `` Number of clusters
  • `` Gaussian Kernel Function (default)
  • `` t-Student Kernel Function
  • `` Degrees of freedom, used only with the t-Student kernel (default = 3). Since KCMeans is based on the Kernel Probability Density Estimator, the kernel function used here is very important. This version of the program supports two kernels: Gaussian and t-Student. When the t-Student kernel is used, the degrees of freedom should also be supplied (3 seems to be a reasonable value and is the default).
  • `` This parameter will set the number of iterations used in the algorithm. By default 200 is used
  • `` Stopping criterion. The algorithm stops when the sigma value (kernel width) doesn't vary by more than eps between iterations, or when the maximum number of iterations is reached. By default a value of 1e-7 is used.
  • `` If this parameter is given, the input data will be normalized (by default it is not).
  • `` Information level that is given as output while running:
    • `` No information (default)
    • `` Progress bar with the elapsed time and estimated time to finish
    • `` Sigma changes between iterations
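
The input data format described for the first parameter above (a header line with the vector dimension and the number of vectors, then one vector per row with an optional trailing label) can be read with a few lines of Python. This is only a minimal sketch for preparing or checking data files outside the program. The function name `parse_kcmeans_data` is hypothetical and not part of classify_kcmeans, and the sketch assumes that fields are separated by whitespace and that any extra column after the vector components is the label.

```python
from typing import List, Optional, Tuple


def parse_kcmeans_data(text: str) -> Tuple[int, List[List[float]], List[Optional[str]]]:
    """Parse the raw data layout: 'dim count' header, then one vector per row.

    Hypothetical helper for illustration; not part of classify_kcmeans.
    """
    # Split every non-empty line into whitespace-separated fields.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 2:
        raise ValueError("missing header line with dimension and vector count")

    dim, count = int(rows[0][0]), int(rows[0][1])
    vectors: List[List[float]] = []
    labels: List[Optional[str]] = []

    for fields in rows[1:]:
        if len(fields) < dim:
            raise ValueError("row has fewer than %d components: %r" % (dim, fields))
        vectors.append([float(x) for x in fields[:dim]])
        # Assumption: a column after the components, if present, is the label.
        labels.append(fields[dim] if len(fields) > dim else None)

    if len(vectors) != count:
        raise ValueError("header announces %d vectors, found %d" % (count, len(vectors)))
    return dim, vectors, labels
```

Checking the announced vector count against the rows actually present catches truncated files before they are passed to the program.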

Examples and notes

Example 1: Use Kernel c-means to cluster a set of data stored in the file "test.dat" into 5 clusters


$ classify_kcmeans -i test.dat -o test -c 5 -verb 1 -saveclusters


In this case the following settings are used (those not given on the command line take their default values):


Input data file : test.dat
Output file name : test
Number of clusters = 5
Gaussian Kernel function
Total number of iterations = 200
Stopping criteria (eps) = 1e-07
verbosity level = 1
Do not normalize input data


In this case we are going to generate 5 cluster centers (-c 5). A Gaussian kernel function is going to be used for the kernel density estimator (-gaussian, the default). The algorithm will stop when the sigma value (kernel width) doesn't vary by more than 1e-7 between iterations (-eps 1e-7, the default) or when 200 iterations are reached. Since the -saveclusters parameter is used, a list of the input data assigned to each cluster is stored in the files test.0 to test.4. A progress bar and the elapsed/estimated time will be shown in the output console (-verb 1).
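
The stopping rule described above (stop when sigma changes by no more than eps between iterations, or when the iteration limit is reached) can be sketched as a small loop. This is an illustration of the rule only, not the program's implementation: `iterate_until_stable` and the `update_sigma` callback are hypothetical names, the real sigma update of KCMeans is not reproduced here, and whether the program compares with "less than" or "less than or equal" is an assumption.

```python
from typing import Callable, Tuple


def iterate_until_stable(
    update_sigma: Callable[[float], float],
    sigma0: float,
    max_iter: int = 200,
    eps: float = 1e-7,
) -> Tuple[float, int]:
    """Run update_sigma repeatedly until sigma stabilizes or max_iter is hit.

    Returns the final sigma and the number of iterations performed.
    update_sigma is a placeholder for one full iteration of the algorithm.
    """
    sigma = sigma0
    for iteration in range(1, max_iter + 1):
        new_sigma = update_sigma(sigma)
        # Assumed comparison: stop once the change is at most eps.
        if abs(new_sigma - sigma) <= eps:
            return new_sigma, iteration
        sigma = new_sigma
    # Iteration limit reached without meeting the eps criterion.
    return sigma, max_iter


if __name__ == "__main__":
    # Toy update that contracts toward 0.2; it stands in for the real update.
    final_sigma, steps = iterate_until_stable(lambda s: 0.5 * s + 0.1, sigma0=1.0)
    print(final_sigma, steps)
```

With the default values of 200 iterations and eps = 1e-7, whichever condition is met first ends the run.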

In this case, the following files are going to be generated:

  • test.cod The final code vector file in the format described above
  • test.inf Information file about the parameters used and the resulting kernel width and quantization error
  • test.his Information about the number of input vectors assigned to each code vector. It is like a histogram
  • test.err Average quantization error for each code vector. It is useful to know the value of the cost function and the kernel width
  • test.0 to test.4 Each file is a list of the input data vectors assigned to each code vector (cluster)
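
The two-column files listed above (test.his and test.err) are easy to post-process. The following is a minimal sketch, assuming the two columns are separated by whitespace and that the first column is an integer code vector number. The helper name `parse_two_column` is hypothetical, and the sample contents are made up for illustration; they are not output from a real run.

```python
from typing import Callable, Dict, Union

Number = Union[int, float]


def parse_two_column(text: str, value_type: Callable[[str], Number] = float) -> Dict[int, Number]:
    """Read 'code_vector_number value' rows into a dictionary.

    Hypothetical helper; lines with fewer than two fields are skipped.
    """
    table: Dict[int, Number] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        table[int(fields[0])] = value_type(fields[1])
    return table


if __name__ == "__main__":
    # Made-up contents in the two-column layout described for the .his file.
    his_text = "0 20\n1 18\n2 25\n3 15\n4 15\n"
    histogram = parse_two_column(his_text, int)
    # Code vector that received the most input vectors.
    largest = max(histogram, key=histogram.get)
    print(largest, sum(histogram.values()))
```

The sum of the histogram counts should match the number of feature vectors reported in the .inf file, which makes a quick consistency check.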

--Main.AlfredoSolano - 24 Jan 2007
