About GAIA GAIA Case Studies Global GIS Agenda 21 Country Data Model Database

Supervised learning is used for data sets with cases having known outcomes; this type of learning is the more common form because data is usually collected with outcome in mind. On the other hand, unsupervised learning is more appropriate when the user does not know the sub-divisions into the which the data samples, using relevant predictor features, should be divided. That might be the case for a new problem for the user has little experience. Using a computer program for developing rules based upon a series of the unknown input patterns.

Self-Organizing Learning Algorithm

The original ideas behind the self-organizing networks algorithm were presented by Teuvo Kohonen (1988). This algorithm was originally developed for speech recognition. The self-organizing map algorithm is described briefly as follows:

Step 1: The adaptive weights are randomly initialized if the training is a new start, and loaded if the network has been trained before.

Step 2: Input vector Xi is presented at input nodes.

Xi = {x1, x2, ..., xn} for i = 1, 2, ..., n

Step 3: For each neuron compute the output by using Euclidean distance function

for j=1, 2, ..., p

Step 4: Select the minimum criterion among all outputs

Step 5: The updated weights are computed only for matching neurons:

Wij(t+1) = Wij(t) + ((t)[(Xi(t) - Wij(t)]

This updates the weights for the jth cluster.

Step 6: The algorithm is terminated if there are no further changes to weight vectors (((0). Otherwise returns to Step 2.

Functions of the Training Algorithm

The training algorithm consists of folowing functions:

Function 1 - Initialize Training Parameters

  • The weight is initialized using random numbers.

  • The neighborhood function type is step function (bubble or guassian).

  • The input vectors are submitted to the network random or sequential.Every element in the input vector Xi is presented as Xi = [xi1,xi2,..,xin], where xi1,xi2,..,xin the attributes of data vectors to be simulated for finding the relationship of clusters.

  • The size of the map is n*n which shows the number of neurons. The lattice type can be hexagonal or rectangular. The output vectors are got by a mapping between the input and the weight matrix. The mapping neuron denotes the clustering of data in the database according to its attributes.

  • Training Constant (() is an initial radius of the training rate parameter. Decreases linearly to zero during training. The linear function is defined as :

    ((t) = ((0)(1.0 - t/rlen)

    where :

    (t) is training constant ( is a very small value between 0 and 1)
    (0) is initial value . Initially, 0=0.05 and it will be reduced approximately to zero when the training process is completed.
    rlen is running length (number of iterations)
    t is number of iteration in step t

    Function 2 - Reading the input vectors

    The input vector which is initialized is read into the training network randomly or sequentially. The input vectors should be normalized. This normalization is done to stop the larger values from dominating the vector (Kohonen and Ritter, 1989)

    The Algorithm for Normalization is as follows:
    Algorithm Normalize

    Sum_Square_Vector = Sum of Squares of Vector Components
    Euclidean Norm = Square Root (Sum_Square_Vector)
    Unit_length_Vector = (Component / Length)
    End Normalize

    Function 3 - Weight vector initialization

    The weight vectors are randomly initialized when the network is trained for the first time. The weight vectors are random values between 0 and 1.

    The Algorithm for Weight Initialization is as follows:

    Algorithm Weight_Initialize
    For the complete output vector i (from 1 to output pattern dimension)
    For the complete input vector j (from 1 to feature space dimension)
    Weight_vector, Wij = {Randomize between 0 and 1}
    End Weight_Initialize.

    Function 4 - Self-organize network training

    The input vector is randomly or sequentially drawn from the input vector matrix and presented to the network one by one. The input vector comes one by one, the winner is found and the weights of the neighborhood are adapted. As the training progresses, the neighborhood are linearly decreased till the index of order approaches zero or the training constant approaches zero (=0). The index of order, D is the mean square distance between the input vectors and the weight vectors.

    D=(1/|I|)*(SUM|X-W|**2) where I is the number of input patterns.

    The Algorithm for Self Organization is as follows:

    Algorithm Self Organization
    For 1 = Number of Iterations before reducing the Neighborhood
    Take one input
    Procedure Winner_select
    Procedure Adaptation_neighbourhood
    Procedure Index_calculation
    End For
    End Self Organization

    The Winner Unit is selected as follows:
    Algorithm Winner_Select
    Compute Euclidean distance between Weight vector and Input Vector
    for each Output Unit
    Winner Node = Weight Vector which has minimum Euclidean
    Distance to the Input Vector
    Euclidean Distance = (Sum(Weights-Input)2)
    End Winner_select

    The Weights are adapted as follows:
    Algorithm Adaptation_Neighbourhood
    For each Unit Within the Neighborhood
    Each weight vector changes by a fraction of its difference from the
    Input Vector.
    Hence Adapt the Weight
    End Adaptation_Neighbourhood.

    The Index of Order is Calculated as follows:
    Here D is The Index of Order Algorithm Index_Calculation
    D = Square distance ((Input vector - Average weight Vector) of mapped Unit)
    End Index_Calculation

    Function 5 - Produce the Feature Map

    The labeling of output neuron is done using Algorithm Label_Output_Space.

    Algorithm Label_Output_Space
    For the Complete Input Vector
    Find the Winning Node
    Label the Winning node with the Input Identifier from the Feature Matrix
    End Label_Output_Space

    Once the labeling is over, the output map can be drawn.

    Function 6 - Conclude the training

    Before the feature map is produced, the input vectors should have undergone some iterations of training. The number of iterations passed depends on the size or the output map. This parameter has to be checked. This is done using Algorithm Iterations_Check

    Algorithm Iterations_Check
    For 1 to Number of Iterations till Iterations becomes Zero
    Procedure Self_Organize
    Produce the Temporary Feature Map
    Decrement Number of Iterations
    Get the Final Feature Map
    End Iterations_Check

    The training is stopped after the training constant linearly decreases to zero and the neighborhood radius also reduces to zero.

    Function 7 - Evaluation of quantization error

    When the entries in the map have been trained to their final values, the resulting quantization error can be evaluates. The training Algorithm Q_Error is used for this purpose. It is possible to compute a weighted quantization error (hci||x - wi || for each input sample and average these over the data set. The average mismatch error value will increase if the output map size is deceased. So, a bigger map can show more distinct clusters and hence is advantageous. However, as the size of output map is increased, the number of iterations is also automatically increased and the time taken to perform the training will be more.

    Function 8 - U-matrix algorithm


    : Take the n*n*m weight matrix
    For all neurons (i=1 ..n, j=1..n) do
    calculate vector difference to all eight neighboring neurons on the map to
    [(i-1, j-1), (i-1, j), (i+1, j), (i, j-1), (i, j+1), (i+1, j-1), (i+1, j), (i+1, j+1)]
    sum it up
    divide through eight
    store the result
    get a n*n*1 matrix


    n is the size of the topological map
    m is number of attributes

    Using the Training Programs

    The software package contains all programs necessary for the correct application of the Self-Organizing Map (SOM) algorithm in the visualization of complex experimental data (Kohonen, 1989). Usually the subprograms contained in this package are run separately and directly from the console using command lines defined in the stages of using command lines. The programs must be run in the correct order: first initialization, then training, and then tests; and that the correct parameters are given. The package includes programs planes to encapsulated PostScript (.ps) images and program umat to compute so-called u-matrix visualization (Ultsch, 1993) of the SOM reference vectors.

    The programs require various parameters: file names, learning parameters, sizes of maps, etc. All these must be given to the program in the beginning; the programs are not interactive in the sense that they do not ask for any parameters during their running. The parameters can be given in any order in the commands. All the parameters that are required by any program in this package are listed below.

    Name of the input data file.
    Name of the output data file.
    Name of the file from which the reference vectors are read.
    Name of the file to which the reference vectors are stored.
    Running length (number of steps) in training.
    Initial learning rate parameter. Decreases linearly
    to zero during training.
    Initial radius of training area in SOM algorithm.
    Decreases linearly to one during learning.
    Number of units in the x-direction.
    Number of units in the y-direction.
    The topology type used in map. Possible choices are hexagonal
    lattice (hexa) and rectangular lattice (rect).
    The neighborhood function type used. Possible choices are step
    function (bubble) and Gaussian (gaussian).
    The component plane of the reference vectors that displayed in the conversion routines.
    Parameter that defines whether a new seed for the random-number
    generator is defined.

    In the initialization and training programs the random-number generator is used to select the order of training samples, etc. The parameter -rand defines whether a new seed for the random number generator is given; when any number than zero is given, that number is used as the seed, otherwise the seed is read from the system clock. The default value is zero.

    Using the Command Lines

    First Stage: Map Initialization

    The reference vectors of the map are first initialized to tentative values. The lattice type of the map and the neighborhood function used in the training procedures are also defined in the initialization.

    randinit - This program initializes the reference vectors to random values. The vector components are set to random values that are evenly distributed in the area of corresponding data vector components. The size of the map is given by defining the x-dimension (-xdim) and the y-dimension (-ydim) of the map. The topology of the map is defined with option (-topol) and is either hexagonal (hexa) or rectangular (rect). The neighborhood function is defined with option (-neigh) and either step function (bubble) or Guassian (guassian). The map size is here 12 time 8 units.

    > randinit -din file.dat -cout file.cod -xdim 12 -ydim 8 -topol hexa -neigh bubble -rand 123

    Now the map has been initialized.

    Second Stage: Map Training

    The map is trained by the self-organizing map algorithm using the program vsom. Training is done in two phases:

    First is the ordering phase during which the reference vectors of the map units are ordered. In the beginning the neighborhood radius is taken almost equal to the diameter of the map and decreases to one during training, while the learning rate decreases to zero.

    > vsom -din file.dat -cin file.cod -cout file.cod -rlen 1000 -alpha 0.05 -radius 10

    During the second phase the reference vectors in each unit converge to their correct value. The second phase is usually longer than the first one. The learning rate is thereby smaller. The radius is also smaller on the average: in the beginning the units up to a distance of three are covered. In this example the training time of the second phase is ten times longer than in the first phase.

    > vsom -din file.dat -cin file.cod -cout file.cod -rlen 10000 -alpha 0.02 -radius 3

    After these two phases of training the map is ready to be tested and to be used in monitoring applications.

    Third Stage: Evaluation of the Quantization Error

    When the entries in the map have been trained to their final values, the resulting quantization error can be evaluated. The training file is used for this purpose. The program qerror is used to evaluate the average quantization error.

    > qerror -din file.dat -cin file.cod

    This program computes the quantization error over all the samples in the data file. The average quantization error with the training set is also calculated. For each input sample vector the best-matching unit in the map is searched for and the average of respective quantization errors is returned.

    Fourth Stage: Map Visualization

    The trained map can be used for visualization of data samples. The visualization programs make an image of the map and plot the trajectory of the best-matching units versus time upon it. Before visualization, the map units are calibrated using known input data samples.

    vcal - This program labels the map units according to the samples in the input data file. The best-matching unit in the map corresponding to each data vector is searched for. The map units are then labeled according to the majority of labels hitting a particular map unit. The units that get no hits are left unlabeled.

    > vcal -din file.dat -cin file.cod -cout file.cod

    After calibration some of the units in the map have labels showing an area in the map which corresponds to fatal states.

    Fifth Stage: Monitoring Programs

    visual - This program generates a list of coordinates corresponding to the best-matching unit in the map for each data sample in the data file. It also gives the individual quantization errors made and the class labels of the best matching units if the latter have been defined. The program will store the three dimensional image points (Coordinate values and the quantization error) in a similar fashion as the input data entries are stored.

    > visual -din file.dat -cin file.cod -dout file.vis

    planes - This program generates an encapsulated postscript code from one selected component plane of the map imaging the values of the components using gray levels.

    > planes -cin file.cod -plane 1 [-din file.dat] -ps 1

    The ps files are named using the map base name and adding _p1.ps. If the input data file is also given, the trajectory formed of the best-matching units is also converge to a separate file. The trajectory file is named accordingly adding _tr.ps to the base name.

    umat - This program generates an encapsulated postscript code to visualize the distance between reference vectors of neighboring map units using gray levels.

    > umat -cin file.cod [-average 1] -ps 1> file.ps

    If -average option is given the gray levels of the image are spatially filtered by averaging, and if -median option is given the median filtering is used.

  • Copyright 1995-2002 by:   ESS   Environmental Software and Services GmbH AUSTRIA