Sunday, 27 July 2014

(GSoC 2014) Progress report for 07/27/14

Great progress! I submitted implementations of multi-layer perceptron (mlp-link) (mlp-pretraining-link) and extreme learning machines (elm-link) and their documentations as well. Yet many improvements could be made through revisions and my mentors' momentous support that they always provided throughout the summer.

Besides many corrections, a lot have been added since the last post - here is an overview,

1)  Documentation

I  wrote, with the help of my mentor, a documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and the main hyperparameters it possesses.  It also explains tips on why adjusting those parameters are important since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial was given to help users set up and train ELM objects. Finally, a mathematical overview was given describing the function developed from training an ELM and the kind of algorithm it uses to solve the unknown coefficients in that function.

I believe the document can be made more necessarily comprehensive by adding details that describe other ELM parameters such as recursive least-square learning, and details that describe how different kernels affect the decision function. I plan to address these fixes before next week.

2) Example

I added an example illustrating the effect of weight_scale and C, two hyperparameters in ELM.

C is a regularization term that constrains the coefficients of the hidden-to-output weights
weight_scale scales the range of values that input-to-hidden weights can take.

Their effects are illustrated in Figure 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Figure 1: Effect of varying the regularization term C on variance.

Figure 2: Effect of varying the regularization term weight_scale on variance.

As shown, increasing the value of weight_scale or C makes for a more nonlinear decision function as you may notice the plots corresponding to higher values are of more curvy structure.

I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising as ELM achieved a poor performance of 17% error-rate with as many as 1000 hidden neurons. The training error is still high, which means higher number of hidden neurons will likely reduce the error-rate. But even with 2000 hidden neurons, the error-rate was only reduced to 16,8%. The reason is,  Covertype has 54 features, so a much larger representation (produced by the hidden neurons) of the dataset is not adding any significant information. Therefore, I will explore other parameters using grid search in the hopes to significantly reduce that error-rate.

Sunday, 13 July 2014

(GSoC 2014) Progress report for 07/13/14

I completed the requirements for GSoC 2014, except for the documentation which I am leaving for the remaining duration of the program. Since the mid-term evaluation I implemented the following,

1) Regularized and Weighted Extreme Learning Machines (ELMs);

2) Sequental Extreme Learning Machines;

3) Kernels for ELMs; and

4) Relevant test cases and examples.

I will explain the implementations in more detail below.

1) Regularized and Weighted ELMs

Assuming H is the hidden activations, $\beta$ is the hidden layer outgoing weights, and y is the target vector; regularized ELMs solves the following equation,

$\beta = (HH' + I/C)'Hy$
where I is the identity matrix.

The regularization term C determines the decision boundary degree of linearity. Figure 1 shows how regularization - or reducing C - leads to a linear function.

(Figure 1: non-regularized (left side) vs. regularized (right side) decision boundary)

Weighted ELMs is different from regularized ELMs, in that a diagonal weight matrix $W$ is added to the equation, yielding the following,

$\beta = (HWH' + I/C)'HWy$

Index $(i, i)$ in $W$ corresponds to how much weight is given to sample $i$, depending on the sample's class. This scheme is used to address the problem with imbalanced datasets; where a class is underrepresented by having few samples compared to other classes and therefore ignored by classifiers. Such minority classes are given higher weights such that the decision boundary is pushed away from them. Figure 2 shows the difference between applying weighting schemes for the minority class, the orange samples.

                   (Figure 2: no weights (left side); 0.618/(#samples) weight given to each class                                (middle side); 1000 weight cost given to the orange class (right side))

2) Sequential ELMs

Dealing with million sample datasets is problematic when they have to be in memory all at once for training. Sequential ELMs mitigates this limitation by breaking the dataset into batches and trains on them by per-batch basis using a recursive equation which is but a subtle representation of ELM's original equation. Unfortunately the implementation does not support weights yet.

3) Kernels for ELMs
The standard initialization of ELM input weights is the result of a random kernel. However, other kernels, which are best known for training SVMs, can be used to get new hidden activations - like Radial Basis Function, Linear kernel, and the Polynomial kernel.

For the remaining time of GSoC 2014, I will complete the ELMs documentation and add any necessary changes for the completed work.