## Sunday, 17 August 2014

### (GSoC 2014) Final Summary (Neural Networks)

GSoC 2014 has been an extraordinary experience. Not only did it encourage me to develop a much-needed open-source implementation of neural network algorithms, but it also exposed me to a great, diverse community. I also learned useful practices for maintaining clean, quality code and writing accessible documentation. This prepared me to work well and efficiently under pressure, since quality work had to be produced in a short period of time.

In terms of the requirements, the three algorithms mentioned in the proposal - (1) Multi-layer perceptron, (2) Multi-layer perceptron with pre-training, and (3) Extreme Learning Machines - are completed (see below for a comprehensive description). The specific requirements changed considerably in order to accommodate positive suggestions, especially for MLP and ELM. While a part of MLP was completed prior to the start of GSoC, the code went through a complete renovation, which made it faster, more readable, more scalable, and more robust. In fact, most of the work involved optimizing the speed of execution, improving readability (including proper naming and a convenient structure for the code base), and writing comprehensive documentation. The algorithms are explained in more detail below.

Acknowledgements
This wouldn't have been possible without the profound, sincere assistance of my mentor Olivier Grisel and the scikit-learn team, including Arnaud Joly, Gael Varoquaux, Kyle Kastner, Jnothman, Lars Buitinck, and many more. I sincerely thank the PSF team for emphasizing the importance of summarizing my work as blog posts here, and I greatly appreciate Google's significant support, which was instrumental in the successful completion of this project.

(1) Multi-layer perceptron (MLP) (link: #3204)
----------------------------------------------------------------------------
Figure 1: One hidden layer MLP

This implements the classic backpropagation algorithm supporting one or more hidden layers (see Figure 1). Depending on the problem type (classification or regression), backpropagation optimizes an objective function whose main purpose is to bring the predicted output as close to the target as possible, subject to constraints such as regularization.

The MLP supports L2 regularization, which controls the degree of overfitting. Increased regularization constrains the trained weights to smaller values, which makes the decision function more linear.

We also added the well-known rectified linear unit (ReLU) activation function, which is not only faster to compute but also allows training networks with more than one hidden layer more efficiently than the hyperbolic tan and logistic functions [1].
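One commonly cited reason for this difference is that the derivatives of the logistic and hyperbolic tan functions shrink towards zero for large inputs, while the ReLU derivative stays at 1 for any positive input. The following is a minimal, library-free sketch of the three functions and their derivatives; the function names are illustrative and are not taken from the scikit-learn pull request.

```python
import math

def logistic(x):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def d_logistic(x):
    """Derivative of the logistic function."""
    s = logistic(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Derivative of the hyperbolic tangent."""
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    """Derivative of ReLU (taken as 0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0

# The saturating functions have derivatives that shrink as the input grows,
# while the ReLU derivative stays at 1 for every positive input.
for x in (0.5, 3.0, 6.0):
    print(x, d_logistic(x), d_tanh(x), d_relu(x))
```

During backpropagation these derivatives are multiplied layer by layer, so small values compound as more hidden layers are added.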

Unit testing was made thorough: 99.17% of the statements are covered. A gradient unit test helped make sure the activation functions - hyperbolic tan, logistic, and ReLU - work as expected.

After the mid-term, much of the code was renovated. Many methods were combined to simplify the code and improve readability. Performance was improved by removing redundant calls and taking advantage of pre-allocated matrices, including the values of activation layers, gradients, and weights. Many private variables were removed, making pickling less error-prone and less dense.

MLP might benefit from a scheme known as pre-training, which is explained in section 2.

(2) Multi-layer perceptron with pre-training (link:
-------------------------------------------------------------------

Figure 2: Pre-training scheme using restricted Boltzmann machines.

Prior to running backpropagation, an unsupervised learner can provide the MLP with initial weights that might be better than randomized weights. The parameters optimized by the unsupervised learner - after training on the dataset - can be assigned to the MLP weight parameters as starting points.

The motivation is that these initial weights allow backpropagation to converge to a better local optimum than it otherwise would [2].

Figure 2 illustrates the scheme of using pre-training with a multi-layer perceptron. For each set of weights between two layers, a restricted Boltzmann machine (RBM) trains on the input data of the previous layer, and its final parameters are assigned to that set of weights in the larger multi-layer perceptron.

An example was set up to compare the performance of multi-layer perceptron (MLP) with and without pre-training using RBMs [3]. The MLP without pre-training had its parameters initialized from a scaled random distribution. For pre-training, an RBM trains on the digits dataset and the resulting parameters are given to the MLP as initial coefficient and intercept parameters. Below are the testing scores on the digits dataset [4],

Testing accuracy of mlp without pretraining: 0.967
Testing accuracy of mlp with pretraining: 0.978

However, pre-training does not always improve performance. On some occasions, especially when dealing with large training sets, it can even decrease the score.

(3) Extreme Learning Machines (link: #3306)
----------------------------------------------------

Figure 3: Neural network for ELM

The main focus after the mid-term evaluations was on developing extreme learning machines (ELMs). First we implemented the standard ELM algorithm, which optimizes an objective function using least-squares solutions.

An ELM has a network similar to a one-hidden-layer MLP, except that the output layer has no bias (see Figure 3). In essence, an ELM trains a network in three steps,

• it applies a random projection to the input space, onto a possibly higher dimensional space;
• the result passes through an element-wise non-linear activation function, typically a sigmoid such as the tanh or logistic function; and
• finally, it trains a linear one-vs-rest classifier or a multi-output ridge regression model.

The algorithm trains a single-hidden-layer feedforward network by computing the hidden layer values using randomized parameters, then solving for the output weights using least-squares solutions. For prediction, after computing the forward pass, the continuous output values pass through a gate function that converts them to integers representing classes. The function representing ELM is given as, $y=\beta\cdot f(W^T \cdot X + b )$

where matrices $X$ and $y$ represent the input samples and target values, respectively; matrices $W$ and $b$ are randomly generated based on a uniform distribution; matrix $\beta$ contains unknown variables; and $f(\cdot)$ is the non-linear, component-wise activation function.

ELM solves for $\beta$ using the ridge regression implementation, given as, $\beta = (H^T H + \frac{1}{C} I)^{-1} H^T y$ where $H = f(W^TX+b)$, $C$ is a regularization term which controls the linearity of the decision function, and $I$ is the identity matrix.
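The three training steps and the ridge solution can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code from the pull request: the names `elm_fit` and `elm_predict` are hypothetical, samples are stored as rows of `X` (so the hidden activations are computed as `tanh(X @ W + b)`), and the toy data is made up.

```python
import numpy as np

def elm_fit(X, Y, n_hidden=50, C=1.0, weight_scale=1.0, seed=0):
    """Fit a toy ELM; returns the random projection (W, b) and output weights beta."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Step 1: random projection with uniformly distributed weights and biases.
    W = rng.uniform(-weight_scale, weight_scale, size=(n_features, n_hidden))
    b = rng.uniform(-weight_scale, weight_scale, size=n_hidden)
    # Step 2: element-wise non-linear activation.
    H = np.tanh(X @ W + b)
    # Step 3: ridge solution beta = (H^T H + I / C)^-1 H^T Y.
    A = H.T @ H + np.eye(n_hidden) / C
    beta = np.linalg.solve(A, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass: hidden activations times the output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy two-class problem (illustrative data only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]                      # one-hot targets, one column per class

W, b, beta = elm_fit(X, Y, n_hidden=50, C=100.0)
# The "gate" here is an argmax over the continuous outputs.
pred = elm_predict(X, W, b, beta).argmax(axis=1)
train_accuracy = (pred == labels).mean()
print(train_accuracy)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; the pull request delegates this step to the existing ridge regression implementation.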
We demonstrated the effects of tuning two main hyperparameters,
• weight_scale, which controls the variance of the random projection weights; the higher the value, the weaker the regularization and therefore the greater the risk of overfitting.
• C, which controls the regularization strength of the output linear model; it regularizes the hidden-to-output weights in the same way as weight_scale regularizes the input-to-hidden weights.

and another main hyperparameter,
• n_hidden, which controls the number of hidden layer nodes.

Below are 3 plots that illustrate the effect of these parameters on score,

Figure 4: Effect of weight_scale on decision function.

Figure 5: Effect of C on decision function.

Figure 6: Effect of weight_scale and C on the scores against the Digits dataset.

Figures 4 and 5 show how increasing weight_scale and C, respectively, leads to a more non-linear decision function.

Figure 6 shows a colour map of the scores returned by grid search, illustrating that a balance between C and weight_scale is important for achieving a higher score. C=1.0 and weight_scale=10 achieved the highest score, as indicated by the darkest blue square.

We re-used the ridge regression [5] implementation for solving the least-squares problem, as it optimizes training speed for different data types. Next, we implemented the sequential ELM algorithm. It allows ELM to train on the dataset in batches while, interestingly, producing exactly the same end result as though the whole dataset were put into memory. However, decreasing the batch size can increase training time. Below is a benchmark showing the training time, in seconds, of ELMs with different batch sizes on a 10000-image MNIST dataset.

batch_size    50 hidden neurons    500 hidden neurons
------------------------------------------------------
None          0.32s                2.21s
10            0.71s                13.62s
100           0.33s                3.30s
1000          0.32s                2.44s

batch_size=None means that the whole dataset is put into memory. As shown in the benchmark, training with larger batch sizes improves speed but can cause a memory error if the batch size is too large. Nevertheless, by using batches the algorithm supports on-line learning, and can therefore update its solution as the data arrives in chunks.
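To see why batch training can reproduce the full-memory result, note that the two matrices the ridge solution needs, $H^T H$ and $H^T y$, are sums over samples and can therefore be accumulated one batch at a time. The sketch below illustrates that idea with NumPy; it is an assumption-laden illustration (hypothetical function names, precomputed hidden activations, random toy data) and not necessarily the recursive update used in the pull request.

```python
import numpy as np

def solve_full(H, y, C):
    """Ridge solution with all hidden activations in memory at once."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ y)

def solve_in_batches(H, y, C, batch_size):
    """Same solution, accumulating H^T H and H^T y one batch at a time."""
    n_hidden = H.shape[1]
    A = np.eye(n_hidden) / C                    # running H^T H + I / C
    B = np.zeros((n_hidden,) + y.shape[1:])     # running H^T y
    for start in range(0, H.shape[0], batch_size):
        Hb = H[start:start + batch_size]
        yb = y[start:start + batch_size]
        A += Hb.T @ Hb
        B += Hb.T @ yb
    return np.linalg.solve(A, B)

# Random stand-ins for hidden activations and targets (illustrative only).
rng = np.random.default_rng(0)
H = np.tanh(rng.normal(size=(1000, 20)))
y = rng.normal(size=(1000, 3))

beta_full = solve_full(H, y, C=1.0)
beta_batched = solve_in_batches(H, y, C=1.0, batch_size=100)
print(np.allclose(beta_full, beta_batched))
```

In a real sequential setting the hidden activations would be computed per batch as well, so only one batch of inputs and the two small accumulators need to be in memory. Smaller batches mean more loop iterations, which is consistent with the slower timings for batch_size=10 in the benchmark.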

Support for kernels was also added to ELM, but was later removed for the reasons given below. ELM originally transforms the input data into hidden activations whose dimension depends on the number of hidden neurons. Kernels, as in SVMs, similarly transform the input data into another space, but one in which hidden neurons play no role. Empirically, kernels were found to be slow and led to no accuracy improvement over the standard ELM. For that reason, and to avoid feature creep, we removed kernel support.

However, we added support for the ReLU, hyperbolic tan, and logistic activation functions. They were put in an external file so that they can be shared between different modules in scikit-learn.

Further, we updated another file [6] that is responsible for assigning class weights, which is useful for several algorithms that support weighted classification. We added a method that computes the weight corresponding to each sample as a vector, allowing ridge regression to run weighted least-squares solutions in the ELM.
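As a rough illustration of turning class weights into a per-sample weight vector, the sketch below uses an inverse-frequency heuristic. The heuristic and the function name `balanced_sample_weights` are assumptions made for this example; the post does not state which formula the updated scikit-learn file uses.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Return one weight per sample, inversely proportional to its class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Assumed heuristic: n_samples / (n_classes * class_count).
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[c] for c in labels]

labels = ["a", "a", "a", "b"]
weights = balanced_sample_weights(labels)
print(weights)  # the single "b" sample gets the largest weight
```

With this choice the weights sum to the number of samples, so the overall scale of the least-squares problem is unchanged while minority-class samples count for more.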

We also improved testing coverage. ELM has 100% code coverage, making it reliable. Tests were written to make sure that weighted ELM does improve results on imbalanced datasets; that a higher number of hidden neurons does improve the training score; and that the algorithm produces the same end result whether or not it runs in batches.

To conclude, this experience was special and useful in that it brought me closer to the scikit-learn community and other open-source communities. It also encouraged me to satisfy my long-held ambition of implementing useful algorithms and writing accessible documentation for any user who wishes to delve into the world of neural networks.

I sincerely look forward to continuing to work with the scikit-learn team in the years to come, and to participating in GSoC 2015, either as a mentor or as a student.

References
--------------

[1] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 2013.

[2] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

[4] http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html

[5] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

------------------

1) Multi-layer perceptron: https://github.com/scikit-learn/scikit-learn/pull/3204

2) Multi-layer perceptron with pre-training: https://github.com/scikit-learn/scikit-learn/pull/3281

3) Extreme learning machines: https://github.com/scikit-learn/scikit-learn/pull/3306

## Sunday, 27 July 2014

### (GSoC 2014) Progress report for 07/27/14

Great progress! I submitted implementations of multi-layer perceptron (mlp-link) (mlp-pretraining-link) and extreme learning machines (elm-link), along with their documentation. Many improvements can still be made through revisions and with the invaluable support my mentors have provided throughout the summer.

Besides many corrections, a lot has been added since the last post - here is an overview,

1) Documentation

I wrote, with the help of my mentor, documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and its main hyperparameters. It also gives tips on why adjusting those parameters is important, since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial helps users set up and train ELM objects. Finally, a mathematical overview describes the function obtained from training an ELM and the kind of algorithm it uses to solve for the unknown coefficients in that function.

I believe the document can be made more comprehensive by adding details on other ELM features, such as recursive least-squares learning, and on how different kernels affect the decision function. I plan to address these before next week.

2) Example

I added an example illustrating the effect of weight_scale and C, two hyperparameters in ELM.

• C is a regularization term that constrains the coefficients of the hidden-to-output weights.
• weight_scale scales the range of values that the input-to-hidden weights can take.

Their effects are illustrated in Figures 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Figure 1: Effect of varying the regularization term C on variance.

Figure 2: Effect of varying the regularization term weight_scale on variance.

As shown, increasing the value of weight_scale or C makes for a more non-linear decision function; the plots corresponding to higher values have a more curved structure.

I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising: ELM achieved a poor 17% error rate with as many as 1000 hidden neurons. The training error is still high, which suggested that a higher number of hidden neurons would reduce the error rate. But even with 2000 hidden neurons, the error rate only fell to 16.8%. The reason is that Covertype has 54 features, so a much larger representation of the dataset (produced by the hidden neurons) does not add any significant information. Therefore, I will explore other parameters using grid search in the hope of significantly reducing that error rate.

## Sunday, 13 July 2014

### (GSoC 2014) Progress report for 07/13/14

I completed the requirements for GSoC 2014, except for the documentation which I am leaving for the remaining duration of the program. Since the mid-term evaluation I implemented the following,

1) Regularized and Weighted Extreme Learning Machines (ELMs);

2) Sequential Extreme Learning Machines;

3) Kernels for ELMs; and

4) Relevant test cases and examples.

I will explain the implementations in more detail below.

1) Regularized and Weighted ELMs

Assuming $H$ is the matrix of hidden activations, $\beta$ is the matrix of hidden-layer outgoing weights, and $y$ is the target vector, regularized ELMs solve the following equation,

$\beta = (HH^T + I/C)^{-1}Hy$
where $I$ is the identity matrix, $H^T$ is the transpose of $H$ (which has one column per training sample), and $^{-1}$ denotes the matrix inverse.

The regularization term C determines how linear the decision boundary is. Figure 1 shows how regularization, that is, reducing C, leads to a more linear function.

(Figure 1: non-regularized (left side) vs. regularized (right side) decision boundary)

Weighted ELMs differ from regularized ELMs in that a diagonal weight matrix $W$ is added to the equation, yielding the following,

$\beta = (HWH^T + I/C)^{-1}HWy$

Entry $(i, i)$ of $W$ corresponds to how much weight is given to sample $i$, depending on the sample's class. This scheme is used to address the problem of imbalanced datasets, where a class is underrepresented by having few samples compared to other classes and is therefore ignored by classifiers. Such minority classes are given higher weights so that the decision boundary is pushed away from them. Figure 2 shows the effect of applying different weighting schemes to the minority class, the orange samples.

(Figure 2: no weights (left side); 0.618/(#samples) weight given to each class (middle); 1000 weight cost given to the orange class (right side))
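The weighted solution can be sketched with NumPy as follows. The sketch rests on two assumptions about the equation above: the outer operation is a matrix inverse, and $H$ holds one column per sample. The function name `weighted_elm_solve` and the random data are purely illustrative.

```python
import numpy as np

def weighted_elm_solve(H, y, sample_weight, C):
    """Solve beta = (H W H^T + I / C)^-1 H W y for a diagonal W.

    H has shape (n_hidden, n_samples): one column per sample (assumed layout).
    sample_weight holds the diagonal of W, one entry per sample.
    """
    n_hidden = H.shape[0]
    HW = H * sample_weight        # scales column i by w_i, same as H @ diag(w)
    return np.linalg.solve(HW @ H.T + np.eye(n_hidden) / C, HW @ y)

# Random stand-ins for hidden activations and targets (illustrative only).
rng = np.random.default_rng(1)
H = np.tanh(rng.normal(size=(5, 40)))
y = rng.normal(size=40)

# Give the first four samples (a pretend minority class) ten times more weight.
w = np.ones(40)
w[:4] = 10.0

beta_weighted = weighted_elm_solve(H, y, w, C=1.0)
beta_plain = weighted_elm_solve(H, y, np.ones(40), C=1.0)

# Cross-check against the formula written with an explicit diagonal matrix.
W_diag = np.diag(w)
beta_explicit = np.linalg.solve(H @ W_diag @ H.T + np.eye(5), H @ W_diag @ y)
print(np.allclose(beta_weighted, beta_explicit))
```

Multiplying the columns of `H` by the weights avoids building the full n_samples by n_samples diagonal matrix, which matters once the dataset is large.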

2) Sequential ELMs

Dealing with million-sample datasets is problematic when they have to be in memory all at once for training. Sequential ELMs mitigate this limitation by breaking the dataset into batches and training on them batch by batch, using a recursive equation that is simply a reformulation of ELM's original equation. Unfortunately, the implementation does not support weights yet.
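One way to see why a batch-wise reformulation is possible is sketched below. It writes $H'$ for the transpose as in the equations of this post, reads the outer operation of the regularized equation as a matrix inverse, and assumes $H$ is split column-wise into $K$ batches; the actual recursion used in the implementation may be arranged differently.

```latex
% Assumption: H = [H_1, H_2, ..., H_K] is split column-wise into K batches,
% and y is split into the matching blocks y_1, ..., y_K.
H H' = \sum_{k=1}^{K} H_k H_k', \qquad H y = \sum_{k=1}^{K} H_k y_k
\quad\Longrightarrow\quad
\beta = \Bigl(\sum_{k=1}^{K} H_k H_k' + I/C\Bigr)^{-1} \sum_{k=1}^{K} H_k y_k
```

Both sums can be updated as each batch arrives, so only one batch needs to be held in memory at a time.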

3) Kernels for ELMs
The standard initialization of ELM input weights is the result of a random kernel. However, other kernels, which are best known for training SVMs, can be used to get new hidden activations, such as the radial basis function, linear, and polynomial kernels.

For the remaining time of GSoC 2014, I will complete the ELMs documentation and add any necessary changes for the completed work.

## Sunday, 22 June 2014

### Mid-term Summary
GSoC 2014: Extending Neural Networks Module for Scikit-Learn

The objective is to implement neural network algorithms in clean, well-tested code using the scikit-learn API. The algorithms are meant to be user-friendly and easy to edit and scale for those who wish to extend or debug them.

Since the start of GSoC 2014 until now, I completed two modules, multi-layer perceptron (mlp) #3204 and mlp with pre-training #3281, which are pending final review for merging. I also implemented the extreme learning machine (elm) algorithm #3306, which hasn't been reviewed yet and still requires more components, such as test files, examples, and documentation. However, I am confident that I will complete it by the deadline I set in the proposal.

In the following three sections, I will explain the modules in more detail.

1) Multi-layer perceptron #3204
Multi-layer perceptron (MLP) supports more than one hidden layer, allowing it to construct highly non-linear functions. Figure 1 displays an MLP with one hidden layer.

Figure 1

To define the number of hidden layers and their neurons, one can simply run the following statement.

The list '[150, 100]' means that two hidden layers are constructed, with 150 and 100 neurons respectively.
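The statement referred to above is not reproduced in this post; judging from the Week 3 post, such a list is passed as the n_hidden argument of MultilayerPerceptronClassifier. As a library-free illustration of what the list implies for the network, the hypothetical helper below derives the shape of each weight matrix. The 64 inputs and 10 outputs are illustrative values, not taken from the post.

```python
def weight_shapes(n_features, n_hidden, n_outputs):
    """Return the (inputs, outputs) shape of each weight matrix in the network."""
    sizes = [n_features] + list(n_hidden) + [n_outputs]
    # Consecutive layer sizes are paired: input->hidden1, hidden1->hidden2, ...
    return list(zip(sizes[:-1], sizes[1:]))

shapes = weight_shapes(n_features=64, n_hidden=[150, 100], n_outputs=10)
print(shapes)  # [(64, 150), (150, 100), (100, 10)]
```

Two hidden layers therefore mean three weight matrices, one between each pair of consecutive layers.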

Further, MLP can be used for reinforcement learning, where each time step produces a new training sample. It can use the partial_fit method to update its weights on a per-sample basis in real time (stochastic update).

MLP also has a regularization term alpha among its parameters, whose value determines the degree of non-linearity the function is meant to have. Therefore, if the algorithm is overfitting, it is desirable to increase alpha to obtain a more linear function. Figure 2 demonstrates the decision boundaries learnt by MLP with different alpha values.

Figure 2

Figure 2 shows that the higher the value of alpha, the fewer curves the decision boundary will have.

The implementation has passed various test cases designed to prevent unexpected behavior. One of the test cases compares the algorithm's analytic computation of the gradient with its numerical computation. Since the difference between the two was found to be at most a very small value, the backpropagation algorithm is working as expected.
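The idea behind such a gradient check can be shown on a much smaller model. The sketch below uses a single logistic unit rather than a full MLP, with made-up weights and inputs, and compares the analytic gradient of the loss with a central finite-difference estimate. It illustrates the technique only and is not the test from the pull request.

```python
import math

def loss(w, x, t):
    """Logistic (cross-entropy) loss of one sample for a single logistic unit."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def analytic_grad(w, x, t):
    """Gradient of the loss with respect to w: (p - t) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - t) * xi for xi in x]

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

# Made-up weights, input, and target for the check.
w = [0.3, -0.2, 0.1]
x = [1.0, 2.0, -1.5]
t = 1

g_analytic = analytic_grad(w, x, t)
g_numeric = numerical_grad(lambda v: loss(v, x, t), w)
max_diff = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(max_diff)
```

If the analytic gradient contained a bug, such as a wrong sign or a missing factor, the difference would be of the same order as the gradient itself rather than close to zero.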

2) MLP with pre-training #3281

One issue with MLP is that it involves random weight initialization. The weights could land in a poor position in the optimization landscape (see Figure 3), from which the final solutions are not as good as they could be.

Figure 3

Pre-training is one scheme for giving the initial weights a better starting point. Restricted Boltzmann machines (RBMs) can find such initial weights. Figure 4 shows the process of pre-training.

Figure 4
For each layer in the network, there is an RBM that trains on the inputs given for that layer. The final weights of the RBM are given as the initial weights of the corresponding layer in the network.

Running an example of pre-training has shown that RBMs can improve the final performance. For instance, on the digits dataset, the following results were obtained.

1) Testing accuracy of mlp without pretraining: 0.964
2) Testing accuracy of mlp with pretraining: 0.978

3) Extreme learning machine (elm) #3306

Much of the criticism of MLP concerns its long training time. MLP uses the slow gradient descent to update its weights iteratively, involving many demanding computations.

Extreme learning machines (ELMs) [1], on the other hand, can train single hidden layer feedforward networks (SLFNs) using least-squares solutions instead of gradient descent. This scheme requires only a few matrix operations, making it much faster than gradient descent. It also has strong generalization power since it uses least-squares to find its solutions.

The algorithm has been implemented and it passed the Travis tests. But it still awaits a more thorough review and test files to anticipate errors.

I believe I will finalize the module by 29 June as per the proposal.

Remaining work

In the remaining weeks my tasks are broken down as follows.

Week 7, 8 (June 30 - July 13)

I will implement and revise regularized ELMs [3] and weighted ELMs [4], and extend the ELMs documentation.

Week 9, 10 (July 14 - July 27)

I will implement and revise Sequential ELMs [2], and extend the ELMs documentation.

Week 11, 12 (July 28 - August 10)

I will implement and revise Kernel-Based ELMs, and extend the ELMs documentation.

Week 13 - Wrap-up

Acknowledgements

I would like to thank my mentors and reviewers, including @ogrisel, @larsmans, @jnothman, @kasternkyle, and @AlexanderFabisch, for dedicating their time to providing useful feedback and comments, making sure the work meets high-quality standards. I sincerely appreciate the time PSF admins take to oversee the contributors, as it encourages us to set a higher bar for quality work. I would also like to thank GSoC 2014, as this wouldn't have been possible if it hadn't been for their support and motivation.

References

[1]    http://www.di.unito.it/~cancelli/retineu11_12/ELM-NC-2006.pdf

[2]    http://www.ntu.edu.sg/home/egbhuang/pdf/OS-ELM-TNN.pdf

[3]    http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-Unified-Learning.pdf

[4]   Zong, Weiwei, Guang-Bin Huang, and Yiqiang Chen. "Weighted extreme learning machine for imbalance learning." Neurocomputing 101 (2013): 229-242.

## Thursday, 12 June 2014

### (Week 3) GSoC 2014: Extending Neural Networks Module for Scikit learn

This week, with the help of many reviewers, I completed a user-friendly multi-layer perceptron algorithm in Python. While it is still a pull request, the algorithm can be downloaded by following these steps,

1) git clone https://github.com/scikit-learn/scikit-learn
2) cd scikit-learn/
4) git checkout mlp

Creating an MLP classifier is easy. First, import the scikit-learn library; then, initialize an MLP classifier by executing these statements,

from sklearn.neural_network import MultilayerPerceptronClassifier
clf = MultilayerPerceptronClassifier()

If you'd like to have 3 hidden layers of sizes 150-100-50, create an MLP object using this statement,

clf = MultilayerPerceptronClassifier(n_hidden=[150, 100, 50])

Training and testing are done the same way as any learning algorithm in scikit-learn.

In addition, you can tune many parameters for your classifier. Some of the interesting ones are,

1) the algorithm parameter, which allows users to select the type of algorithm for optimizing the neural network weights, either stochastic gradient descent (SGD) or l-bfgs; and

2) the max_iter parameter, which allows users to set the maximum number of iterations the network can run.

After training the classifier, you can view the minimum cost achieved by printing clf.cost_. This gives an idea of how well the algorithm has trained.

The implementation passed high-standard tests and achieved the expected performance on the MNIST dataset. MLP with one hidden layer of 200 neurons and 400 iterations achieved great results in the MNIST benchmark compared to the other algorithms shown below,

Classification performance
=======================================================================
Classifier                 train-time    test-time    error-rate
-----------------------------------------------------------------------
Multi layer Perceptron     655.5s        0.30s        0.0169
nystroem_approx_svm        125.0s        0.91s        0.0239
ExtraTrees                 79.9s         0.34s        0.0272
fourier_approx_svm         148.9s        0.60s        0.0488
LogisticRegression         68.9s         0.14s        0.0799

For next week, I will implement extreme learning machine algorithms. These use a least-squares approach for training neural networks, and are therefore both quick and efficient, with interesting applications.

## Sunday, 1 June 2014

### (Week 2) GSoC 2014: Extending Neural Networks Module for Scikit learn

I apologize for not posting my progress in GSoC 2014 for the past week. I was not aware of the weekly blog post requirement. However, I have been posting my progress on GitHub: https://github.com/scikit-learn/scikit-learn/pull/3204

From this week onward, I will announce my weekly progress in detail.

To get an overview of what I have done so far,

1) I fixed the issues with hyperbolic tan activation function for the multi-layer perceptron pull-request I am implementing for scikit-learn; and

2) I extended multi-layer perceptron to support more than one hidden layer with additional unit tests to make sure the algorithm runs correctly.

While I was expected to finish the documentation for multi-layer perceptron (MLP) in the first week, my mentor and I decided to first extend it to support more than one hidden layer.

For next week, I will complete the documentation for the extended MLP as well as address any comments my mentors will issue for the implementation.

I am saving the detailed description of the algorithm for the documentation. Once completed, I will post the documentation here in full.

Thank you.