Great progress! I submitted implementations of multi-layer perceptron (mlp-link) (mlp-pretraining-link) and extreme learning machines (elm-link) and their documentations as well. Yet many improvements could be made through revisions and my mentors' momentous support that they always provided throughout the summer.

Besides many corrections, a lot have been added since the last post - here is an overview,

1)

I wrote, with the help of my mentor, a documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and the main hyperparameters it possesses. It also explains tips on why adjusting those parameters are important since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial was given to help users set up and train ELM objects. Finally, a mathematical overview was given describing the function developed from training an ELM and the kind of algorithm it uses to solve the unknown coefficients in that function.

I believe the document can be made more necessarily comprehensive by adding details that describe other ELM parameters such as recursive least-square learning, and details that describe how different kernels affect the decision function. I plan to address these fixes before next week.

2)

Their effects are illustrated in Figure 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Besides many corrections, a lot have been added since the last post - here is an overview,

1)

**Documentation**I wrote, with the help of my mentor, a documentation (link) on extreme learning machines (ELM), which briefly describes ELM's scheme and the main hyperparameters it possesses. It also explains tips on why adjusting those parameters are important since noisy, small datasets need a different approach than large, clean datasets. Further, a brief tutorial was given to help users set up and train ELM objects. Finally, a mathematical overview was given describing the function developed from training an ELM and the kind of algorithm it uses to solve the unknown coefficients in that function.

I believe the document can be made more necessarily comprehensive by adding details that describe other ELM parameters such as recursive least-square learning, and details that describe how different kernels affect the decision function. I plan to address these fixes before next week.

2)

**Example****I added an example illustrating the effect of weight_scale and C, two hyperparameters in ELM.**

**C**is a regularization term that constrains the coefficients of the hidden-to-output weights**weight_scale**scales the range of values that input-to-hidden weights can take.Their effects are illustrated in Figure 1 and 2, where the value of the chosen parameter is given above the corresponding decision function.

Figure 1: Effect of varying the regularization term

**C**on variance.
Figure 2: Effect of varying the regularization term

**weight_scale**on variance.
As shown, increasing the value of

I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising as ELM achieved a poor performance of 17% error-rate with as many as 1000 hidden neurons. The training error is still high, which means higher number of hidden neurons will likely reduce the error-rate. But even with 2000 hidden neurons, the error-rate was only reduced to 16,8%. The reason is, Covertype has 54 features, so a much larger representation (produced by the hidden neurons) of the dataset is not adding any significant information. Therefore, I will explore other parameters using grid search in the hopes to significantly reduce that error-rate.

**weight_scale**or**C**makes for a more nonlinear decision function as you may notice the plots corresponding to higher values are of more curvy structure.I am currently running ELM on the Covertype dataset (link). The results, however, aren't yet promising as ELM achieved a poor performance of 17% error-rate with as many as 1000 hidden neurons. The training error is still high, which means higher number of hidden neurons will likely reduce the error-rate. But even with 2000 hidden neurons, the error-rate was only reduced to 16,8%. The reason is, Covertype has 54 features, so a much larger representation (produced by the hidden neurons) of the dataset is not adding any significant information. Therefore, I will explore other parameters using grid search in the hopes to significantly reduce that error-rate.