Comments on Issam Laradji's blog post

Olivier Grisel (2014-06-25):

Just a quick comment: don't blindly trust the marketing speech from the ELM website. The following statements are clearly false, or at least exaggerated:

- "This scheme requires only few matrix operations, making it much faster than gradient descent." => Solving a penalized least-squares problem (ridge regression) in a machine learning context can sometimes be much faster with stochastic gradient descent, depending on the number of samples, the number of features, the regularization strength, and the sparsity and conditioning of the data. The analytical formulation of the ridge regression estimator does involve "only few matrix operations", but one of them is a matrix inversion, and in practice damped least-squares problems are solved with specialized solvers such as LSQR (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsqr.html), which is used internally by the sklearn.linear_model.Ridge estimator.

- "It also has a strong generalization power since it uses least-squares to find its solutions." => No loss function is uniformly better than any other. There is no guarantee that optimizing a classifier for squared error (in a one-vs-all multiclass setting) is actually better than the logistic or hinge losses in OvA, or than the cross-entropy loss. I would even say quite the contrary, although the choice of loss generally matters less than data preprocessing and hyperparameter tuning, such as the number of hidden nodes and the regularization strength of the classifier.

Issam Laradji (2014-06-27):

Hi @Olivier, you are absolutely right: there is no proof that least squares is always more beneficial and faster than SGD. As you said, more sophisticated SGD variants could be faster. These claims have simply been repeated so many times that I started taking them for granted. I believe the claimed advantages of ELM are meant to apply only in comparison to traditional gradient descent, the first algorithm used to train neural networks with backpropagation. Thanks.
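To make the solver trade-off from Olivier's first point concrete, here is a minimal sketch (not from the original comments) that fits the same toy ridge regression problem three ways: the closed-form normal equations, LSQR on the damped least-squares system (the approach sklearn's Ridge estimator can use internally), and a deliberately naive per-sample SGD loop. All variable names (`w_closed`, `w_lsqr`, `w_sgd`) and the toy data are assumptions for illustration; which approach is fastest depends on the problem size and conditioning, exactly as the comment says.

```python
# Sketch: three ways to solve the same ridge regression problem
#   min_w ||X w - y||^2 + alpha * ||w||^2
# on small synthetic data (all names and sizes here are illustrative).
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.RandomState(0)
n, d, alpha = 200, 10, 1.0
X = rng.randn(n, d)
w_true = rng.randn(d)
y = X @ w_true + 0.1 * rng.randn(n)

# 1. Closed form: "only few matrix operations", but one of them is a
#    (d x d) linear solve, which scales cubically in the feature count.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# 2. LSQR on the damped system: lsqr minimizes
#    ||X w - y||^2 + damp^2 * ||w||^2, so damp = sqrt(alpha).
w_lsqr = lsqr(X, y, damp=np.sqrt(alpha))[0]

# 3. Plain constant-step SGD on the same objective; more sophisticated
#    variants (averaging, adaptive steps) converge faster than this.
w_sgd = np.zeros(d)
lr = 0.01
for epoch in range(200):
    for i in rng.permutation(n):
        grad = (X[i] @ w_sgd - y[i]) * X[i] + (alpha / n) * w_sgd
        w_sgd -= lr * grad

# The exact solvers agree closely; plain SGD hovers near the same
# solution, up to stochastic noise from the constant step size.
print(np.allclose(w_closed, w_lsqr, atol=1e-4))
print(np.allclose(w_closed, w_sgd, atol=0.05))
```

On this tiny dense problem the closed form is trivially cheap; the point is that none of the three is uniformly fastest once the number of samples, features, or the sparsity of X changes.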