Reading the paper “Large-Scale Machine Learning with Stochastic Gradient Descent”

Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent


This is GD (Gradient Descent), which is used to compute the weights of a neural network (and is also used by other machine learning algorithms). zi represents example ‘i’, that is, the pair (xi, yi). After processing all examples, we need to average the derivatives with respect to the weights. Going through every example for each update is a slow process, so we can imagine that GD is not efficient enough.
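As a minimal sketch of that averaging step, here is batch GD for a one-parameter linear model y ≈ w·x with squared loss. The model, the toy data and the learning rate are assumptions chosen for illustration, not something taken from the paper.

```python
# Batch gradient descent for a 1-D linear model y ≈ w * x with squared loss.
# The data and hyperparameters below are made up for illustration.

def gd_step(w, examples, lr):
    """One GD step: average the gradient over ALL examples, then update w."""
    grad_sum = 0.0
    for x, y in examples:                # z_i = (x_i, y_i)
        grad_sum += (w * x - y) * x      # d/dw of 0.5 * (w*x - y)^2
    return w - lr * grad_sum / len(examples)

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
w = 0.0
for _ in range(200):
    w = gd_step(w, examples, lr=0.1)
print(w)  # approaches 2.0
```

Every single update loops over the whole dataset, which is exactly the cost that becomes painful with many examples.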


Here comes SGD, which uses only one example to compute the gradient at each step. It is simpler, and each update is much cheaper.
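A minimal sketch of the same idea in code, assuming a one-parameter linear model y ≈ w·x with squared loss. The toy data, the seed and the learning rate are illustrative assumptions.

```python
import random

def sgd_step(w, example, lr):
    """One SGD step: use the gradient of a SINGLE example z = (x, y)."""
    x, y = example
    return w - lr * (w * x - y) * x      # d/dw of 0.5 * (w*x - y)^2

random.seed(0)
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
w = 0.0
for _ in range(200):
    w = sgd_step(w, random.choice(examples), lr=0.05)
print(w)  # approaches 2.0
```

The toy data is noise-free, so the iterates settle on the exact answer. With noisy data a constant learning rate would keep them jittering around it.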


Using SGD in the K-means clustering algorithm seemed counterintuitive to me at first glance. But after thinking of it as “sample zi belongs to the cluster of wk, so don’t wait for all samples, just update wk with zi”, it becomes conceivable.
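That “update wk as soon as zi arrives” rule can be sketched as follows. This assumes 1-D samples and a per-cluster step size of 1/nk (the number of samples seen so far by cluster k), which I believe is close to the update listed in the paper. The initial centroids and the sample stream are made up.

```python
def kmeans_sgd_step(centroids, counts, z):
    """Assign sample z to its nearest centroid and move that centroid toward z."""
    k = min(range(len(centroids)), key=lambda j: (z - centroids[j]) ** 2)
    counts[k] += 1
    centroids[k] += (z - centroids[k]) / counts[k]   # step size 1 / n_k
    return k

centroids = [0.0, 10.0]   # hypothetical initial centroids w_k
counts = [0, 0]           # samples seen per cluster
for z in [1.0, 9.0, 2.0, 11.0, 0.0, 10.0]:
    kmeans_sgd_step(centroids, counts, z)
print(centroids, counts)
```

With this step size each centroid ends up as the running mean of the samples assigned to it, without ever waiting for the full dataset.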


ASGD (Averaged SGD) looks suitable for a distributed machine learning environment to me, since its running average can be updated from any example of the data at any time (no ordering constraint).
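A minimal single-machine sketch of the averaging, assuming ASGD here means Averaged SGD, where ordinary SGD steps are taken and a running average of the iterates is kept alongside. The model, the noisy toy data and the constant learning rate are illustrative assumptions. Nothing in this sketch is distributed.

```python
import random

random.seed(0)
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]  # noisy samples of y ≈ 2x
w, w_bar = 0.0, 0.0
for t in range(1, 2001):
    x, y = random.choice(examples)       # examples can arrive in any order
    w -= 0.05 * (w * x - y) * x          # plain SGD step
    w_bar += (w - w_bar) / t             # running average of the iterates
print(w, w_bar)
```

The raw iterate `w` keeps bouncing between the values each noisy example prefers, while `w_bar` smooths that out.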

Use mxnet to classify images of birds (third episode)

Even after I switched to a CNN in the previous article, the model still couldn’t recognize the correct name of a bird if the little creature stood in a corner (instead of the center) of the picture. So I started to think about the problem: how can I make the neural network ignore the position of the bird in the picture and focus only on its presence? Eventually I remembered “max pooling”:


By choosing the maximum feature value from each 2×2 window, it keeps the strongest feature response and is less affected by the background. For example, if we split a picture into a 2×2 grid (4 cells) and the bird stands only in the first cell, “max pooling” will pass on only the strong response from that cell for the next processing step. The trees, pools, leaves and other trivial details in the other three cells will be dropped.
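Here is a minimal sketch of 2×2 max pooling in plain Python, assuming a feature map stored as a list of lists with even height and width. The numbers are made up, with the large value standing in for the “bird” response.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a list-of-lists feature map."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled

feature_map = [
    [9, 1, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 2],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[9, 1], [1, 2]]
```

Moving the 9 to any other position inside its 2×2 window gives the same pooled output, which is the small amount of position tolerance that pooling buys.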

Then I modified the structure of the CNN again:

and used “0.3” as my learning rate, since “0.3” worked better against overfitting in my runs.

For one week (the Chinese New Year Festival), I was studying “Neural Networks and Deep Learning”. This book is simply awesome! A lot of my doubts about neural networks have been explained and resolved. In the third chapter, the author Michael Nielsen suggests a method, which really enlightened me, to defeat overfitting: artificially expanding the training data. The example is rotating an MNIST handwritten digit image by 15 degrees:

In my case, I decided to crop different parts of each bird picture if the picture is a rectangle:

by using the Python PIL (Python Imaging Library):
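As a rough sketch of the cropping idea (not the original script), the helper below only computes crop boxes. It assumes square crops slid along the longer side of the picture. The function name `square_crop_boxes` and the `num_crops` parameter are hypothetical. The boxes follow PIL’s (left, upper, right, lower) convention, so each one could be passed to `Image.crop`.

```python
def square_crop_boxes(width, height, num_crops=3):
    """Return (left, upper, right, lower) boxes of square crops along the longer side."""
    side = min(width, height)
    slack = max(width, height) - side    # room available for sliding the crop
    if slack == 0 or num_crops == 1:
        return [(0, 0, side, side)]      # already square, or only one crop wanted
    boxes = []
    for i in range(num_crops):
        offset = slack * i // (num_crops - 1)
        if width >= height:
            boxes.append((offset, 0, offset + side, side))   # slide horizontally
        else:
            boxes.append((0, offset, side, offset + side))   # slide vertically
    return boxes

print(square_crop_boxes(300, 200))
# [(0, 0, 200, 200), (50, 0, 250, 200), (100, 0, 300, 200)]
```

Each box yields one extra training image from the same photo, with the bird sitting at a different position in the frame.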

The effect of using “max pooling” and “expanding training data” is significant:

My understanding of CNN (Convolutional Neural Network)

The classic neural network in machine learning usually uses fully-connected layers, which cost too much computing resource to get the final result if the inputs are high-resolution images. Hence the Convolutional Neural Network. A CNN (Convolutional Neural Network) splits the whole big image into small pieces (called Receptive Fields), applies some “Convolutional Operations” (actually image transformations, defined by what are called Kernels) to each Receptive Field, and then applies the pooling operation (usually max-pooling, which simply keeps the biggest feature value in each 2×2 matrix).

Receptive Fields are easy to understand, but why does a CNN use different kinds of “Convolutional Operations” on them? In my opinion, “Convolutional Operations” means using different Kernels to transform the same image (for example: sharpening the image, or detecting the edges of objects in the image), so they can reveal different views of the same image.
These different Kernels reveal different “Features” of an image, thus we call their outputs “Feature Maps”:
[Figure: Convolutional Neural Network]
(The light-yellow matrix is just transformed from the light-gray matrix on its left)
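To make “different kernels reveal different features” concrete, here is a minimal sketch that slides a kernel over a tiny image in plain Python. The image and both kernels are made-up examples. Like most CNN layers, it computes a cross-correlation (the kernel is not flipped).

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1]]    # responds where brightness jumps from left to right
blur = [[0.5, 0.5]]          # averages horizontal neighbours

edge_map = convolve2d(image, vertical_edge)
blur_map = convolve2d(image, blur)
print(edge_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(blur_map)
```

The same image produces two different feature maps: one lights up only at the edge, the other is a smoothed copy.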

By using Receptive Fields and max-pooling, the number of neurons gradually becomes much smaller, which makes the computation (or regression) much easier and faster:
[Figure: Convolutional Neural Network]
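The shrinking can be checked with a little arithmetic. The sketch below assumes a 28×28 input, 5×5 kernels without padding and 2×2 pooling with stride 2. These sizes are illustrative assumptions, not the network from the figure.

```python
def output_size(size, window, stride=1):
    """Side length after sliding a window of the given size and stride (no padding)."""
    return (size - window) // stride + 1

size = 28                                # hypothetical 28x28 input image
sizes = [size]
for window, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:  # conv, pool, conv, pool
    size = output_size(size, window, stride)
    sizes.append(size)

print(sizes)                    # [28, 24, 12, 8, 4]
print([s * s for s in sizes])   # neurons per feature map: [784, 576, 144, 64, 16]
```

After two rounds of convolution and pooling, each feature map holds 16 values instead of 784, so the final fully-connected layer has far less to process.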

Therefore, I reckon the main purpose of using a CNN is to reduce the computational cost of getting a result, compared with a fully-connected neural network.

Using Linear Regression to filter SMS spam messages on Spark

Using the samples from “SMS Spam Collection v. 1”, I wrote a simple program on Spark to classify normal and spam messages.

and the “build.sbt” file contains:

After submitting the job to YARN:

We can retrieve the log of the job by:

And the result is:

From now on, we can consider a message with a negative value as normal and one with a positive value as spam (or use 10 instead of 0 as the boundary).
This is just an example: the sample dataset is too small, so it can only filter obvious spam messages. To identify more spam messages, we need to add more features such as ‘the topic of every message’, ‘the total number of words’, ‘the frequency of special words’, etc.
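The decision rule on the regression output can be sketched in a few lines. The function name `label_message` and the scores are hypothetical, not values from the job log. It is written in Python for brevity, independently of the Spark job.

```python
def label_message(score, boundary=0.0):
    """Treat a regression score above the boundary as spam, otherwise as normal."""
    return "spam" if score > boundary else "normal"

scores = [-3.2, 0.4, 12.5]   # made-up regression outputs
print([label_message(s) for s in scores])                 # boundary 0
print([label_message(s, boundary=10.0) for s in scores])  # stricter boundary 10
```

Raising the boundary from 0 to 10 flags fewer messages as spam, trading missed spam for fewer false alarms.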