Machine Learning

Books I read in year 2019

At the beginning of 2019, I finished the book “The Great Siege: Malta 1565”. The story of a few loyal knights protecting Europe from the Ottoman Empire is so extraordinary that it encouraged me to keep learning and working in information technology.
To find a new job as a Data Engineer or Data Scientist, I almost memorized the whole book “Hundreds of interviews about machine learning” (title translated from Chinese). Although I haven’t found a machine learning job (actually, the job I found is just damned PHP and JavaScript), this book gave me confidence and direction before I started looking.
I bought the book “Rats of NIMH” at the end of 2016 and finished reading it more than two years later. During that period, life changed tremendously for me, though I hope it ends as well as it did for the Frisby family.
The most exciting new thing I learned about was NLP in deep learning. After reading the papers on Word2Vec, Transformer, ELMo, BERT, etc., I became very familiar with NLP and very interested in it.
After starting my new job in June 2019, I read the book “Statistical Machine Learning” (title translated from Chinese) on the commuter bus. The bus shook a lot, so I had to read for a while, rest my eyes, and then repeat. Life is not easy, so I should keep persevering.

Read paper “Large-Scale Machine Learning with Stochastic Gradient Descent”

Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent

GD (Gradient Descent) is used to compute the weights of a neural network (and of other machine learning algorithms). z_i represents example ‘i’, that is, the pair (x_i, y_i). After processing all the examples, we average the gradients with respect to the weights. Processing every example for a single update is slow, so we can imagine that GD is not efficient enough.
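For reference, the batch update described above can be written in the paper's notation as follows. The symbols γ (learning rate), Q(z, w) (loss of example z under weights w) and n (number of examples) come from the paper rather than from the text of this post:

```latex
w_{t+1} = w_t - \gamma \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)
```

The sum over all n examples is the slow part: every single update has to touch the whole training set.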

SGD

Here comes SGD, which uses only one example to compute the gradient at each step. Each update is simpler and much cheaper to compute.
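A minimal sketch of the idea, assuming a linear model with squared loss. The function name, the learning rate, and the synthetic data are illustrative choices, not taken from the paper:

```python
import random

def sgd_linear_regression(examples, lr=0.05, epochs=50, seed=0):
    """Fit y ~ w . x with plain SGD: one example per weight update."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x, y = examples[i]
            # Gradient of 0.5 * (w . x - y)^2 with respect to w is error * x
            error = sum(wj * xj for wj, xj in zip(w, x)) - y
            w = [wj - lr * error * xj for wj, xj in zip(w, x)]
    return w

# Synthetic, noiseless data generated from known weights (an assumption for the demo)
data_rng = random.Random(1)
true_w = [2.0, -3.0]
examples = []
for _ in range(200):
    x = [data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
    examples.append((x, sum(t * xj for t, xj in zip(true_w, x))))

w = sgd_linear_regression(examples)
```

Each update touches a single (x_i, y_i) pair, so its cost does not depend on the size of the dataset.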

k-means

Using SGD in the k-means clustering algorithm seemed counterintuitive to me at first glance. But after thinking of it as “sample z_i belongs to the cluster of w_k, so don’t wait for all samples, just update w_k with z_i”, it becomes conceivable.
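The quoted rule can be sketched as a single-sample update. This is a rough illustration that assumes a per-cluster counter and a step size of 1/count, so each centroid stays the running mean of the samples assigned to it; the function name is made up for this example:

```python
def online_kmeans_update(centroids, counts, z):
    """Assign sample z to its nearest centroid and move that centroid toward z."""
    k = min(
        range(len(centroids)),
        key=lambda j: sum((zj - cj) ** 2 for zj, cj in zip(z, centroids[j])),
    )
    counts[k] += 1
    step = 1.0 / counts[k]  # shrinking step keeps centroid k at the mean of its samples
    centroids[k] = [cj + step * (zj - cj) for zj, cj in zip(z, centroids[k])]
    return k

centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [0, 0]
for z in [[1.0, 1.0], [9.0, 11.0], [3.0, 1.0], [11.0, 9.0]]:
    online_kmeans_update(centroids, counts, z)
```

Only the winning centroid moves, and it moves as soon as the sample arrives, without waiting for a full pass over the data.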

ASGD

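ASGD (Averaged SGD) performs the ordinary SGD update and also keeps a running average of the weights. Because this average can be updated from any example at any time (no ordering constraint), it seems suitable for distributed machine learning environments.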
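In the paper, the "A" in ASGD stands for averaging: the algorithm keeps a running average of the SGD iterates. A small sketch of just that averaging step, with a made-up list of iterates standing in for real SGD weights:

```python
def update_average(w_bar, w_new, t):
    """Running average over iterates: w_bar <- t/(t+1) * w_bar + 1/(t+1) * w_new."""
    return [(t * b + n) / (t + 1) for b, n in zip(w_bar, w_new)]

# Hypothetical SGD iterates, used only to exercise the averaging rule
iterates = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]
w_bar = [0.0, 0.0]
for t, w in enumerate(iterates):
    w_bar = update_average(w_bar, w, t)
```

The recursive form needs only the previous average and the newest weights, so no history of iterates has to be stored.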

Use mxnet to classify images of birds (third episode)

After using a CNN in the previous article, the model still couldn’t recognize the correct bird name if the little creature stood in a corner (instead of the center) of the picture. So I started to think about the problem: how can I let the neural network ignore the position of the bird in the picture and focus only on its presence? Eventually I recalled “max pooling”:

From: http://mxnet.io/tutorials/python/mnist.html

By choosing the max feature value from each 2×2 window, it keeps the strongest feature response and is less affected by the background. Roughly speaking, if we split a picture into a 2×2 grid (4 cells) and the bird only stands in the first cell, “max pooling” will pass only the first cell on to the next step. The trees, pools, leaves and other trivial details in the other three cells will be dropped.
Then I modified the structure of the CNN again:
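A tiny pure-Python sketch of 2×2 max pooling with stride 2 makes the effect concrete. The two 4×4 inputs are made-up feature maps in which the same strong response (9) sits at slightly different positions:

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D list with even height and width."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            row.append(max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ))
        pooled.append(row)
    return pooled

centered = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pooled_centered = max_pool_2x2(centered)
pooled_shifted = max_pool_2x2(shifted)
```

Both inputs pool to the same 2×2 output, which is the small-shift tolerance that pooling provides. A shift that crosses a window boundary would still change the output, so this is only a partial answer to the position problem.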

import mxnet as mx

def convolution_network():
    data = mx.sym.Variable('data')
    # Block 1: convolution -> batch norm -> ReLU -> 2x2 max pooling
    conv1 = mx.sym.Convolution(data=data, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    relu1 = mx.sym.Activation(data=bn1, act_type="relu")
    pool1 = mx.sym.Pooling(data=relu1, pool_type="max", kernel=(2,2), stride=(2,2))
    # Block 2: same layout as block 1
    conv2 = mx.sym.Convolution(data=pool1, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn2 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn2")
    relu2 = mx.sym.Activation(data=bn2, act_type="relu")
    pool2 = mx.sym.Pooling(data=relu2, pool_type="max", kernel=(2,2), stride=(2,2))
    # Fully-connected layer with 3 outputs, followed by softmax
    fc3 = mx.sym.FullyConnected(data=pool2, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')

and I used “0.3” as my learning rate, since “0.3” seemed to work better against overfitting.
For one week (the Chinese New Year Festival), I studied “Neural Networks and Deep Learning”. This book is simply awesome! A lot of my doubts about neural networks were explained and resolved. In the third chapter, the author, Michael Nielsen, suggests a method for defeating overfitting that really enlightened me: artificially expanding the training data. His example is rotating an MNIST handwritten digit image by 15 degrees:

In my case, I decided to crop different parts of each bird picture when the picture is not square:

by using PIL (the Python Imaging Library):

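from PIL import Image

# NOTE: `edge` (the target side length in pixels) is a global defined elsewhere in the script.

def crop_image(origin, imgs, box):
    # Crop the box, shrink the result to fit within edge x edge, and collect it.
    result = origin.crop(box)
    result.thumbnail((edge, edge), Image.NEAREST)
    imgs.append(result)

def crop_and_append_image(img, imgs):
    # img.size is (width, height); getbbox() would only cover the non-zero region.
    width, height = img.size
    if width > height:
        # Landscape: a centered square, plus two squares shifted 40 pixels left and right.
        sub = width - height
        crop_image(img, imgs, (sub // 2, 0, height + sub // 2, height))
        if sub >= 80:
            crop_image(img, imgs, (sub // 2 - 40, 0, height + sub // 2 - 40, height))
            crop_image(img, imgs, (sub // 2 + 40, 0, height + sub // 2 + 40, height))
    elif height > width:
        # Portrait: the same idea along the vertical axis.
        sub = height - width
        crop_image(img, imgs, (0, sub // 2, width, width + sub // 2))
        if sub >= 80:
            crop_image(img, imgs, (0, sub // 2 - 40, width, width + sub // 2 - 40))
            crop_image(img, imgs, (0, sub // 2 + 40, width, width + sub // 2 + 40))
    else:
        # Already square: only resize.
        img.thumbnail((edge, edge), Image.NEAREST)
        imgs.append(img)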

The effect of using “max pooling” and “expanding training data” is significant:

My understanding of CNN (Convolutional Neural Network)

A classic neural network usually uses fully-connected layers, which cost too much computing resource to get the final result if the inputs are high-resolution images. So along comes the Convolutional Neural Network (CNN). A CNN splits the whole big image into small pieces (called Receptive Fields), performs “Convolutional Operations” on each Receptive Field (these are actually image transformations, defined by so-called Kernels), and then applies a pooling operation (usually max pooling, which simply keeps the biggest feature value in a 2×2 matrix).
Receptive Fields are easy to understand, but why does a CNN use different kinds of “Convolutional Operations” on them? In my opinion, “Convolutional Operations” means using different kinds of Kernel Functions to transform the same image (for example, sharpening the image or detecting the edges of an object in it), so they can reveal different views of the same image.
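To make the "different kernels, different views" idea concrete, here is a small pure-Python sketch that slides two hand-picked 2×2 kernels over the same toy image. The kernels and the image are invented for illustration; in a real CNN the kernel values are learned during training rather than chosen by hand:

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy 4x4 image: dark (0) on the left half, bright (1) on the right half
image = [[0, 0, 1, 1] for _ in range(4)]

vertical_edge = [[-1, 1], [-1, 1]]        # responds where brightness jumps left-to-right
blur = [[0.25, 0.25], [0.25, 0.25]]       # averages each 2x2 neighbourhood

edge_map = convolve2d(image, vertical_edge)
blur_map = convolve2d(image, blur)
```

The edge kernel produces a response only in the middle column, where the dark and bright halves meet, while the blur kernel produces a smooth ramp from dark to bright: two different feature maps from one input.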
These different Kernel Functions reveal different “Features” of an image, thus we call their outputs “Feature Maps”:
Convolutional Neural Network
From http://mxnet.io/tutorials/python/mnist.html
(The light-yellow matrix is just transformed from the light-gray matrix on its left)
By using Receptive Fields and max pooling, the number of neurons gradually becomes very small, which makes computing (or regression) much easier and faster:
Convolutional Neural Network
From http://www.cnblogs.com/bzjia-blog/p/3442788.html
Therefore, I reckon the main purpose of using a CNN is to reduce the computational cost compared with a fully-connected neural network.
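The shrinking can be checked with simple arithmetic. The sketch below assumes a 28×28 input (the MNIST size) passed through two rounds of 5×5 convolution (stride 1, no padding) and 2×2 max pooling (stride 2); these layer sizes are chosen for illustration and are not taken from the figures above:

```python
def output_size(size, kernel, stride=1):
    """Side length after a convolution or pooling layer without padding."""
    return (size - kernel) // stride + 1

size = 28
sizes = [size]
# (kernel, stride) for conv1, pool1, conv2, pool2
for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    size = output_size(size, kernel, stride)
    sizes.append(size)
```

Under these assumptions each feature map shrinks from 28×28 = 784 values to 4×4 = 16 before it reaches the fully-connected layer.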

Using Linear Regression to filter spam SMS messages on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple program on Spark to classify normal and spam messages.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
object SimpleRegression {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://127.0.0.1/user/robin/SMSSpamCollection")
    // Each line is "<label>\t<text>"; startsWith is safe even for very short lines.
    val normal = smsData.filter(line => line.startsWith("ham\t"))
      .map(line => line.substring(4))
    val spam = smsData.filter(line => line.startsWith("spam\t"))
      .map(line => line.substring(5))
    // Create a HashingTF instance to map message text to vectors of 100,000 features.
    val tf = new HashingTF(numFeatures = 100000)
    // Each message is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    val normalFeatures = normal.map(email => tf.transform(email.split(" ")))
    val positiveExamples = spamFeatures.map(features => LabeledPoint(100, features))
    val negativeExamples = normalFeatures.map(features => LabeledPoint(-100, features))
    val trainingData = positiveExamples.union(negativeExamples)
    trainingData.cache() // Cache since linear regression with SGD is an iterative algorithm.
    // Run Linear Regression using the SGD algorithm.
    val model = new LinearRegressionWithSGD().run(trainingData)
    // Test on a positive example (spam) and a negative one (normal).
    val posTest = tf.transform(
      ("Someone has contacted our dating service and entered your phone because they fancy you").split(" "))
    val negTest = tf.transform(
      ("Hi Dady, I started studying Spark the other").split(" "))
    println("Prediction for positive test example: " + model.predict(posTest))
    println("Prediction for negative test example: " + model.predict(negTest))
  }
}

and the “build.sbt” file contains:

lazy val root = (project in file("."))
    .settings(
        name := "test",
        version := "1.0",
        scalaVersion := "2.10.6",
        unmanagedJars in Compile += file("/home/sanbai/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"),
        libraryDependencies ++= Seq(
            "org.apache.spark" % "spark-core_2.10" % "2.0.1",
            "org.apache.spark" % "spark-hive_2.10" % "2.0.1",
            "org.apache.spark" % "spark-mllib_2.10" % "2.0.1",
            "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
            "org.apache.hadoop" % "hadoop-client" % "2.7.2",
            "org.xerial.snappy" % "snappy-java" % "1.1.2"
        )
    )

After submitting the job to YARN:

./bin/spark-submit --class SimpleRegression \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2G \
  --executor-memory 14G \
  --executor-cores 1 \
  --num-executors 1 \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/test_2.10-1.0.jar

We can retrieve the job log with:

bin/yarn logs -applicationId application_1473140384986_0096

And the result is:

Prediction for positive test example: 24.238025869328453
Prediction for negative test example: -34.879236141966544

From now on, we can treat a message with a negative value as normal and a message with a positive value as spam (or use 10 instead of 0 as the boundary).
This is just an example: the sample dataset is too small, so it can only filter obvious spam messages. To identify more spam messages, we need to add more features, such as the topic of each message, the total number of words, the frequency of special words, etc.
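The feature-hashing step that HashingTF performs in the program above can be illustrated with a short Python sketch. It uses zlib.crc32 as a stand-in hash, which is an assumption made for this illustration (Spark uses its own hash function, so the indices will differ); the function name and sample message are also made up:

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features=100000):
    """Map a list of words to a sparse {feature index: count} vector."""
    counts = Counter()
    for word in words:
        # Stand-in hash: any deterministic hash reduced modulo num_features works here
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

vector = hashing_tf("free prize call now call".split(" "))
call_index = zlib.crc32(b"call") % 100000
```

Every message becomes a vector of the same length no matter which words it contains, which is what lets the regression model consume raw text. The trade-off is that two different words can collide on the same index.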

Robin on Linux
