mxnet

Hard training works in deep learning

This week, I was trying to train two deep-learning models. They are different from my previous training job: they are really hard to converge to a small ‘loss’.
The first model is about bird image classification. Previously we wrote a modified Resnet-50 model by using MXNet and could use it to reach 78% evaluation-accuracy. But after we rewrote the same model by using Tensorflow, it could only reach 50% evaluation-accuracy, which seems very weird. The first thing that in my mind is that it’s a regularization problem, so I randomly pad/crop and rotate the training images:

  image = tf.image.resize_image_with_crop_or_pad(image, IMAGE_HEIGHT + 80, IMAGE_WIDTH + 80)
  image = tf.contrib.image.rotate(image, tf.random_uniform([1], minval = -math.pi / 3.0, maxval = math.pi / 3.0))
  image = tf.random_crop(image, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS])

By data augmentation, the evaluation accuracy rise to about 60%, but still far from the result of MXNet.
Then I change the optimizer from AdamOptimizer to GradientDescentOptimizer, since my colleague tell me the AdamOptimizer is too powerful that it tends to cause overfit. And I also add ‘weight_decay’ for my Resnet-50 model. This time, the evaluation accuracy shrived to 76%. The affection of ‘weight_decay’ is significantly positive.
The second model is about object detection. We just use the example of Tensorflow’s model library. It includes many cutting-edge models to implement object detection. I just want to try SSD(Single Shot Detection) on MobileNetV2:

python object_detection/train.py \
  --logtostderr \
  --pipeline_config_path=/disk3/donghao/models/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config \
  --train_dir=/disk3/donghao/myckpt/ \
  --num_clones=2

The loss is rapidly reducing from hundreds to twelve, but stay at eleven for a very long time. The loss looks like will stay here forever. Then I begin to adjust hyper-parameters. After testing several learning rates and optimizers, the results doesn’t change at all.
Eventually, I noticed that the loss doesn’t stay forever, it WILL REDUCE AT LAST. For some tasks such as classification, its loss will converge significantly. But for other tasks such as object detection, its loss will shrink at extremely slow speed. Use AdamOptimizer and small learning rate is a better choice for this type of task.

Performance problem for training images on MXNet

After running my MXNet application like this snippet:

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

I found out that the training speed is only 300 samples per second, and the usage of GPU looks very strange:

# nvidia-smi -l |grep Default
| N/A   48C    P0   184W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   44C    P0   145W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   47C    P0    44W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0    40W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0   182W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   44C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    41W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     21%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     17%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     48%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     44%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |      0%      Default |
| N/A   45C    P0    42W / 250W |   4971MiB / 16276MiB |     41%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     36%      Default |
| N/A   45C    P0    43W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0    38W / 250W |   4971MiB / 16276MiB |     98%      Default |

About two days later, I just noticed that there are some messages reported by MXNet:

INFO:root:Using 1 threads for decoding...
INFO:root:Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.

After changing my command to:

MXNET_CPU_WORKER_NTHREADS=16 MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python bird.py train 0.8 &> log

the speed of training has changed to 690 samples per second, and the usage of GPU became much smoothly since it could use more CPUs to decode image now:

# nvidia-smi -l |grep Default
| N/A   40C    P0   173W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   45C    P0   182W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   42C    P0   183W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   46C    P0   163W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   43C    P0   153W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   44C    P0   168W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   190W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   45C    P0   181W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   50C    P0   136W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   46C    P0   138W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   51C    P0   186W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   47C    P0   161W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   212W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   192W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   52C    P0   155W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   48C    P0   152W / 250W |   4971MiB / 16276MiB |     98%      Default |
| N/A   54C    P0   180W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   49C    P0   166W / 250W |   4971MiB / 16276MiB |     96%      Default |
| N/A   54C    P0   194W / 250W |   4971MiB / 16276MiB |     98%      Default |

Fix Resnet-101 model in example of MXNET

SSD(Single Shot MultiBox Detector) is the fastest method in object-detection task (Another detector YOLO, is a little bit slower than SSD). In the source code of MXNET，there is an example for SSD implementation. I test it by using different models: inceptionv3, resnet-50, resnet-101 etc. and find a weird phenomenon: the size .params file generated by resnet-101 is smaller than resnet-50.

Model	Size of .params file
resnet-50	119MB
resnet-101	69MB

Since deeper network have larger number of parameters, resnet-101 has smaller file size for parameters seems suspicious.
Reviewing the code of example/ssd/symbol/symbol_factory.py:

    elif network == 'resnet50':
        num_layers = 50
        image_shape = '3,224,224'  # resnet require it as shape check
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()
    elif network == 'resnet101':
        num_layers = 101
        image_shape = '3,224,224'
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()

Why resnet-50 and resnet-101 has the same ‘from_layers’ ? Let’s check these two models:

In resnet-50, the SSD use two layers (as show in red line) to extract features. One from output of stage-3, another from output of stage-4. In resnet-101, it should be the same (as show in blue line), but it incorrectly copy the config code of resnet-50. The correct ‘from_layers’ for resnet-100 is:

from_layers = ['_plus29', '_plus32', '', '', '', '']

This seems like a bug, so I create a pull request to try fixing it.

Training DNN with less memory cost

The paper “Training Deep Nets with Sublinear Memory Cost” tells us a practical method to train DNN with far less memory cost. The mechanism behind is not difficult to understand: when training a deep network (a computing graph), we have to store temporary data in every node, which will occupy extra memory. Actually, we could remove these temporary data after computing each node, and compute them again in back-propagation period. It’s a tradeoff between computing time and computing space.
The author give us an example in MXNET. The improvement of memory-reducing seems tremendous.
Above the version 1.3, tensorflow also brought a similar module: memory optimizer. We can use it like this:

from tensorflow.core.protobuf import rewriter_config_pb2
....
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
config.graph_options.rewrite_options.memory_optimization = memory_optimization=rewriter_config_pb2.RewriterConfig.MANUAL
with tf.Session(config=config) as sess:
....

Still need to add op in Resnet:

# Add 'reshape' with a special name at the end of every residual-unit
shape = x.get_shape()
dims = []
for i in range(shape.ndims):
    dims.append(shape.dims[i].value)
x = tf.reshape(x, dims, name='robin_137')
....
def ignore(op):
    return op.type in ['Const', 'VariableV2', 'Identity', 'Assign', 'Placeholder', 'RandomShuffleQueueV2', 'QueueEnqueueV2', 'QueueDequeueManyV2']
def checkpoint(op):
    m = re.compile(".*robin_137").match(op.name)
    if m:
        return True
    return False
# Set attribute here
ops = tf.get_default_graph().get_operations()
mirrors = filter(lambda op: not ignore(op) and not checkpoint(op), ops)
for op in mirrors:
    op.node_def.attr['_recompute_hint'].i = 0

By using this method, we could increase batch-size even in deep network (Resnet-101 etc.) now.

The performance of R-CNN in mxnet

We are trying to use faster R-CNN network (also is an example in mxnet) to automatically extract bird from pictures. But it will cost 10 seconds to recognize a bird from a picture by using CPU, which is too slow to be used in product environment. To improve the performance, I download the MKL with version-2017u4 from Intel site and install it in the server. After recompile mxnet:

make clean
make USE_BLAS=openblas USE_CUDNN=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDA=1 USE_MKL2017=1 -j

it only cost 3~4 seconds to recognize bird from picture. MKL really works!
Using GPU to do inference is a another option. But a EC2 instance with a GPU device is much more expensive than a normal EC2 instance. So we will still using CPU in the near future.

Price of EC2 instance in US-West(Oregon)

	vCPU	ECU	Memory (GiB)	Instance Storage (GB)	Linux/UNIX Usage
t2.large	2	Variable	8	EBS Only	$0.0928 per Hour
g2.2xlarge	8	26	15	60 SSD	$0.65 per Hour

Performance comparison between CPU and GPU

To compare the performance of floating point arithmetic between Intel CPU and Nvidia GPU, I write some code to do the dot-product operation of two vectors with size of 2GB.
The code for CPU test is using AVX instrument:

void test_avx(float *left, float *right, float *result, size_t count) {
  __m256 *first, *second, *end, *res;
  struct timeval before, after;
  float *c;
  int only = 0, i;
  gettimeofday(&before, NULL);
  for (i = 0; i < LOOP; i++) {
    end = (__m256*)(left + count);
    first = (__m256*)left;
    second = (__m256*)right;
    res = (__m256*)result;
    while (first < end) {
      *res = _mm256_mul_ps(*first, *second);
      if (!only) {
        c = (float*)res;
        printf("[Sample: %f]\n", *c);
        only = 1;
      }
      first ++;
      second ++;
      res ++;
    }
  }
  gettimeofday(&after, NULL);
  printf("AVX:\t%lu\n", after.tv_usec + after.tv_sec * 1000000 -
         (before.tv_usec + before.tv_sec * 1000000));
}

and use

gcc -mavx2 -g -O2 cpu_test.c -o cpu_test

to compile it.
It cost 7.5 seconds to run this test program (LOOP is 10). But my colleague pointed out for me that this program is a "memory-intensive" program as it will sequentially access two 2GB vectors. The access of memory will cost CPU about 200~250 cycles but the _mm256_mul_ps() only cost 5~10 cycles, therefore the primary time has been waste on memory accessing. The effective way to test AVX instrument is using L1-cache of CPU artfully:

void test_avx(float *left, float *right, float *result, size_t count) {
  __m256 *first, *second, *end, *res;
  struct timeval before, after;
  float *c;
  int only = 0, i, j;
  gettimeofday(&before, NULL);
  for (i = 0; i < BUFF_LEN/STRIDE; i++) {
    float *begin = left + (i * STRIDE/sizeof(float));
    for (j = 0; j < LOOP; j++) {
      end = (__m256*)(begin + STRIDE/sizeof(float));
      first = (__m256*)begin;
      second = (__m256*)right;
      res = (__m256*)result;
      while (first < end) {
        *res = _mm256_mul_ps(*first, *second);
        if (!only) {
          c = (float*)res;
          printf("[Sample: %f]\n", *c);
          only = 1;
        }
        first ++;
        second ++;
        res ++;
      }
    }
  }
  gettimeofday(&after, NULL);
  printf("AVX:\t%lu\n", after.tv_usec + after.tv_sec * 1000000 -
         (before.tv_usec + before.tv_sec * 1000000));
}

By chopping vectors into 4K "stride" and repeatedly run AVX instrument on one stride, we can use L1-cache of CPU more intensely. The result is prodigious: it cost only 0.78 seconds, almost ten times faster!
My colleague proceeded recommending me to use MKL (Intel's Match Kernel Library) to test Xeon CPU because it was of many heavily optimizations for Intel-specific-hardware-architecture. In a word, it's better to use library instead of raw code to evaluate performance of CPU and GPU. So finally, I decided to use mxnet to test performance with real data.
Using

sudo make USE_BLAS=cublas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDNN=1 USE_MKL2017=1 USE_OPENMP=1 -j80

to build mxnet with cuDNN library (for GPU) and MKL(for CPU), I run my program for bird-classification. And the result shows: the performance of CPU and GPU is about 1 : 5, that GPU is much faster than total CPU-cores in a server.

Use mxnet to classify images of birds (third episode)

After using CNN in previous article, it still can’t recognize the correct name of birds if the little creature stand on the corner (instead of the center) of the whole picture. Then I started to think about the problem: how to let neural-network ignore the position of the bird in picture, but only focus on its exists? Eventually I recollected the “max pooling”:

From: http://mxnet.io/tutorials/python/mnist.html

By choose the max feature value from 2×2 pad, it will amplify the most important feature without affected by backgrounds. For example, if we split a picture into 2×2 chassis (4 plates) and the bird only stand in the first plate, the “max pooling” will choose only the first plate for next processing. Those trees, pools, leaves and other trivial issues in other three plates will be omitted.
Then I modify the structure of CNN again:

def convolution_network():
    data = mx.sym.Variable('data')
    conv1 = mx.sym.Convolution(data=data, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    tanh1 = mx.sym.Activation(data=bn1, act_type="relu")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))
    conv2 = mx.sym.Convolution(data=pool1, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn2 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn2")
    tanh2 = mx.sym.Activation(data=bn2, act_type="relu")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))
    fc3 = mx.sym.FullyConnected(data=pool2, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')

and using “0.3” for my learning rate, as “0.3” is better to against overfitting.
For one week (Chinese New Year Festival), I was studying “Neural Networks and Deep Learning”. This book is briefly awesome! A lot of doubts about Neural Networks for me have been explained and resolved. In third chapter, the author Michael Nielsen suggests a method, which really enlightened me, to defeat overfitting: artificially expanding training data. The example is rotating the MNIST handwritten digital picture by 15 degrees:

In my case, I decided to crop different parts of bird picture if the picture is a rectangle:

by using the python PIL (Picture Processing Library):

def crop_image(origin, imgs, box):
  result = origin.crop(box)
  result.thumbnail((edge, edge), Image.NEAREST)
  imgs.append(result)
def crop_and_append_image(img, imgs):
    tp = img.getbbox()
    width = tp[2]
    height = tp[3]
    if (width > height):
        sub = width - height
        crop_image(img, imgs, (sub / 2, 0, height + sub / 2, height))
        if (sub >= 80):
          crop_image(img, imgs, (sub / 2 - 40, 0, height + sub / 2 - 40, height))
          crop_image(img, imgs, (sub / 2 + 40, 0, height + sub / 2 + 40, height))
    elif (height > width):
        sub = height - width
        crop_image(img, imgs, (0, sub / 2, width, width + sub / 2))
        if (sub >= 80):
          crop_image(img, imgs, (0, sub / 2 - 40, width, width + sub / 2 - 40))
          crop_image(img, imgs, (0, sub / 2 + 40, width, width + sub / 2 + 40))
    else:
      img.thumbnail((edge, edge), Image.NEAREST)
      imgs.append(img)

The effect of using “max pooling” and “expanding training data” is significant:

Use mxnet to classify images of birds (second episode)

Using one convolutional-layer and two fully-connected-layers cost too much memory and also have bad performance for training, therefore I modify the model to two convolutional-layers and two narrow fully-connected-layers:

def convolution_network():
    data = mx.sym.Variable('data')
    conv1 = mx.sym.Convolution(data=data, kernel=(10, 10), stride=(2,2), num_filter=8)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    tanh1 = mx.sym.Activation(data=bn1, act_type="relu")
    conv2 = mx.sym.Convolution(data=tanh1, kernel=(10, 10), stride=(2,2), num_filter=8)
    bn2 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn2")
    tanh2 = mx.sym.Activation(data=bn2, act_type="relu")
    flatten = mx.sym.Flatten(data=tanh2)
    fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=100)
    tanh3 = mx.sym.Activation(data=fc1, act_type="relu")
    fc2 = mx.sym.FullyConnected(data=tanh3, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc2, name='softmax')

and training it by using learning rate “0.1” instead of “0.01” which may cause “overfit” in neural network.
Finally, the model occupied only 6MB disk space (It was more than 200MB before).
Now I could build a web site on a virtual machine of AliCloud (which is sponsored by Allen Mei, my old colleague) to let users uploading birds’ image and classifying it freely. To thank my sponsor, I named the web site “Allen’s bird” 🙂

In this web, I use angularjs and ngImgCrop plugin from “Alex Kaul”. They are powerful and convenient.

The append() operation of np.array() is very slow.
After replacing np.array() by normal python array, the training script could run much faster now.

Use mxnet to classify images of birds (first episode)

Recently, I was trying to classify images of birds by using machine learning technology. The most familiar deep learning library for me is the mxnet, so I use its python interface to build my Birds-Classification-System.
For having not sufficient number of images for all kinds of bird, I just collect three types of them: “Loggerhead Shrike”, “Anhinga”, and “Eastern Meadowlark”.

Loggerhead Shrike

Anhinga

Eastern Meadowlark

After collecting more than 800 images of the three kinds of bird, I started to write my python code by learning the “Handwritten Digital Sample” of mxnet step by step.
Firstly, using PIL (Python Image Library) to preprocess these images – chop them from rectangle to square with 100 pixels length of edge:

edge = 100
......
def process_image(file_name):
    img = Image.open(file_name)
    tp = img.getbbox()
    width = tp[2]
    height = tp[3]
    if (width > height):
        sub = width - height
        img = img.crop((sub / 2, 0, height + sub / 2, height))
    elif (height > width):
        sub = height - width
        img = img.crop((0, sub / 2, width, width + sub / 2))
    img.thumbnail((edge, edge), Image.NEAREST)
    return img

Then put all images into a numpy array and label them:

images = np.array([])
labels = np.array([])
nr_images = 0
......
      for file in files:
        img = process_image(bird_dir + file)
        global images, labels, nr_images
        if array.shape == (edge, edge, 3):
          images = np.append(images, np.asarray(img))
          labels = np.append(labels, lab)
          nr_images = nr_images + 1

Now I can build the Convolutional Neural Network model easily by using the powerful mxnet. The CNN will slice all pictures to 8×8 pixels small chunk with 2 pixels step, therefore enhance the small features of these birds, such as black-eye-mask of Loggerhead-Shrike, yellow neck of Eastern-Meadowlark, etc.

def convolution_network():
    data = mx.sym.Variable('data')
    conv1 = mx.sym.Convolution(data=data, kernel=(8, 8), stride=(2,2), num_filter=8)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    tanh1 = mx.sym.Activation(data=bn1, act_type="relu")
    flatten = mx.sym.Flatten(data=tanh1)
    fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=1000)
    tanh3 = mx.sym.Activation(data=fc1, act_type="relu")
    fc3 = mx.sym.FullyConnected(data=tanh3, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')

Training the data:

images = np.array(images).reshape(nr_images, 3, edge, edge).astype(np.float32)/255
batch_size = 200
train_iter = mx.io.NDArrayIter(images, labels, batch_size, shuffle=True)
mlp = convolution_network()
model = mx.model.FeedForward(
        ctx = mx.gpu(0),
        symbol = mlp,
        num_epoch = 40,
        learning_rate = 0.02
model.fit(
        X = train_iter,
        batch_end_callback = mx.callback.Speedometer(batch_size, 1))

Using GPU for training is extremely fast – it only cost me 5 minutes to train all 800 images, although adjusting the parameters of CNN cost me more than 3 days 🙂

Firstly I use Fully Connected Neural Network, it costs a lot of time for training but prone to overfit. After using the CNN with BatchNorm() in mxnet, the speed of training and affect of classification advanced significantly.
CNN(Convolutional Neural Network) is really a ace in deep learning area for images!