February 2017 – Robin on Linux

Performance comparison between CPU and GPU

To compare the floating-point performance of an Intel CPU and an Nvidia GPU, I wrote some code that multiplies two 2GB vectors element by element (the multiplication step of a dot product).
The code for the CPU test uses AVX instructions:

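#include <immintrin.h>  /* __m256, _mm256_mul_ps */
#include <stdio.h>      /* printf */
#include <sys/time.h>   /* gettimeofday */

#define LOOP 10         /* number of passes over the vectors */

/* left, right and result must be 32-byte aligned, because they are
   read and written through __m256 pointers. */
void test_avx(float *left, float *right, float *result, size_t count) {
  __m256 *first, *second, *end, *res;
  struct timeval before, after;
  float *c;
  int only = 0, i;
  gettimeofday(&before, NULL);
  for (i = 0; i < LOOP; i++) {
    end = (__m256*)(left + count);
    first = (__m256*)left;
    second = (__m256*)right;
    res = (__m256*)result;
    while (first < end) {
      /* multiply 8 floats at a time */
      *res = _mm256_mul_ps(*first, *second);
      if (!only) {
        /* print the very first product once */
        c = (float*)res;
        printf("[Sample: %f]\n", *c);
        only = 1;
      }
      first ++;
      second ++;
      res ++;
    }
  }
  gettimeofday(&after, NULL);
  /* elapsed time in microseconds */
  printf("AVX:\t%ld\n", (long)(after.tv_usec + after.tv_sec * 1000000 -
         (before.tv_usec + before.tv_sec * 1000000)));
}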

I compiled it with:

gcc -mavx2 -g -O2 cpu_test.c -o cpu_test
It took 7.5 seconds to run this test program (LOOP is 10). But my colleague pointed out that this program is "memory-intensive", as it sequentially accesses two 2GB vectors. A memory access costs the CPU about 200~250 cycles, while _mm256_mul_ps() costs only 5~10 cycles, so most of the time is wasted on memory access. A more effective way to test AVX instructions is to make deliberate use of the CPU's L1 cache:

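/* STRIDE is the chunk size in bytes (4K in this test) and BUFF_LEN is
   the total buffer size in bytes. Note that right and result restart
   from their beginning for every stride, so only their first STRIDE
   bytes are ever touched. */
void test_avx(float *left, float *right, float *result, size_t count) {
  __m256 *first, *second, *end, *res;
  struct timeval before, after;
  float *c;
  int only = 0, i, j;
  gettimeofday(&before, NULL);
  for (i = 0; i < BUFF_LEN/STRIDE; i++) {
    /* start of the i-th stride of the left vector */
    float *begin = left + (i * STRIDE/sizeof(float));
    /* run all LOOP passes on this stride while it is still in cache */
    for (j = 0; j < LOOP; j++) {
      end = (__m256*)(begin + STRIDE/sizeof(float));
      first = (__m256*)begin;
      second = (__m256*)right;
      res = (__m256*)result;
      while (first < end) {
        *res = _mm256_mul_ps(*first, *second);
        if (!only) {
          /* print the very first product once */
          c = (float*)res;
          printf("[Sample: %f]\n", *c);
          only = 1;
        }
        first ++;
        second ++;
        res ++;
      }
    }
  }
  gettimeofday(&after, NULL);
  /* elapsed time in microseconds */
  printf("AVX:\t%ld\n", (long)(after.tv_usec + after.tv_sec * 1000000 -
         (before.tv_usec + before.tv_sec * 1000000)));
}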

By chopping the vectors into 4K "strides" and repeatedly running AVX instructions on one stride at a time, we make much heavier use of the CPU's L1 cache. The result is striking: it took only 0.78 seconds, almost ten times faster!
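To see why the first version is bound by memory and the second is not, here is a rough back-of-the-envelope sketch in Python. It assumes each pass of the first version streams two input vectors and one output vector of 2GB each (treated as 2 GiB), and that the L1 data cache is 32 KiB per core, which is a typical value but not something I measured on this machine. The helper names are made up for illustration.

```python
GIB = 1024 ** 3
KIB = 1024

VECTOR_BYTES = 2 * GIB   # size of each vector, from the post
LOOP = 10                # number of passes, from the post
STRIDE_BYTES = 4 * KIB   # chunk size, from the post
L1D_BYTES = 32 * KIB     # ASSUMPTION: typical per-core L1 data cache size


def streamed_bytes(vector_bytes, loops, arrays=3):
    """Bytes touched when every pass streams all arrays (2 inputs + 1 output)."""
    return vector_bytes * loops * arrays


def working_set_bytes(stride_bytes, arrays=3):
    """Bytes that must stay cached while one stride is processed repeatedly."""
    return stride_bytes * arrays


naive = streamed_bytes(VECTOR_BYTES, LOOP)
blocked = working_set_bytes(STRIDE_BYTES)

print("first version touches %d GiB in total" % (naive // GIB))
print("implied bandwidth at 7.5 s: %.1f GiB/s" % (naive / GIB / 7.5))
print("stride working set: %d KiB, fits in L1: %s"
      % (blocked // KIB, blocked <= L1D_BYTES))
```

Under these assumptions the first version is limited by how fast memory can deliver data, while the strided version keeps its whole working set inside L1 for all LOOP passes.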
My colleague went on to recommend MKL (Intel's Math Kernel Library) for testing the Xeon CPU, because it contains many optimizations specific to Intel's hardware architecture. In a word, it's better to use a library instead of raw code to evaluate the performance of CPU and GPU. So finally, I decided to use mxnet to test performance with real data.
Using

sudo make USE_BLAS=cublas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDNN=1 USE_MKL2017=1 USE_OPENMP=1 -j80

to build mxnet with the cuDNN library (for GPU) and MKL (for CPU), I ran my bird-classification program. The result shows that the performance ratio of CPU to GPU is about 1 : 5, so the GPU is much faster than all the CPU cores of a server combined.

Use mxnet to classify images of birds (third episode)

After using a CNN in the previous article, the model still couldn’t recognize the correct name of a bird if the little creature stood in a corner (instead of the center) of the picture. So I started to think about the problem: how can I let the neural network ignore the position of the bird in the picture and focus only on its existence? Eventually I recalled “max pooling”:

From: http://mxnet.io/tutorials/python/mnist.html

By choosing the max feature value from each 2×2 patch, it amplifies the most important feature without being affected by the background. For example, if we split a picture into a 2×2 grid (4 cells) and the bird stands only in the first cell, “max pooling” will pass only the first cell on for further processing. The trees, pools, leaves and other trivial details in the other three cells will be dropped.
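A minimal sketch of the idea in plain Python (not mxnet's implementation; the function name and the toy feature maps are made up for illustration). It applies a 2×2 window with stride 2 and shows that moving a strong feature by one pixel inside the same window does not change the pooled output.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 on a list of lists.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# The "bird" is the strong activation 9; everything else is background.
bird_in_corner = [[9, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]
bird_shifted = [[0, 0, 0, 0],
                [0, 9, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]

print(max_pool_2x2(bird_in_corner))
print(max_pool_2x2(bird_shifted))
```

Both calls produce the same pooled map, which is the small-shift tolerance I was hoping for. A shift that crosses a window boundary would still change the output, so pooling only helps with small displacements at each layer.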
Then I modified the structure of the CNN again:

import mxnet as mx

def convolution_network():
    data = mx.sym.Variable('data')
    conv1 = mx.sym.Convolution(data=data, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    tanh1 = mx.sym.Activation(data=bn1, act_type="relu")
    pool1 = mx.sym.Pooling(data=tanh1, pool_type="max", kernel=(2,2), stride=(2,2))
    conv2 = mx.sym.Convolution(data=pool1, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn2 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn2")
    tanh2 = mx.sym.Activation(data=bn2, act_type="relu")
    pool2 = mx.sym.Pooling(data=tanh2, pool_type="max", kernel=(2,2), stride=(2,2))
    fc3 = mx.sym.FullyConnected(data=pool2, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')

and used “0.3” as my learning rate, since “0.3” works better against overfitting.
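To check how quickly this network shrinks the picture, the spatial size after each layer can be traced with the usual formula floor((size - kernel) / stride) + 1. This is only a sketch: it assumes no padding and floor rounding (which I believe are the defaults for the layers above), and the 256-pixel input edge is a hypothetical value, not the one used in my training.

```python
def output_size(size, kernel, stride):
    """Spatial output size of a convolution or pooling layer without padding."""
    return (size - kernel) // stride + 1


def trace_sizes(input_size):
    """Return the edge length after each layer of the network above."""
    # (layer name, kernel edge, stride), copied from convolution_network()
    layers = [("conv1", 12, 5), ("pool1", 2, 2),
              ("conv2", 12, 5), ("pool2", 2, 2)]
    sizes = []
    size = input_size
    for name, kernel, stride in layers:
        size = output_size(size, kernel, stride)
        sizes.append((name, size))
    return sizes


# ASSUMPTION: a 256 x 256 input, chosen only for illustration.
for name, size in trace_sizes(256):
    print("%s: %d x %d" % (name, size, size))
```

With a 256-pixel edge the map shrinks to 49, 24, 3 and finally 1, so each of the 128 filters contributes a single value to the fully connected layer.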
For one week (the Chinese New Year Festival), I was studying “Neural Networks and Deep Learning”. This book is simply awesome! A lot of my doubts about neural networks have been explained and resolved. In the third chapter, the author Michael Nielsen suggests a method to defeat overfitting that really enlightened me: artificially expanding the training data. His example is rotating an MNIST handwritten digit image by 15 degrees:

In my case, I decided to crop different parts of the bird picture if the picture is not square:

by using the Python PIL (Python Imaging Library):

from PIL import Image

# `edge` (the side length of the thumbnail in pixels) is defined elsewhere.

def crop_image(origin, imgs, box):
    # Crop `box` out of the original and shrink it to fit edge x edge.
    result = origin.crop(box)
    result.thumbnail((edge, edge), Image.NEAREST)
    imgs.append(result)

def crop_and_append_image(img, imgs):
    # img.size is the full picture; getbbox() would ignore black borders.
    width, height = img.size
    if width > height:
        sub = width - height
        # Centered square crop, plus two crops shifted 40 pixels left and right.
        crop_image(img, imgs, (sub // 2, 0, height + sub // 2, height))
        if sub >= 80:
            crop_image(img, imgs, (sub // 2 - 40, 0, height + sub // 2 - 40, height))
            crop_image(img, imgs, (sub // 2 + 40, 0, height + sub // 2 + 40, height))
    elif height > width:
        sub = height - width
        # Centered square crop, plus two crops shifted 40 pixels up and down.
        crop_image(img, imgs, (0, sub // 2, width, width + sub // 2))
        if sub >= 80:
            crop_image(img, imgs, (0, sub // 2 - 40, width, width + sub // 2 - 40))
            crop_image(img, imgs, (0, sub // 2 + 40, width, width + sub // 2 + 40))
    else:
        # Already square: just shrink it.
        img.thumbnail((edge, edge), Image.NEAREST)
        imgs.append(img)
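The box arithmetic is easier to check when it is separated from the image handling. The helper below is a sketch with a made-up name that returns the same (left, upper, right, lower) boxes as the code above without touching PIL, so the numbers can be inspected for any picture size.

```python
def square_crop_boxes(width, height, shift=40):
    """Return (left, upper, right, lower) boxes for square crops.

    Always yields the centered crop; adds two crops shifted by `shift`
    pixels when the long side has at least 2 * shift pixels to spare.
    """
    if width == height:
        return [(0, 0, width, height)]
    side = min(width, height)
    sub = abs(width - height)
    offsets = [sub // 2]
    if sub >= 2 * shift:
        offsets += [sub // 2 - shift, sub // 2 + shift]
    if width > height:
        # Landscape picture: slide the square window horizontally.
        return [(off, 0, off + side, side) for off in offsets]
    # Portrait picture: slide the square window vertically.
    return [(0, off, side, off + side) for off in offsets]


# A hypothetical 640 x 480 picture yields three 480 x 480 training samples.
for box in square_crop_boxes(640, 480):
    print(box)
```

For the hypothetical 640 x 480 picture, one original image becomes three training samples; a nearly square picture such as 500 x 480 only gets the centered crop.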

The effect of using “max pooling” and “expanding training data” is significant:
