Read paper “A Column Store Engine for Real-Time Streaming Analytics” (MemSQL)

Paper reference: A Column Store Engine for Real-Time Streaming Analytics

Background:
According to the official website, MemSQL is “a high performance data warehouse designed for the cloud that delivers ultra fast insights of your live and historical data”. It use row-storage and lock-free engine for data in memory and column-storage for data in disk.
MemSQL could also store all data into disk. In its most durable state, MemSQL will not lose any transactions which have been acknowledged. And it implements “Read Commited” isolation level for transactions.
I heard about MemSQL in early 2012, but don’t know it has became a OLTP-and-OLAP system until several days ago.

What is the problem?
To fulfill OLAP jobs, MemSQL has to store data into disk with columnar-storage-format. But MemSQL still need to process OLTP requests such as random INSERT or UPDATE. Therefore it store data into a data structure named “Segment”. And by connecting different segments into a ordered segments list (which named “Sorted Runs”), MemSQL could balance the requirements between frequent INSERT/UPDATE operations and SCAN/GROUP BY operations.
For example:
1. Users INSERT three keys: 1, 92, 107. MemSQL will create a segment that contains the three keys:
        [1, 92, 107]
2. Users continue to INSERT two keys: 63, 84. The segments list are:
        [1, 92, 107]
        [63, 84]
3. After many INSERT operations, the segments become:
        [1, 92, 107]
        [2, 17, 42]
        [63, 84]
        [110, 118, 172]

Now,MemSQL makes these segments into “Sorted Runs”, which have a basic order for keys:

When the SELECT comes, MemSQL could find the row quickly by just looking up two ordered segment-lists. Uses could also SCAN two segment-lists effectively to do OLAP tasks, since all the data are stored in columnar-format.
What happen if users INSERT more rows? MemSQL will merge the old big Sorted-Runs and create new segment for freshly-insert-data, which could keep the number of Sorted-Runs acceptable.
In practice, MemSQL column store engine uses a constant of 8, so that the biggest sorted run has at least of all the segments, the second biggest sorted run has at least of the remaining segments, and so forth. This strategy seems just like LSM tree in LevelDB of Google. The difference between LevelDB and MemSQL is LevelDB store every key-value pair respectively but MemSQL store a batch of rows into one segment.

If INSERT operations come when MemSQL is merging, the small merging actions will be aborted and relaunch. For big merging actions, it will barely skip any missing or updated segments, for skipping some segments will not ruin the new merged Sorted-Runs.
As we can see, MemSQL endeavors to avoid locking for in-memory data operations, which makes its performance significantly superior.

There are also some practical considerations for MemSQL.

  1. If only one merger are working at big Sort-Runs, it will cost too much time and the small Sorted-Runs will become tremendous for intensive INSERT operations. So MemSQL launch two mergers: Fast Merger and Slow Merger.
  2. MemSQL could accumulates some rows before batching them into a segment, which will decrease the fragment of data.
  3. MemSQL also create special Commands for users to sort all the segments, or just decrease the level of Sorted-Runs.
  4. Columnar-Storage format in memory makes using SIMD instruments of CPU possible. This shed some light on me: may be one day we can run machine learning jobs on MemSQL directly 🙂

Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write

Background:
A job in Hadoop consists of three main stages: map, shuffle, reduce (Actually shuffle stage has been contained into reduce stage).


What is the problem?
Shuffle phase need to migrate large mount of data from nodes which running map job to those nodes which intend to run reduce job. This cause shuffle-latency which is usually significant. And the reason is:

  • Partitioning skew: Hadoop use hash algorithm to organize output data of map task, if too many keys have the same hash, it may cause unevenly size of partitions
  • Coupling of shuffle and reduce: data shuffling can’t be overlapped with map tasks


Solution: iShuffle

    • “Shuffler” collect intermediate data generated by every map task and predict size of respective partition
    • “Shuffler Manager” collect informations from “Shuffler” and decide the position of partitions


    • Shuffle-on-Write:While a map task writing a spill to local filesystem, it will (by modification of Hadoop code) also write spill to correspondent node where reduce task will be launched
    • Automated map out placement: iShuffle will decide the position for every partition by “map selectivity”, which is the ratio of map input size and map output size. After predicting the “map selectivity” and knowing the total input size of data, iShuffle could finally choose optimist node for every partition data


  • Flexible reduce scheduling: when a node request a reduce task, the Task Manager(after modification of Hadoop’s FIFO scheduler) will find the list of partitions reside on this node, and launch reduce task only for these partitions (Make sure reduce task will only read shuffled data from local filesystem which will reduce network-bandwidth at reduce stage)

In my opinion
Using prediction technology to proactively move map output to apt nodes, which avoid partition skew, is the most intelligent part of this paper. This tech could also be used to other intermediate data moving scenario like OLAP in Data-warehouse.
But, I also suspect that in real production, not too many organizations will use this “iShuffle” as they usually run multi-user applications in Hadoop system. When a lot of jobs running in one Hadoop cluster simultaneously, a low peak of CPU-usage made by long reduce latency of one job will be compensated by other computing-intensive jobs. Therefore to all users, none of hardware resources are wasted.

Performance comparison between CPU and GPU

To compare the performance of floating point arithmetic between Intel CPU and Nvidia GPU, I write some code to do the dot-product operation of two vectors with size of 2GB.
The code for CPU test is using AVX instrument:

and use

to compile it.
It cost 7.5 seconds to run this test program (LOOP is 10). But my colleague pointed out for me that this program is a “memory-intensive” program as it will sequentially access two 2GB vectors. The access of memory will cost CPU about 200~250 cycles but the _mm256_mul_ps() only cost 5~10 cycles, therefore the primary time has been waste on memory accessing. The effective way to test AVX instrument is using L1-cache of CPU artfully:

By chopping vectors into 4K “stride” and repeatedly run AVX instrument on one stride, we can use L1-cache of CPU more intensely. The result is prodigious: it cost only 0.78 seconds, almost ten times faster!

My colleague proceeded recommending me to use MKL (Intel’s Match Kernel Library) to test Xeon CPU because it was of many heavily optimizations for Intel-specific-hardware-architecture. In a word, it’s better to use library instead of raw code to evaluate performance of CPU and GPU. So finally, I decided to use mxnet to test performance with real data.

Using

to build mxnet with cuDNN library (for GPU) and MKL(for CPU), I run my program for bird-classification. And the result shows: the performance of CPU and GPU is about 1 : 5, that GPU is much faster than total CPU-cores in a server.

Use mxnet to classify images of birds (third episode)

After using CNN in previous article, it still can’t recognize the correct name of birds if the little creature stand on the corner (instead of the center) of the whole picture. Then I started to think about the problem: how to let neural-network ignore the position of the bird in picture, but only focus on its exists? Eventually I recollected the “max pooling”:



From: http://mxnet.io/tutorials/python/mnist.html

By choose the max feature value from 2×2 pad, it will amplify the most important feature without affected by backgrounds. For example, if we split a picture into 2×2 chassis (4 plates) and the bird only stand in the first plate, the “max pooling” will choose only the first plate for next processing. Those trees, pools, leaves and other trivial issues in other three plates will be omitted.

Then I modify the structure of CNN again:

and using “0.3” for my learning rate, as “0.3” is better to against overfitting.

For one week (Chinese New Year Festival), I was studying “Neural Networks and Deep Learning”. This book is briefly awesome! A lot of doubts about Neural Networks for me have been explained and resolved. In third chapter, the author Michael Nielsen suggests a method, which really enlightened me, to defeat overfitting: artificially expanding training data. The example is rotating the MNIST handwritten digital picture by 15 degrees:


In my case, I decided to crop different parts of bird picture if the picture is a rectangle:


by using the python PIL (Picture Processing Library):

The effect of using “max pooling” and “expanding training data” is significant:



Use mxnet to classify images of birds (second episode)

Using one convolutional-layer and two fully-connected-layers cost too much memory and also have bad performance for training, therefore I modify the model to two convolutional-layers and two narrow fully-connected-layers:

and training it by using learning rate “0.1” instead of “0.01” which may cause “overfit” in neural network.
Finally, the model occupied only 6MB disk space (It was more than 200MB before).

Now I could build a web site on a virtual machine of AliCloud (which is sponsored by Allen Mei, my old colleague) to let users uploading birds’ image and classifying it freely. To thank my sponsor, I named the web site “Allen’s bird” 🙂




In this web, I use angularjs and ngImgCrop plugin from “Alex Kaul”. They are powerful and convenient.

The append() operation of np.array() is very slow.
After replacing np.array() by normal python array, the training script could run much faster now.

Use mxnet to classify images of birds (first episode)

Recently, I was trying to classify images of birds by using machine learning technology. The most familiar deep learning library for me is the mxnet, so I use its python interface to build my Birds-Classification-System.
For having not sufficient number of images for all kinds of bird, I just collect three types of them: “Loggerhead Shrike”, “Anhinga”, and “Eastern Meadowlark”.

Loggerhead Shrike Anhinga Eastern Meadowlark

After collecting more than 800 images of the three kinds of bird, I started to write my python code by learning the “Handwritten Digital Sample” of mxnet step by step.
Firstly, using PIL (Python Image Library) to preprocess these images – chop them from rectangle to square with 100 pixels length of edge:

Then put all images into a numpy array and label them:

Now I can build the Convolutional Neural Network model easily by using the powerful mxnet. The CNN will slice all pictures to 8×8 pixels small chunk with 2 pixels step, therefore enhance the small features of these birds, such as black-eye-mask of Loggerhead-Shrike, yellow neck of Eastern-Meadowlark, etc.

Training the data:

Using GPU for training is extremely fast – it only cost me 5 minutes to train all 800 images, although adjusting the parameters of CNN cost me more than 3 days 🙂

Firstly I use Fully Connected Neural Network, it costs a lot of time for training but prone to overfit. After using the CNN with BatchNorm() in mxnet, the speed of training and affect of classification advanced significantly.
CNN(Convolutional Neural Network) is really a ace in deep learning area for images!

A CUDA program to test performance of GPU

For testing performance of our Nvidia GPU, I have to write my first CUDA program to mutiply two Vectors with each size of 2GB:

Luckily, it works 🙂
The cudaMemcpy() cost about 1 second, but the multiplication of two Vectors cost only 80 micro seconds (even with 10 LOOP as default). Therefore I reckon GPU is perfect for training of Machine Learning, but not promising for predicting when Model has been built.

Note: Use cudaMalloc()/cudaMemcpy() instead of malloc()/memcpy() in Standard C Library, or else the program will not run VecMul<<<>>>

Books I read in year 2016

Here comes the last day of 2016 year. And it is also the time for me to review my harvest about knowledge, or books.

Frankly speaking, the book “All hard thing about hard things” literally frighten me, and cause me to give up any idea about joining a startup company in China. Maybe this is the best consequence, for many startup companies failed in this end of year and I fortunately avoid this tempest.

Diving more deeper into the ocean of “Hadoop Ecosystem”, or “Big Data”, I find out Spark is really a convenient and powerful framework (compare to MapReduce) which could implement complicated algorithm or data-flow with a few lines of code. Surely, Scala is also a key element for Spark’s efficiency and concision.

Today, even normal person could imagine a sci-fi story about how modern people will fight with Alien invaders. But, what will happen if Aliens attacked the earth in the ancient time? What about Medieval age? Then comes the funny and bold sci-fi novel “The High Crusade”. A group of Medieval army defeat the invader of Alien, and did even more: occupied a frontline planet of a gigantic Alien Empire. It is really out of my imagination 🙂

The type of variables in Python

Haven’t written python code for more than one year, I met this simple problem:

Even the code have print out the value of “a” and “b” as 2 and 1, the condition check “if a >= b:” is false!

Spending more than 10 minutes, I eventually get the reason: the type of “a” is “int” but “b” is “string” (and the interpreter of Python will not report any warning about this “inconsistency”). I should have been taking enough care of the type of these variables.
Seems “print” can’t reveal adequate details of a variable, therefore it is highly suggested we using “pprint” instead of “print”.

The result will be

My understanding of CNN (Convolutional Neural Network)

The classic Neural Network of Machine Learning usually use fully-connection, which will cost too much computing resource to get final result if the inputs are high-resolution images. So comes the Convolutional Neural Network. CNN (Convolutional Neural Network) splits the whole big image into small pieces (called Receptive Fields), and do some “Convolutional Operations” (actually are some image transformations, also called Kernels) on each Receptive Field, then the pooling operation (usually max-polling, which is simply collect a biggest feature weight in a 2X2 matrix).

Receptive Fields is easy to understand, but why do it use different kind of “Convolutional Operations” on them? In my opinion, “Convolutional Operations” means using different kind of Kernel Functions to transfer the same image (for example: sharpen the image, or detect the edge of object in image), so they could reveal different views of the same image.
These different Kernel Functions review different “Features” of a image, thus we call them “Feature Maps”:
Convolutional Neural Network
From http://mxnet.io/tutorials/python/mnist.html
(The matrix of light-yellow is just transferred from light-gray matrix on its left)

By using Receptive Fields and max-pooling, the number of neurons will become very small gradually, which will make computing (or regression) much more easy and fast:
Convolutional Neural Network
From http://www.cnblogs.com/bzjia-blog/p/3442788.html

Therefore, I reckon the main purpose of using CNN is to reduce the difficulty of computing result of a fully-connected Neural Network.