Read paper “Large-Scale Machine Learning with Stochastic Gradient Descent”

Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent


This GD(Gradient Descent), which is used for computing weight of NN (also used for other Machine Learning Algorithm). zi represents the example ‘i’, also as (xi, yi). After calculate all examples, we need to compute the average for all differentials by weight. Calculating all examples is a slow progress, so we can image GD is not adequate efficient.


Here comes the SGD, which use only one example to compute gradient. It is simpler, and more efficient.


Using SGD in K-mean clustering algorithm seems counterintuitive for me at first glance. But after thinking about “Sample zi belongs to cluster of wk, then don’t wait for all samples, just update wk by zi“, it becomes conceivable.


ASGD is suitable for distributed machine learning environment, since it could get averaged gradient from any example of data at any time (no order restrain).

Data Preprocessing in Tableau

In previous article, I created two tables in my Redshift Cluster. Now I wan’t to find out the relation between salary of every employee and their working age. Tableau is the best choice for visualizing data analysis (SAS is too expensive and has no trail-version for learning).
First, we connect to Redshift in Tableau, and double-click the “New Custom SQL”. In the popup window, type in our SQL to query first-year-salary of every employee:

Now we have the table “custom sql query”. Drag in table “salary”, and choose “inner join” for employee_id, start_date:

Click into the “Sheet 1”. Drag “salary” to “Rows”, “min_start_date” to “Columns”, and “employee_id” to “Color” in “Marks” panel.

Now we can see the “expensive employees” (who have the most high salary in the same first-year) on the top of the graph:

Instead of adding custom SQL in tableau datasource panel, we can also create view in Redshift, and let tableau show views in “Tables”.

Or using “WITH” clause

Example datasets for Amazon RedShift

Last year, I imported two datasets to Hive. Currently, I will load two these two datasets into Amazon RedShift instead.
After created a RedShift Cluster in my VPC, I couldn’t connect to it even with Elastic IP. Then I check the parameters of my VPC between AWS’s default VPC, and eventually saw the vital differences. First, set “Network ACL” in “VPC” of AWS:

Then, add rule in “Route table”, which let node to access Anywhere( through “Internet Gateway” (also created in “VPC” service):

Now I could connect to my RedShift cluster.

Create s3 bucket by AWS Cli:

Upload two csv files into bucekt:

Create tables in Redshift by using SQL-Bench:

Don’t put blank space or tab(‘\t’) before column name when creating table. or else Redshift will consider column name as
”     employee_id”
”     salary”

Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.

We could see the success report like this:

There are “Warnings” but “successfully”, a little weird. But don’t worry, it’s ok for SQL-Bench.

Currently we could run this script which was wrote last year (But need to change ‘==’ to ‘=’ for compatible problem):

The result is

Enable audit log for AWS Redshift

When I was trying to enable the Audit Log for AWS Redshift, I chose to use a exists bucket in S3. But it report error:

"Cannot read ACLs of bucket redshift-robin. Please ensure that your IAM permissions are set up correctly."
"Service: AmazonRedshift; Status Code: 400; Error Code: InsufficientS3BucketPolicyFault ...."

According to this document, I need to change permission of bucket "redshift-robin". So I entered the AWS Console of S3, click bucket name of "redshift-robin" in left panel, and saw description of permissions:

Press "Add Bucket Policy", and in the pop-out-window, press "AWS Policy Generator". Here came the generator, which is easy to use for creating policy.
Add two policy for "redshift-robin":

The "902366379725" is the account-id of us-west-2 region (Oregon)

Click "Generate Policy", and copy the generated JSON to "Bucket Policy Editor":

Press "Save". Now, we could enable Audit Log of Redshift for bucket "redshift-robin":

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: In-Datacenter Performance Analysis of a Tensor Processing Unit”

Using floating point (16bit or 32bit) for NN (Neural Network) training, then a step called quantization transforms floating-point numbers into narrow integers–often just 8 bits–which are usually good enough for inference.
MLP(Multi-layer Perceptions), CNN(Convolutional Neural Netowrks), and RNN(Recurrent Neural Networks), these three types of NN represent 95% of NN inference workload in Google datacenter. Therefore, the TPU mainly focus on them.

As we can see, CNNs are usually dense-computing NN, which are better for TPU.

TPU has 25 times as many MACs (Multiply and Accumulate) and 3.5 times as much on-chip memory as the K80 GPU.

The TPU was designed to be a coprocessor on the PCIe I/O bus, more like FPU(floating-poin unit) than it is to a GPU.

The parameters of NN model (weights) comes from off-chip memory (8G DDR3 DRAM) to Weight FIFO, and then flow into MMU(Matrix Multiply Unit). The request (sample need to be inference) comes from PCIe to Unified Buffer, and also flow into MMU finally.
Even the “Activation” and “Pooling” algorithm in CNN have been fixed into hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.

According to this Floor Plan, we can imaging that UB and MMU might cost most energy of TPU.

TPU instructions follow the CISC tradition and only has about a dozen instructions, include “Read_Host_Memory”, “Read_Weights”, “MatrixMultiply”, “Activate” etc. Recalling how many codes we need to write to implement a effective Activation function, then we could conceive the speed of using only one “Activate” instruction in TPU.
This paper said TPU is a type of Systolic Array. But what is Systolic Array? Here is the explain: A systolic array is a network of processors that rhythmically compute and pass data through the system.

There are lot of tables and diagrams which show the top-rate performance of TPU. Although the TPU is fast, it also depend on the computing-density of applications. The CNNs are most computing-dense NN, so it gains most speed(or TeraOps per second) from TPU:

In this paper, it didn’t explain why the GPU is slower than TPU in inference operation. The only sentence about this topic is in “8 Discussion”: “GPUs have traditionally been seen as high-throughput architectures that reply on high-bandwidth DRAM and thousands of threads to achieve their goals”. Actually, I think this is not a serious explain.
The interesting thing is, after Google publish this paper, the CEO of Nvidia – Jensen Huang, wrote a blog to gently appeal a fact: the state-of-the-art GPU (Tesla P40) can inference faster than TPU. The war between different giants of Deep learning is just beginning.

Using antlr3 to generate C++ code

Need to parse SQL query to C++ code in project, so I had to learn antlr these days.
Let’s write a small sample file “Calc.g” for antlr3:

Then add “antlr-3.5.2-complete.jar” (run “mvn package” on source code path of antlr3 will generate this jar) to CLASSPATH and run:

It will generate many code files: CalcLexer.[hpp/cpp], CalcParser.[hpp/cpp], Calc.tokens. Now we could compile all generated C++ codes:

(“ANTLR3_SRC_PATH” is where the antlr3 source code are
But this step lead compiler errors for g++:

The reason is ‘ID’ and ‘INT’ should be declare as “const CommonTokenType*”, rather than “CommonTokenType*”. The fix has already be commited to antlr3 and be contained in master branch of git tree. Therefore I checkout the master branch of antlr3 instead of “3.5.2” tag, re-package the jar of antlr3, re-generate the code for “Calc.g”, and the compiler errors disappeared.
The target executable file is “Test”, now we can use it to parse our “code”:

The result ’15’ is correct.

Read paper “A Column Store Engine for Real-Time Streaming Analytics” (MemSQL)

Paper reference: A Column Store Engine for Real-Time Streaming Analytics

According to the official website, MemSQL is “a high performance data warehouse designed for the cloud that delivers ultra fast insights of your live and historical data”. It use row-storage and lock-free engine for data in memory and column-storage for data in disk.
MemSQL could also store all data into disk. In its most durable state, MemSQL will not lose any transactions which have been acknowledged. And it implements “Read Commited” isolation level for transactions.
I heard about MemSQL in early 2012, but don’t know it has became a OLTP-and-OLAP system until several days ago.

What is the problem?
To fulfill OLAP jobs, MemSQL has to store data into disk with columnar-storage-format. But MemSQL still need to process OLTP requests such as random INSERT or UPDATE. Therefore it store data into a data structure named “Segment”. And by connecting different segments into a ordered segments list (which named “Sorted Runs”), MemSQL could balance the requirements between frequent INSERT/UPDATE operations and SCAN/GROUP BY operations.
For example:
1. Users INSERT three keys: 1, 92, 107. MemSQL will create a segment that contains the three keys:
        [1, 92, 107]
2. Users continue to INSERT two keys: 63, 84. The segments list are:
        [1, 92, 107]
        [63, 84]
3. After many INSERT operations, the segments become:
        [1, 92, 107]
        [2, 17, 42]
        [63, 84]
        [110, 118, 172]

Now,MemSQL makes these segments into “Sorted Runs”, which have a basic order for keys:

When the SELECT comes, MemSQL could find the row quickly by just looking up two ordered segment-lists. Uses could also SCAN two segment-lists effectively to do OLAP tasks, since all the data are stored in columnar-format.
What happen if users INSERT more rows? MemSQL will merge the old big Sorted-Runs and create new segment for freshly-insert-data, which could keep the number of Sorted-Runs acceptable.
In practice, MemSQL column store engine uses a constant of 8, so that the biggest sorted run has at least of all the segments, the second biggest sorted run has at least of the remaining segments, and so forth. This strategy seems just like LSM tree in LevelDB of Google. The difference between LevelDB and MemSQL is LevelDB store every key-value pair respectively but MemSQL store a batch of rows into one segment.

If INSERT operations come when MemSQL is merging, the small merging actions will be aborted and relaunch. For big merging actions, it will barely skip any missing or updated segments, for skipping some segments will not ruin the new merged Sorted-Runs.
As we can see, MemSQL endeavors to avoid locking for in-memory data operations, which makes its performance significantly superior.

There are also some practical considerations for MemSQL.

  1. If only one merger are working at big Sort-Runs, it will cost too much time and the small Sorted-Runs will become tremendous for intensive INSERT operations. So MemSQL launch two mergers: Fast Merger and Slow Merger.
  2. MemSQL could accumulates some rows before batching them into a segment, which will decrease the fragment of data.
  3. MemSQL also create special Commands for users to sort all the segments, or just decrease the level of Sorted-Runs.
  4. Columnar-Storage format in memory makes using SIMD instruments of CPU possible. This shed some light on me: may be one day we can run machine learning jobs on MemSQL directly 🙂

Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write

A job in Hadoop consists of three main stages: map, shuffle, reduce (Actually shuffle stage has been contained into reduce stage).

What is the problem?
Shuffle phase need to migrate large mount of data from nodes which running map job to those nodes which intend to run reduce job. This cause shuffle-latency which is usually significant. And the reason is:

  • Partitioning skew: Hadoop use hash algorithm to organize output data of map task, if too many keys have the same hash, it may cause unevenly size of partitions
  • Coupling of shuffle and reduce: data shuffling can’t be overlapped with map tasks

Solution: iShuffle

    • “Shuffler” collect intermediate data generated by every map task and predict size of respective partition
    • “Shuffler Manager” collect informations from “Shuffler” and decide the position of partitions

    • Shuffle-on-Write:While a map task writing a spill to local filesystem, it will (by modification of Hadoop code) also write spill to correspondent node where reduce task will be launched
    • Automated map out placement: iShuffle will decide the position for every partition by “map selectivity”, which is the ratio of map input size and map output size. After predicting the “map selectivity” and knowing the total input size of data, iShuffle could finally choose optimist node for every partition data

  • Flexible reduce scheduling: when a node request a reduce task, the Task Manager(after modification of Hadoop’s FIFO scheduler) will find the list of partitions reside on this node, and launch reduce task only for these partitions (Make sure reduce task will only read shuffled data from local filesystem which will reduce network-bandwidth at reduce stage)

In my opinion
Using prediction technology to proactively move map output to apt nodes, which avoid partition skew, is the most intelligent part of this paper. This tech could also be used to other intermediate data moving scenario like OLAP in Data-warehouse.
But, I also suspect that in real production, not too many organizations will use this “iShuffle” as they usually run multi-user applications in Hadoop system. When a lot of jobs running in one Hadoop cluster simultaneously, a low peak of CPU-usage made by long reduce latency of one job will be compensated by other computing-intensive jobs. Therefore to all users, none of hardware resources are wasted.

Performance comparison between CPU and GPU

To compare the performance of floating point arithmetic between Intel CPU and Nvidia GPU, I write some code to do the dot-product operation of two vectors with size of 2GB.
The code for CPU test is using AVX instrument:

and use

to compile it.
It cost 7.5 seconds to run this test program (LOOP is 10). But my colleague pointed out for me that this program is a “memory-intensive” program as it will sequentially access two 2GB vectors. The access of memory will cost CPU about 200~250 cycles but the _mm256_mul_ps() only cost 5~10 cycles, therefore the primary time has been waste on memory accessing. The effective way to test AVX instrument is using L1-cache of CPU artfully:

By chopping vectors into 4K “stride” and repeatedly run AVX instrument on one stride, we can use L1-cache of CPU more intensely. The result is prodigious: it cost only 0.78 seconds, almost ten times faster!

My colleague proceeded recommending me to use MKL (Intel’s Match Kernel Library) to test Xeon CPU because it was of many heavily optimizations for Intel-specific-hardware-architecture. In a word, it’s better to use library instead of raw code to evaluate performance of CPU and GPU. So finally, I decided to use mxnet to test performance with real data.


to build mxnet with cuDNN library (for GPU) and MKL(for CPU), I run my program for bird-classification. And the result shows: the performance of CPU and GPU is about 1 : 5, that GPU is much faster than total CPU-cores in a server.

Use mxnet to classify images of birds (third episode)

After using CNN in previous article, it still can’t recognize the correct name of birds if the little creature stand on the corner (instead of the center) of the whole picture. Then I started to think about the problem: how to let neural-network ignore the position of the bird in picture, but only focus on its exists? Eventually I recollected the “max pooling”:


By choose the max feature value from 2×2 pad, it will amplify the most important feature without affected by backgrounds. For example, if we split a picture into 2×2 chassis (4 plates) and the bird only stand in the first plate, the “max pooling” will choose only the first plate for next processing. Those trees, pools, leaves and other trivial issues in other three plates will be omitted.

Then I modify the structure of CNN again:

and using “0.3” for my learning rate, as “0.3” is better to against overfitting.

For one week (Chinese New Year Festival), I was studying “Neural Networks and Deep Learning”. This book is briefly awesome! A lot of doubts about Neural Networks for me have been explained and resolved. In third chapter, the author Michael Nielsen suggests a method, which really enlightened me, to defeat overfitting: artificially expanding training data. The example is rotating the MNIST handwritten digital picture by 15 degrees:

In my case, I decided to crop different parts of bird picture if the picture is a rectangle:

by using the python PIL (Picture Processing Library):

The effect of using “max pooling” and “expanding training data” is significant: