Use pandas and matplotlib to draw a line chart

I have two CSV files. Their content looks something like this (the original sample rows are not preserved; the columns below are an assumption):
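```
date,value
2017-01-01,3.0
2017-01-02,4.5
2017-01-03,4.1
```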

The simplest way to load and plot them is by using pandas and matplotlib.
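A minimal sketch, assuming the columns above (the file names are assumptions too):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load both CSV files (file names are assumptions)
df1 = pd.read_csv("first.csv")
df2 = pd.read_csv("second.csv")

# Draw one line per file on the same figure
plt.plot(df1["date"], df1["value"], label="first")
plt.plot(df2["date"], df2["value"], label="second")
plt.legend()
plt.show()
```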

The figure drawn by this snippet is shown below:


[Figure: a line chart drawn with matplotlib]

The problem of ‘bool’ type in argparse of Python 2.7

To learn from the example of distributed TensorFlow, I wrote this snippet (reconstructed below; only the ‘training’ flag is from the original):
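```python
#!/usr/bin/env python
import argparse

parser = argparse.ArgumentParser()
# The TensorFlow way: register a 'bool' type by name for the parser
parser.register("type", "bool", lambda v: v.lower() == "true")
# Note: this passes the BUILTIN bool, not the registered "bool" type
parser.add_argument("--training", type=bool, default=True)

FLAGS = parser.parse_args()
print(FLAGS)
```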

“parser.register()” is the TensorFlow way of registering a ‘bool’ type for the parser. But it can’t work! In my shell, I ran invocations like these (the exact commands are not preserved):
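```sh
# the script name is an assumption
python test.py --training False
python test.py --training=False
python test.py --training false
```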

They all print out “Namespace(training=True)”, which means the code above can’t change the value of the ‘training’ argument (my Python version is 2.7.5).

The correct code should be:
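```python
#!/usr/bin/env python
import argparse

parser = argparse.ArgumentParser()
parser.register("type", "bool", lambda v: v.lower() == "true")
# Pass the registered type by NAME (the string "bool"), not the builtin bool:
# bool("False") is True, because any non-empty string is truthy.
parser.add_argument("--training", type="bool", default=True)

FLAGS = parser.parse_args()
print(FLAGS)   # "--training False" now yields Namespace(training=False)
```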

Using Python to access HBase through JPype

First, we need to write a Java function to get data from HBase (a sketch; class, table, and column names are assumptions):
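```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReader {
    // Fetch one cell from HBase by row key.
    // Table "mytable", family "cf" and qualifier "col" are assumptions.
    public static String get(String rowKey) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {
            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            return Bytes.toString(result.getValue(Bytes.toBytes("cf"),
                                                  Bytes.toBytes("col")));
        }
    }
}
```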

Then use maven to build it into one jar file with all dependent libraries (for example, with the maven-assembly-plugin’s “jar-with-dependencies” descriptor configured in pom.xml):
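```sh
mvn clean package assembly:single
```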

Now we can use Python to call this class from Java by using JPype (the jar name follows the assembly step above):
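```python
import jpype

# Start an embedded JVM with the fat jar on its classpath (jar name is an assumption)
jpype.startJVM(jpype.getDefaultJVMPath(),
               "-Djava.class.path=hbase-reader-jar-with-dependencies.jar")

HBaseReader = jpype.JClass("HBaseReader")
print(HBaseReader.get("row1"))   # fetch one cell through the Java code

jpype.shutdownJVM()
```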

This Python example runs correctly. But if we use it inside tf.py_func(), it core-dumps in libjvm.so, which is difficult to debug. So at last we chose to write the operation in C++ and access HBase through the Thrift server, which is better for stability and grace of architecture.

Small tips about containers in Intel Threading Building Blocks and C++11

Changing values in a container
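Consider a loop like this (a sketch; the original container isn’t preserved, but the pitfall is the same for any map-like container):

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> table = {{1, 10}, {2, 20}};
    for (auto item : table) {   // 'item' is deduced as std::pair<const int, int>
        item.second = 0;        // modifies only a copy of the element
    }
    return 0;
}
```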

The code above will not change any value in the container ‘table’: ‘auto’ deduces to std::pair and ‘item’ becomes a copy of the real item in ‘table’, so modifying ‘item’ will not change the actual value in the container.
The correct way is:
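```cpp
for (auto& item : table) {   // a reference to the real element
    item.second = 0;
}
```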

or:
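```cpp
for (auto iter = table.begin(); iter != table.end(); ++iter) {
    iter->second = 0;        // modify through the iterator
}
```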

Traversing and modifying a container concurrently
Using concurrent_hash_map like this (a sketch of the pattern: one thread inserts while another traverses):
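```cpp
#include <thread>
#include <tbb/concurrent_hash_map.h>

tbb::concurrent_hash_map<int, int> table;

int main() {
    // One thread keeps inserting ...
    std::thread writer([] {
        for (int i = 0; i < 1000000; ++i) {
            tbb::concurrent_hash_map<int, int>::accessor acc;
            table.insert(acc, i);
            acc->second = i;
        }
    });
    // ... while the main thread traverses the map at the same time
    long sum = 0;
    for (auto it = table.begin(); it != table.end(); ++it)
        sum += it->second;
    writer.join();
    return 0;
}
```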

will cause the program to core dump.
The reason is that concurrent_hash_map can’t be modified and traversed concurrently.

Actually, Intel provides another solution for concurrent traversal and insertion: concurrent_unordered_map.
But still be careful: concurrent_unordered_map supports simultaneous traversal and insertion, but not simultaneous traversal and erasure (its erase member is tellingly named unsafe_erase).
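A sketch of the supported pattern:

```cpp
#include <thread>
#include <tbb/concurrent_unordered_map.h>

tbb::concurrent_unordered_map<int, int> table;

int main() {
    // Insertion is safe while another thread traverses ...
    std::thread writer([] {
        for (int i = 0; i < 1000000; ++i)
            table.insert({i, i});
    });
    long sum = 0;
    for (auto it = table.begin(); it != table.end(); ++it)
        sum += it->second;   // ... but calling unsafe_erase() here would not be safe
    writer.join();
    return 0;
}
```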

Importance of a function’s return type in Scala

The Scala code is below (a reconstruction; only ‘call’, ‘Student’, and “robin” are from the original):
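```scala
class Person {
  def call {}                      // procedure syntax: no '=', so the return type is Unit
}

class Student extends Person {
  override def call { "robin" }    // the String value is silently discarded
}

object Main extends App {
  println((new Student).call)
}
```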

But the output of this section of code is:
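```
()
```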

Nothing, just a pair of parentheses. What happened? Why doesn’t Scala use the ‘call’ function in the subclass ‘Student’?
The answer is: we forgot to define the return type of the function ‘call’, so its return type defaults to ‘Unit’. The only value of ‘Unit’ is ‘()’. Hence no matter what value we give to ‘call’, it only returns ‘()’.
We just need to add the return type:
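```scala
class Person {
  def call: String = ""
}

class Student extends Person {
  override def call: String = "robin"   // the value is actually returned now
}

object Main extends App {
  println((new Student).call)
}
```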

It prints “robin” now.

Using antlr3 to generate C++ code

I need to parse SQL queries in a C++ project, so I had to learn antlr these days.
Let’s write a small sample file “Calc.g” for antlr3 (a sketch; the original grammar evidently also embedded actions to evaluate the expression, which are omitted here):
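```
grammar Calc;

options {
    language = Cpp;
}

// Boilerplate required by the antlr3 C++ target (names may need adjusting)
@lexer::traits {
    class CalcLexer;
    class CalcParser;
    typedef antlr3::Traits<CalcLexer, CalcParser> CalcLexerTraits;
    typedef CalcLexerTraits CalcParserTraits;
}

expr : mult (('+' | '-') mult)* ;
mult : atom (('*' | '/') atom)* ;
atom : INT | ID | '(' expr ')' ;

ID  : ('a'..'z' | 'A'..'Z')+ ;
INT : '0'..'9'+ ;
WS  : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```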

Then add “antlr-3.5.2-complete.jar” to the CLASSPATH (running “mvn package” in the antlr3 source tree generates this jar) and run:
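```sh
java org.antlr.Tool Calc.g
```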

It will generate several code files: CalcLexer.[hpp/cpp], CalcParser.[hpp/cpp], and Calc.tokens. Now we can compile all the generated C++ code (together with a hand-written driver; “main.cpp” here is an assumption):
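```sh
g++ -o Test main.cpp CalcLexer.cpp CalcParser.cpp -I. -I$ANTLR3_SRC_PATH/runtime/Cpp/include
```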

(“ANTLR3_SRC_PATH” is where the antlr3 source code is.)
But this step led to compiler errors from g++:

The reason is that ‘ID’ and ‘INT’ should be declared as “const CommonTokenType*” rather than “CommonTokenType*”. The fix has already been committed to antlr3 and is contained in the master branch of the git tree. Therefore I checked out the master branch of antlr3 instead of the “3.5.2” tag, re-packaged the antlr3 jar, re-generated the code for “Calc.g”, and the compiler errors disappeared.
The target executable file is “Test”; now we can use it to parse our “code”. The exact input isn’t preserved, but assuming the driver reads a file containing an expression:
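```sh
echo "1 + 2 * 7" > code
./Test code
```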

The result ’15’ is correct.

Performance comparison between CPU and GPU

To compare the floating-point arithmetic performance of an Intel CPU and an Nvidia GPU, I wrote some code to do the dot-product of two vectors, each 2GB in size.
The code for the CPU test uses AVX instructions (a sketch; initialization values and the final reduction are reconstructions):
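```c
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE (2UL * 1024 * 1024 * 1024)   /* 2GB per vector */
#define LOOP 10

int main() {
    size_t n = SIZE / sizeof(float);
    float *a = aligned_alloc(32, SIZE);   /* 32-byte alignment for AVX loads */
    float *b = aligned_alloc(32, SIZE);
    float result = 0.0f;

    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    for (int l = 0; l < LOOP; l++) {
        __m256 sum = _mm256_setzero_ps();
        /* one full sequential pass over both 2GB vectors per loop */
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_load_ps(a + i);
            __m256 vb = _mm256_load_ps(b + i);
            sum = _mm256_add_ps(sum, _mm256_mul_ps(va, vb));
        }
        float tmp[8];
        _mm256_storeu_ps(tmp, sum);
        result = tmp[0] + tmp[1] + tmp[2] + tmp[3]
               + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    }
    printf("%f\n", result);
    return 0;
}
```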

and use a command like
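```sh
# the exact flags from the original are not preserved
gcc -O2 -mavx -std=c11 avx_test.c -o avx_test
```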

to compile it.
It cost 7.5 seconds to run this test program (LOOP is 10). But my colleague pointed out that this program is “memory-intensive”: it sequentially accesses two 2GB vectors, and an access to main memory costs the CPU about 200~250 cycles while _mm256_mul_ps() costs only 5~10 cycles, so almost all the time is spent on memory access. An effective way to test AVX instructions is to use the CPU’s L1 cache artfully:
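```c
/* Replaces the single sequential pass in the program above: all LOOP
   passes run over one 4KB stride before moving to the next stride,
   so each stride stays resident in the CPU's L1 cache. */
#define STRIDE (4096 / sizeof(float))

for (size_t base = 0; base < n; base += STRIDE) {
    __m256 sum = _mm256_setzero_ps();
    for (int l = 0; l < LOOP; l++) {
        for (size_t i = base; i < base + STRIDE; i += 8) {
            __m256 va = _mm256_load_ps(a + i);
            __m256 vb = _mm256_load_ps(b + i);
            sum = _mm256_add_ps(sum, _mm256_mul_ps(va, vb));
        }
    }
    /* reduce 'sum' into 'result' as before */
}
```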

By chopping the vectors into 4KB “strides” and repeatedly running the AVX instructions on one stride, we use the CPU’s L1 cache much more intensively. The result is prodigious: it cost only 0.78 seconds, almost ten times faster!

My colleague went on to recommend MKL (Intel’s Math Kernel Library) for testing the Xeon CPU, because it contains many heavy optimizations for Intel-specific hardware architecture. In a word, it’s better to use a library instead of raw code to evaluate the performance of CPUs and GPUs. So finally, I decided to use mxnet to test performance with real data.

Using a command like
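```sh
# standard mxnet Makefile flags; the exact original values are assumptions
make -j$(nproc) USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_BLAS=mkl
```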

to build mxnet with the cuDNN library (for GPU) and MKL (for CPU), I ran my program for bird classification. The result shows that the performance ratio of CPU to GPU is about 1:5, i.e. the GPU is much faster than all the CPU cores in a server combined.

A CUDA program to test performance of GPU

To test the performance of our Nvidia GPU, I had to write my first CUDA program, which multiplies two vectors, each of size 2GB (a sketch; launch parameters and initialization are reconstructions, and timing code is omitted):
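```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N (2UL * 1024 * 1024 * 1024 / sizeof(float))   // 2GB of floats per vector
#define LOOP 10

// Element-wise vector multiply kernel
__global__ void VecMul(const float *a, const float *b, float *c, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] * b[i];
}

int main() {
    size_t bytes = N * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    for (size_t i = 0; i < N; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers: needs a GPU with at least 6GB of memory
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    // Host-to-device copies: this is the part that costs about 1 second
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (int)((N + threads - 1) / threads);
    for (int l = 0; l < LOOP; l++)
        VecMul<<<blocks, threads>>>(da, db, dc, N);
    cudaDeviceSynchronize();

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb);
    return 0;
}
```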

Luckily, it works 🙂
The cudaMemcpy() cost about 1 second, but the multiplication of the two vectors cost only 80 microseconds (even with the default LOOP of 10). Therefore I reckon the GPU is perfect for the training phase of machine learning, but not as promising for prediction once the model has been built.

Note: use cudaMalloc()/cudaMemcpy() instead of malloc()/memcpy() from the standard C library for the device buffers, or else the program will not run VecMul<<<>>>.

The type of variables in Python

Not having written Python code for more than a year, I met this simple problem (a sketch; in the real code “b” presumably came from user input or sys.argv):
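```python
a = 2
b = "1"   # e.g. read from raw_input() or sys.argv, so it is a string

print(a)  # 2
print(b)  # 1 (looks like a number, but isn't)
if a >= b:
    print("a >= b")
else:
    print("a < b")   # in Python 2 this branch runs: any int compares less than any str
```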

Even though the code printed the values of “a” and “b” as 2 and 1, the condition check “if a >= b:” is false!

After spending more than 10 minutes, I eventually got the reason: the type of “a” is “int” but “b” is “string”, and the Python interpreter reports no warning about this “inconsistency” (in Python 2, any integer compares as less than any string). I should have taken more care over the types of these variables.
It seems “print” can’t reveal adequate details of a variable, so it is highly suggested to use “pprint” instead of “print”:
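```python
from pprint import pprint

pprint(a)
pprint(b)   # pprint shows the repr, so the quotes expose the string
```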

The result will be:
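```
2
'1'
```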

Using “sysbench” to test memory performance

Sysbench is a powerful testing tool for CPU / memory / MySQL etc. Three years ago, I used it to test the performance of MySQL.
Yesterday, I used Sysbench to test the memory bandwidth of my server.
By using this command:
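```sh
# the total size is an assumption
sysbench --test=memory --memory-block-size=1M --memory-total-size=100G run
```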

It reported that the memory bandwidth could reach 8.4GB/s, which made sense to me.
But after decreasing the block size (changing 1M to 1K):
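```sh
sysbench --test=memory --memory-block-size=1K --memory-total-size=100G run
```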

The memory bandwidth reported by Sysbench became only 2GB/s.

This regression in memory performance really confused me. Maybe the memory of modern machines has some kind of “max access frequency”, so we can’t access it too frequently?
After checking the code of Sysbench, I found that its memory-test logic is essentially this program (I wrote the sketch below myself; block and total sizes mirror the runs above):
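```c
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 1024                          /* 1KB, as in the slow run */
#define TOTAL_SIZE (100UL * 1024 * 1024 * 1024)  /* 100GB in total */

int main() {
    char *buf = malloc(BLOCK_SIZE);
    size_t count = TOTAL_SIZE / BLOCK_SIZE;
    /* Sysbench's memory write test boils down to filling one block repeatedly */
    for (size_t i = 0; i < count; i++)
        memset(buf, 0, BLOCK_SIZE);
    free(buf);
    return 0;
}
```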

But this test program cost only 14 seconds (Sysbench cost 49 seconds). To find the root cause, we need a more powerful tool, perf:
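```sh
perf stat -e cache-misses ./memtest
perf stat -e cache-misses sysbench --test=memory --memory-block-size=1K --memory-total-size=100G run
```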

They show totally different CPU cache-miss counts. The root cause is that Sysbench uses a complicated framework to support different test targets (MySQL / memory / …): it has to pass a structure named “request” and many other arguments in and out of the execution_request() function for every request (an access of 1K of memory, in our scenario), and this overhead becomes significant when the block size is so small.

The conclusion is: don’t use Sysbench to test memory performance with too small a block size; better to keep it at 1MB or above.

Ref: according to Coly Li’s teaching, memory does have a “top limit access frequency” (link). Take DDR4-1866 for example: its data rate is 1866MT/s (MT = MegaTransfers) and every transfer carries 8 bytes, so theoretically we can access memory more than 1 billion times per second.