Small tips about containers in Intel Threading Building Blocks and C++11

Changing values in a container
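Suppose we loop over the container by value, along these lines (a minimal sketch; the key/value types are illustrative, and the same point holds for a plain std::map):

```cpp
#include <tbb/concurrent_hash_map.h>
#include <string>

tbb::concurrent_hash_map<std::string, int> table;

void bump_all() {
    // 'item' is a copy of each element, so the increment below is lost
    for (auto item : table) {
        item.second += 1;
    }
}
```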

The code above will not change any value in the container ‘table’: ‘auto’ deduces to std::pair, and ‘item’ is a copy of the real element in ‘table’, so modifying ‘item’ does not change the actual value in the container.
The correct way is:
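(Taking each element by reference; a sketch using the same hypothetical ‘table’.)

```cpp
void bump_all_fixed() {
    // 'item' now refers to the element stored in 'table', so the update sticks
    for (auto& item : table) {
        item.second += 1;
    }
}
```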


Do traversal and modification concurrently in a container
Using concurrent_hash_map like this:
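A sketch of the problematic pattern (types and sizes are illustrative): one thread inserts while another traverses.

```cpp
#include <tbb/concurrent_hash_map.h>
#include <thread>

using Table = tbb::concurrent_hash_map<int, int>;
Table table;

int main() {
    std::thread writer([] {
        for (int i = 0; i < 100000; ++i) {
            Table::accessor acc;
            table.insert(acc, i);        // modification...
            acc->second = i;
        }
    });
    std::thread reader([] {
        long sum = 0;
        for (const auto& item : table)   // ...while another thread traverses
            sum += item.second;
    });
    writer.join();
    reader.join();
    return 0;
}
```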

will cause the program to core dump.
The reason is that concurrent_hash_map cannot be traversed and modified concurrently.

Actually, Intel provides another solution for concurrent traversal and insertion: concurrent_unordered_map.
But be careful: concurrent_unordered_map supports simultaneous traversal and insertion, but not simultaneous traversal and erasure.
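A sketch of the same traversal-plus-insertion pattern with concurrent_unordered_map, which is safe (again, types are illustrative):

```cpp
#include <tbb/concurrent_unordered_map.h>
#include <thread>

tbb::concurrent_unordered_map<int, int> safe_table;

int main() {
    std::thread writer([] {
        for (int i = 0; i < 100000; ++i)
            safe_table.insert({i, i});   // insertion may overlap with traversal
    });
    std::thread reader([] {
        long sum = 0;
        for (const auto& item : safe_table)
            sum += item.second;
    });
    writer.join();
    reader.join();
    // Erasure is different: unsafe_erase() must not run while other threads traverse.
    return 0;
}
```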

The CSE (Common Subexpression Elimination) problem when running a custom operation in Tensorflow

Recently, we created a new custom operation in Tensorflow:
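A minimal sketch of such an op (the op name ‘GetImageIds’ and the helper FetchNextIdFromNetwork() are hypothetical):

```cpp
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

REGISTER_OP("GetImageIds").Output("image_ids: int64");

// Hypothetical stand-in for a call that fetches ids from the network.
static int64 FetchNextIdFromNetwork() { static int64 next = 0; return next++; }

class GetImageIdsOp : public OpKernel {
 public:
  explicit GetImageIdsOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, TensorShape({4}), &output));
    auto ids = output->flat<int64>();
    for (int i = 0; i < 4; ++i) {
      ids(i) = FetchNextIdFromNetwork();  // should run on every sess.run()
    }
  }
};

REGISTER_KERNEL_BUILDER(Name("GetImageIds").Device(DEVICE_CPU), GetImageIdsOp);
```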

It’s as simple as the example in Tensorflow’s documentation. But when we run this Op in a session:
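Roughly like this (a sketch; the library path and generated function name follow the hypothetical op above):

```python
import tensorflow as tf

my_module = tf.load_op_library("./get_image_ids.so")
image_ids = my_module.get_image_ids()

with tf.Session() as sess:
    for _ in range(5):
        print(sess.run(image_ids))   # the same ids come back every iteration
```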

It only gets image_ids from the network once, then uses the result of the first ‘run’ forever, without ever calling the ‘Compute()’ function in the C++ code again!

It seems Tensorflow optimizes the new Op away and never runs it twice. My colleague suggested solving this problem by using tf.placeholder:

That looks a little tricky. The final solution is to add a flag in the C++ code so the new Op avoids CSE (Common Subexpression Elimination):
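I believe the flag in question is marking the op as stateful: a stateful op is assumed to return different results on each call, so Tensorflow excludes it from CSE and constant folding. A sketch, continuing the hypothetical op registration above:

```cpp
REGISTER_OP("GetImageIds")
    .Output("image_ids: int64")
    .SetIsStateful();   // stateful ops are never merged or cached by CSE
```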

Attachment of the ‘CMakeLists.txt’:

Some details about Arduino Uno

In a previous article, I reviewed some open-source hardware for teaching children to program and concluded that the Micro:bit was the best choice. But recently, after searching through many different types of micro-controllers, I noticed that an advanced version of the Arduino Uno R3 (less than $3) is drastically cheaper than the Micro:bit (about $16) on a famous e-commerce website in China. Despite the low price, Arduino also has a simple and intuitive IDE (Integrated Development Environment):

Arduino’s programming environment is derived from Processing, and the language itself looks just like C. A Python developer could also pick it up very quickly.
I also found example code for blinking an LED with an Arduino:
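It is most likely the classic ‘Blink’ sketch that ships with the Arduino IDE:

```cpp
// Blink: toggle the Uno's on-board LED once per second.
void setup() {
  pinMode(LED_BUILTIN, OUTPUT);      // LED_BUILTIN is pin 13 on the Uno
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);   // LED on
  delay(1000);                       // wait one second
  digitalWrite(LED_BUILTIN, LOW);    // LED off
  delay(1000);
}
```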

Easy enough, right? But there are also more convenient graphical programming tools, such as ArduBlock and Mixly.

The demo of ‘ArduBlock’

The demo of ‘Mixly’

With an easy-to-learn language and graphical programming interfaces, I think Arduino is also a good choice for teaching children to program.

Compute gradients of different parts of a model in Tensorflow

In Tensorflow, we can use an Optimizer to train a model:
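A minimal sketch of that one-step style (the loss here is just a stand-in):

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
loss = tf.reduce_mean(tf.square(w))                        # stand-in loss for illustration
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_op = optimizer.minimize(loss)                        # computes and applies gradients in one call
```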

But sometimes a model needs to be split into two parts and trained separately, so we need to compute gradients and apply them in two steps:
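The same training step split into its two explicit stages (continuing the sketch above):

```python
grads_and_vars = optimizer.compute_gradients(loss)         # stage 1: compute gradients
train_op = optimizer.apply_gradients(grads_and_vars)       # stage 2: apply them
```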

Then how can we deliver gradients from the first part to the second part? Here is the equation that answers this:

\frac{\partial Loss} {\partial W_{second-part}} = \frac{\partial Loss} {\partial IV} \cdot \frac{\partial IV} {\partial W_{second-part}}

Here IV means ‘intermediate vector’, the interface vector between the first part and the second part; it belongs to both of them. W_{second-part} is the weights of the second part of the model. Therefore we can use tf.gradients() to connect the gradients of the two parts:
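A sketch of that chain rule in code (all names and shapes are illustrative): the second part produces the intermediate vector iv from its weights, the first part turns iv into the loss, and grad_ys carries dLoss/dIV across the boundary.

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
w_second = tf.Variable(tf.random_normal([4, 8]))           # weights of the "second part"
iv = tf.matmul(x, w_second)                                 # intermediate vector (IV)

w_first = tf.Variable(tf.random_normal([8, 1]))             # weights of the "first part"
loss = tf.reduce_mean(tf.square(tf.matmul(iv, w_first)))

d_loss_d_iv = tf.gradients(loss, iv)                        # dLoss/dIV, from the first part
grads_second = tf.gradients(iv, [w_second],                 # dLoss/dW_second = dLoss/dIV * dIV/dW_second
                            grad_ys=d_loss_d_iv)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_second = optimizer.apply_gradients(zip(grads_second, [w_second]))
```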

My choice between Raspberry PI, Arduino, Pyboard and Micro:bit

I want to teach my child programming. But teaching a child to sit still and keep watching a computer screen is not easy, I think, because children usually can’t focus on a boring development IDE for more than ten minutes. Therefore I tried to find a micro-controller that could be used to do some interesting work for kids, such as reading temperature/humidity from the environment or controlling the micro motors on a toy car.

There are many micro-controller and micro-computer boards on the market, so I had to compare them and finally choose the most suitable one.

Raspberry PI

The Raspberry Pi is very powerful: it can do almost anything a personal computer or laptop can do. But the problem with the Raspberry Pi is that it is too difficult for a child to learn. Another reason I gave up on it is the price: $35 for just the bare board without any peripherals.


Arduino

Arduino is cheap enough, but you can only program it in C. Using C requires solid knowledge of computer science, such as memory models and data structures. Imagine using C to implement a well-working dictionary: for a pupil, it would be like building a spaceship in the backyard.

By now, I had narrowed my choices to boards that support Python, or MicroPython, because Python is easy to understand, looks much more straightforward, and can also be used interactively. In a word, it’s much easier to learn than C.

So let’s take a look at the boards that support MicroPython.


Pyboard

The Pyboard is simple and cheap enough, and it also supports MicroPython. Its only imperfection is that its hardware interface is hard to use for someone who doesn’t know hardware very well.


Micro:bit

Launched in 2015 by the BBC, the Micro:bit is, in my opinion, the most suitable board for a child to learn programming and even IoT (Internet of Things). It is cheap: only $18. It supports programming in MicroPython and JavaScript (with a graphical IDE), so a child can control it with a few lines of Python code. It also supports a bunch of peripherals such as micro motors, thermometers, hygrometers, Bluetooth and WiFi. Children can use it as the core controller of an intelligent toy car. The Micro:bit even has Android/iOS apps for operating it, which is perfect for little children under 7.
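For example, a sketch of a few lines of MicroPython (using the standard microbit module) that scrolls the reading from the on-board thermometer when button A is pressed:

```python
from microbit import display, button_a, temperature

while True:
    if button_a.is_pressed():
        display.scroll(str(temperature()))   # temperature in degrees Celsius
```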

So this is my choice: Micro:bit.

Technical Meeting with Nvidia Corporation

Last week I went to Nvidia in Santa Clara (California) with my colleagues to attend a technical meeting about cutting-edge Deep Learning hardware and software.

The new office building of NVIDIA

On the first day, team leaders from Nvidia introduced their development plans for new hardware and software. The new hardware covers the Tesla V100, NVLink, and HGX (the next generation of DGX); the new software covers CUDA 9.2, NCCL 2.0, and TensorRT 3.0.

Here are some notes from their introduction:

  • The next generation of the Tesla P4 GPU will have tensor cores, 16 GB of memory, and an H.264 decoder (with performance comparable to the Tesla P100) for better inference performance, especially for image/video processing.
  • Software support for tensor cores (mainly in the Tesla V100 GPU) has been integrated into Tensorflow 1.5.
  • TensorRT can fuse three Deep Learning layers (Conv, Bias, and ReLU) into one CBR layer and eliminate concatenation layers to accelerate inference.
  • The tool ‘nvidia-smi’ can show the ‘util’ of a GPU, but ‘80%’ util only means the GPU was running some task (no matter how many CUDA cores were used) for 0.8 seconds within a one-second period. Therefore it’s not an accurate metric for real GPU load; nvprof is a much more powerful and accurate tool for GPU profiling.


On the second day, many teams from Alibaba (my company) asked Nvidia various questions. Here are some of the questions and answers:

Q: Some Deep Learning compilers, such as XLA (from Google) and TVM (from AWS), can compile Python code directly to a GPU intermediate representation. How will Nvidia work with these application-oriented compilers?

A: The Google XLA team will be shut down and move to optimizing TPU performance only. Nvidia will still focus on libraries such as CUDA/cuDNN/TensorRT and will not build frameworks like Tensorflow or MXNet.

Q: Much new hardware has been launched for Deep Learning: Google’s TPU and various ASICs developed by other companies. How will Nvidia keep its cost-performance edge over these new competitors?

A: ASICs are not programmable. If Deep Learning models change, the ASIC becomes useless. For example, the TPU has ReLU/Conv instructions, but if a new type of activation function appears, it will not work anymore. Furthermore, customers can only run the TPU on Google’s cloud, which means they have to put their data in the cloud, with no other choice.

The DGX server

We also visited the demo room showcasing Nvidia’s state-of-the-art hardware for autonomous driving and deep learning. It was a productive meeting, and we learned a lot.

The car of auto-driving testing platform

I am standing before the NVIDIA logo

Books I read in 2017

2017 was not an easy year for me, so I read many books to comfort myself. The books shown above are just the top 10 that I rated.

I can’t remember how many times I have read “The Old Man and the Sea”. This time, I read it for my daughter because she wanted to hear some “fantastic and powerful” stories. After I told her the whole story (of course, I changed some parts for a child), she was a little puzzled. Hmm, it’s hard for a girl to understand, even though the story is truly “powerful” 🙂

At the beginning of 2017 (around February), my old colleague Jian Mei asked for my help building a bird-classification application. I accepted his request and started learning Deep Learning from scratch. Actually, this task made me “alive” again: I began to learn new frameworks (MXNet and Tensorflow), read papers (I hadn’t read papers for many years) and thick books (such as “Deep Learning”). Finally, we completed the application for classifying Chinese birds, and I began my new career in the Deep Learning area. Thanks to my old colleague again.

When I was a child, I read a small comic book about a family living on an isolated island. In January, I found an English book in a bookshop named “The Swiss Family Robinson”, and suddenly realized it must be the original version of that old comic book, so I bought it. It only cost me 20 RMB (about 3 dollars; books are remarkably cheap in China). In the following weeks, I read the book chapter by chapter and retold it to my daughter. This time, she could understand it, and she became interested in this old story.

In 2015, I traveled to Boston with my colleague Coly Li to give a presentation at the Linux Vault conference. In the MIT bookshop, I bought many books, including a history book named “Ancient Rome” (I have read a lot of books about ancient Rome, but never an English one). But 2015 and 2016 were very busy years, and I only found enough time to finish the book in 2017. It contains many pictures, which makes it good for children too. Maybe in the future, I could read it to my children.

Sharing Variables in Tensorflow

This article shows how to use shared variables in Tensorflow. But I still had a question: do shared variables have the same value? To answer it, I wrote the code below:
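A sketch of that experiment (the scope and variable names are illustrative): first share a variable through a scope with reuse, then try to make two independent variables agree.

```python
import tensorflow as tf

# Case 1: reusing a scope hands back the very same variable object.
with tf.variable_scope("model") as scope:
    v1 = tf.get_variable("w", shape=[1], initializer=tf.zeros_initializer())
    scope.reuse_variables()
    v2 = tf.get_variable("w", shape=[1])
print(v1 is v2)                  # True: "sharing" means one variable used in several places

# Case 2: two independently created variables never track each other's value.
a = tf.Variable([1.0])
b = tf.Variable([2.0])
copy_op = tf.assign(b, a)        # only an explicit assign makes their values equal

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([a, b]))      # [1.0], [2.0]
    sess.run(copy_op)
    print(sess.run([a, b]))      # [1.0], [1.0]
```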

Therefore, the “sharing variables” mechanism exists only for the convenience of writing short code that builds multiple models. To make different variables share the same value, we still need the ‘assign’ operation.

Some tips about Tensorflow

Q: How to fix an error report like this:

A: We can’t feed a value into a variable and optimize it at the same time (so this problem only occurs when using Optimizers). We should use ‘tf.assign()’ in the graph to give a value to a tf.Variable.
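A sketch of the workaround: feed a placeholder and assign its value to the variable inside the graph (names are illustrative).

```python
import tensorflow as tf

v = tf.Variable(0.0)
new_value = tf.placeholder(tf.float32, shape=[])
set_v = tf.assign(v, new_value)                    # assignment happens inside the graph

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(set_v, feed_dict={new_value: 3.0})    # feed the placeholder, not the variable
    print(sess.run(v))                             # 3.0
```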

Q: How to get a tensor by name?

A: like this:
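(A sketch; ‘my_op’ is a hypothetical node name, and the ‘:0’ suffix selects the op’s first output.)

```python
import tensorflow as tf

t = tf.get_default_graph().get_tensor_by_name("my_op:0")
```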

Q: How to get variable by name?
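A: Two common ways (a sketch; the scope and variable names are hypothetical): filter the graph’s global variables by name, or re-open the scope with reuse=True.

```python
import tensorflow as tf

# Option 1: search the collection of global variables by full name.
var = [v for v in tf.global_variables() if v.name == "my_scope/weights:0"][0]

# Option 2: re-enter the scope with reuse=True and ask get_variable() for it.
with tf.variable_scope("my_scope", reuse=True):
    var2 = tf.get_variable("weights")
```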


How to average gradients in Tensorflow

Sometimes we need to average an array of gradients in a deep learning model. Fortunately, Tensorflow decomposes models into fine-grained tensors and operations, so it’s not difficult to implement gradient averaging with it.

Let’s look at the code from GitHub:
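A sketch along the lines of the multi-GPU CIFAR-10 average_gradients() function (parameter names are illustrative):

```python
import tensorflow as tf

def average_gradients(tower_grads):
    """tower_grads: one list of (gradient, variable) pairs per tower/GPU.
    Returns a single list of (averaged_gradient, variable) pairs."""
    average_grads = []
    for grad_and_vars in zip(*tower_grads):      # group the same variable across towers
        grads = []
        for g, _ in grad_and_vars:
            grads.append(tf.expand_dims(g, 0))   # add a leading "tower" dimension
        grad = tf.concat(grads, axis=0)          # stack along the tower dimension
        grad = tf.reduce_mean(grad, 0)           # average over towers
        v = grad_and_vars[0][1]                  # the variable is shared by all towers
        average_grads.append((grad, v))
    return average_grads
```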

We should keep in mind that this code only builds a static graph (the ‘grads’ are tensor references rather than values).

First, we expand the dimensions of each gradient tensor and concatenate them; then we use reduce_mean() to do the actual averaging (which seems a bit unintuitive).