Experiment for distributed Tensorflow

Here is my experimental code for distributed Tensorflow, which I learned from the official example.

The important thing is that we need to use tf.assign() to push a Variable back to the parameter server. In this example, the operation ‘tf.add’ was intended to run on task 0 of the worker job. But if we deploy a more complicated application across many tasks, things get weird: a pipeline operation sometimes even runs on the ‘ps’ role! The official solution to this problem is ‘tf.train.replica_device_setter()’, which automatically deploys Variables to the parameter servers and Operations (as many replicas) to the workers. What does ‘tf.train.replica_device_setter()’ actually do? Let’s look at the backbone of its implementation:

All Variables are counted as ‘ps_ops’, and the deployment strategy for Operations is replication, which is why the class is called ‘_ReplicaDeviceChooser’.

Every op in ‘self._ps_ops’ will be placed on a ‘ps_device’.
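As a concrete illustration, here is a simplified pure-Python sketch of the placement strategy described above: ops whose type marks them as Variables go to the parameter servers (round-robin across ps tasks), while everything else stays on the worker. The class below is a toy stand-in, not Tensorflow’s real ‘_ReplicaDeviceChooser’:

```python
# Op types treated as parameter-server ops (Variables, in essence).
PS_OPS = ("Variable", "VariableV2")

class ToyReplicaDeviceChooser:
    """Simplified stand-in for _ReplicaDeviceChooser: variable ops go to
    ps tasks round-robin, all other ops stay on the worker device."""

    def __init__(self, ps_tasks, worker_device):
        self.ps_tasks = ps_tasks
        self.worker_device = worker_device
        self.next_ps = 0

    def device(self, op_type):
        if self.ps_tasks > 0 and op_type in PS_OPS:
            d = "/job:ps/task:%d" % self.next_ps
            self.next_ps = (self.next_ps + 1) % self.ps_tasks  # round-robin
            return d
        return self.worker_device  # replicated ops stay on this worker

chooser = ToyReplicaDeviceChooser(ps_tasks=2, worker_device="/job:worker/task:0")
placements = [chooser.device(t) for t in ("VariableV2", "MatMul", "VariableV2")]
# → ["/job:ps/task:0", "/job:worker/task:0", "/job:ps/task:1"]
```

With this strategy, no hand-written tf.device() annotations are needed, and an Operation can never accidentally land on the ‘ps’ role.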

Performance problem when training images with MXNet

After running my MXNet application like this snippet:

I found out that the training speed was only 300 samples per second, and the GPU usage looked very strange:

About two days later, I noticed some messages reported by MXNet:

After changing my command to:

the training speed rose to 690 samples per second, and the GPU usage became much smoother, since MXNet could now use more CPU threads to decode images:

The CSE (Common Subexpression Elimination) problem when running a custom operation in Tensorflow

Recently, we created a new custom operation in Tensorflow:

It’s as simple as the example in Tensorflow’s documentation. But when we run this Op in a session:

It gets image_ids from the network only once, then reuses the result of that first ‘run’ forever, without ever calling the ‘Compute()’ function in the C++ code again!

It seems Tensorflow optimized the new Op away and never runs it twice. My colleague suggested working around the problem with tf.placeholder:

That looks a little tricky. The final solution is to add a flag in the C++ code that marks the new Op as stateful, so it is exempt from CSE (Common Subexpression Elimination):
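The caching behaviour is easier to see in miniature. CSE assumes an Op is a pure function of its inputs, so two identical Ops are merged and computed once; an Op with side effects (like reading from the network) must opt out, which is what the stateful flag accomplishes. Below is a toy pure-Python analogy of that assumption going wrong; ‘run_with_cse’ is only a stand-in, not Tensorflow’s actual optimization pass:

```python
import itertools

counter = itertools.count()

def fetch_image_ids():
    # Stands in for the custom Op: every real call should return fresh data.
    return next(counter)

cache = {}

def run_with_cse(op):
    # Toy "common subexpression elimination": identical ops are keyed by
    # name and computed only once -- fine for pure ops, wrong for this one.
    if op.__name__ not in cache:
        cache[op.__name__] = op()
    return cache[op.__name__]

first = run_with_cse(fetch_image_ids)    # computes: 0
second = run_with_cse(fetch_image_ids)   # cached: still 0, never re-run

# A "stateful" op bypasses the cache and gets fresh values each call.
stateful = [fetch_image_ids() for _ in range(2)]  # [1, 2]
```

This is exactly the symptom we saw: ‘Compute()’ ran once, and every later ‘session.run()’ was served from the merged node.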

Attachment of the ‘CMakeLists.txt’:

Compute gradients of different parts of a model in Tensorflow

In Tensorflow, we can use an Optimizer to train a model:

But sometimes a model needs to be split into two parts and trained separately, so we need to compute gradients and apply them in two steps:

Then how can we deliver gradients from the first part to the second part? This equation is the answer:

\frac{\partial Loss}{\partial W_{second\text{-}part}} = \frac{\partial Loss}{\partial IV} \cdot \frac{\partial IV}{\partial W_{second\text{-}part}}

Here ‘IV’ means ‘intermediate vector’: the interface vector between the first part and the second part, which belongs to both of them. W_{second-part} is the weights of the second part of the model. Therefore we can use tf.gradients() to connect the gradients of the two parts:
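The equation can be checked numerically with scalars. In the sketch below (my own toy example, not the article’s code), one part produces the intermediate value ‘iv = w * x’ and the other computes the Loss ‘(iv - y) ** 2’; the chained analytic gradient matches a finite-difference estimate of the full gradient:

```python
def forward(w, x, y):
    iv = w * x               # part producing the intermediate value IV
    loss = (iv - y) ** 2     # part computing the Loss from IV
    return iv, loss

def chained_grad(w, x, y):
    iv, _ = forward(w, x, y)
    d_loss_d_iv = 2.0 * (iv - y)     # dLoss/dIV (from the loss-side part)
    d_iv_d_w = x                     # dIV/dW (from the IV-side part)
    return d_loss_d_iv * d_iv_d_w    # chain rule connects the two parts

def numerical_grad(w, x, y, eps=1e-6):
    # Central finite difference of the whole Loss w.r.t. w.
    _, lp = forward(w + eps, x, y)
    _, lm = forward(w - eps, x, y)
    return (lp - lm) / (2.0 * eps)

analytic = chained_grad(3.0, 2.0, 1.0)     # 2*(6-1)*2 = 20.0
numeric = numerical_grad(3.0, 2.0, 1.0)    # ~20.0
```

In Tensorflow itself, the same hand-off is done by passing the upstream gradient into the ‘grad_ys’ argument of tf.gradients().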

Sharing Variables in Tensorflow

This article shows how to use sharing variables in Tensorflow. But I still had a question: do sharing variables have the same value? To answer this question, I wrote the code below:

Therefore, the “sharing variables” mechanism is made only for the convenience of writing short code to create multiple models. To share the same value between different variables, we still need the ‘assign’ operation.

Some tips about Tensorflow

Q: How to fix an error report like:

A: We can’t feed a value into a variable and optimize it at the same time (that’s why the problem only occurs when using Optimizers). Use ‘tf.assign()’ in the graph to give a value to a tf.Variable.

Q: How to get a tensor by name?

A: like this:

Q: How to get variable by name?


How to average gradients in Tensorflow

Sometimes we need to average an array of gradients in a deep learning model. Fortunately, Tensorflow divides models into fine-grained tensors and operations, so implementing gradient averaging with it is not difficult.

Let’s see the code from GitHub:

We should keep in mind that this code only builds a static graph (the ‘grads’ are references rather than values).

First, we expand the dimensions of each tensor (gradient) and concatenate them. Then we use reduce_mean() to do the actual averaging (which seems unintuitive at first glance).
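The same averaging logic can be written in plain Python to see what the graph computes: for each variable, collect that variable’s gradient from every tower and average element-wise. This is a toy re-implementation, with flat lists of floats standing in for tensors:

```python
def average_gradients(tower_grads):
    """tower_grads[i][j] is tower i's gradient for variable j
    (a flat list of floats here; a Tensor in the real code)."""
    averaged = []
    for grads_per_var in zip(*tower_grads):   # regroup gradients by variable
        n = len(grads_per_var)
        # Element-wise mean across towers -- the role played by
        # expand_dims + concat + reduce_mean in the graph version.
        averaged.append([sum(vals) / n for vals in zip(*grads_per_var)])
    return averaged

tower_grads = [
    [[1.0, 2.0], [10.0]],    # gradients computed on tower 0
    [[3.0, 4.0], [30.0]],    # gradients computed on tower 1
]
result = average_gradients(tower_grads)   # [[2.0, 3.0], [20.0]]
```

The graph version looks less intuitive only because the per-variable mean has to be expressed as tensor ops instead of a loop.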

A basic example of using Tensorflow to regress

In Deep Learning theory, even a network with a single hidden layer can approximate any continuous function. To verify this, I wrote the Tensorflow example below:

In this code, the network tries to regress a number from its own sine and cosine values.
On the first run, the loss didn’t change at all. After I changed the learning rate from 1e-3 to 1e-5, the loss slowly went down as expected. I think this is why some people call Deep Learning “Black Magic” in the Machine Learning area.
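The learning-rate sensitivity can be reproduced in miniature with plain gradient descent on f(w) = w², my own toy example rather than the network above: with too large a step the iterate just flips sign, so the loss literally never changes, while a smaller step converges:

```python
def descend(lr, steps=50, w=1.0):
    """Gradient descent on f(w) = w**2; the gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return abs(w)

# lr = 1.0: update is w -> -w, so |w| (and the loss) never changes.
stuck = descend(lr=1.0)       # stays at 1.0 forever
# lr = 0.1: update is w -> 0.8*w, shrinking toward the minimum.
converged = descend(lr=0.1)   # ~1.4e-5 after 50 steps
```

The same effect at the scale of a real network is much harder to diagnose, which is part of the “Black Magic”.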

Fix the Resnet-101 model in the SSD example of MXNET

SSD (Single Shot MultiBox Detector) is one of the fastest methods for the object-detection task (another detector, YOLO, is a little slower than SSD). In the source code of MXNET, there is an example SSD implementation. I tested it with different models: inceptionv3, resnet-50, resnet-101, etc., and found a weird phenomenon: the .params file generated by resnet-101 is smaller than the one generated by resnet-50.

Model        Size of .params file
resnet-50    119MB
resnet-101   69MB

Since a deeper network has more parameters, it seems suspicious that resnet-101 yields a smaller parameter file.

Reviewing the code of example/ssd/symbol/symbol_factory.py:

Why do resnet-50 and resnet-101 have the same ‘from_layers’? Let’s check the two models:

In resnet-50, the SSD uses two layers (shown by the red line) to extract features: one from the output of stage 3, the other from the output of stage 4. Resnet-101 should be the same (shown by the blue line), but the example incorrectly copies the config code of resnet-50. The correct ‘from_layers’ for resnet-101 is:

This looks like a bug, so I created a pull request to try to fix it.

“Eager Mode” in Tensorflow

Although Tensorflow was the most popular Deep Learning framework in 2016, Pytorch, a smaller new framework developed by FAIR (Facebook AI Research), became a dark horse this year. Pytorch supports dynamic graph computation, which means you can freely add or remove layers in your model at runtime. This lets developers and scientists build new models more rapidly.
To fight back against Pytorch, the Tensorflow team added a new mechanism named “Eager Mode”, with which we can also use dynamic graph computation. An example of “Eager Mode” looks like:

As shown above, unlike a traditional Tensorflow application that uses “Session.run()” to execute the whole graph, developers can inspect the values and gradients of variables in any layer at any step.

How does Tensorflow do it? Actually, the trick behind the API is not difficult. Take the most common Operation, ‘matmul’, as an example:

Let’s look into “gen_math_ops._mat_mul()”:

As we can see, in Graph Mode it goes to “_apply_op_helper()” to build the graph (without running it), while in Eager Mode it executes the Operation directly.
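That branch can be mimicked in a few lines of plain Python: the same op entry point either records a node into a graph (Graph Mode) or runs the kernel immediately (Eager Mode). The names below are illustrative stand-ins, not Tensorflow’s real dispatch code:

```python
class Context:
    """Toy execution context: eager or graph-building."""
    def __init__(self, eager):
        self.eager = eager
        self.graph = []          # nodes recorded in graph mode

def mat_mul(ctx, a, b):
    if ctx.eager:
        # Eager mode: run the kernel right away and return a value.
        return [[sum(x * y for x, y in zip(row, col))
                 for col in zip(*b)] for row in a]
    # Graph mode: only record the node; nothing is computed yet.
    ctx.graph.append(("MatMul", a, b))
    return ("MatMul", len(ctx.graph) - 1)   # a handle, not a value

eager_ctx = Context(eager=True)
value = mat_mul(eager_ctx, [[1, 2]], [[3], [4]])   # [[11]] immediately

graph_ctx = Context(eager=False)
node = mat_mul(graph_ctx, [[1, 2]], [[3], [4]])    # just a recorded node
```

The real API keeps a similar mode check behind every generated op, which is why the same user-facing call works in both worlds.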