Experiment for distributed TensorFlow

Here is my experimental code for distributed TensorFlow, adapted from the official example.
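
A minimal sketch of what that experiment looks like (a reconstruction against the TF 1.x API; the addresses and flag names here are placeholders, not the original values):

```python
# Sketch (assumed reconstruction, TF 1.x API): one parameter server holds
# a Variable; worker task 0 computes tf.add() on it and pushes the result
# back with tf.assign(). Host:port addresses are placeholders.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--job_name', type=str, default='worker')
parser.add_argument('--task_index', type=int, default=0)
args = parser.parse_args()

cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223'],
})
server = tf.train.Server(cluster, job_name=args.job_name,
                         task_index=args.task_index)

if args.job_name == 'ps':
    server.join()
else:
    with tf.device('/job:ps/task:0'):
        var = tf.Variable(0.0, name='var')   # lives on the parameter server
    with tf.device('/job:worker/task:0'):
        added = tf.add(var, 1.0)             # runs on worker task 0
        update = tf.assign(var, added)       # pushes the new value back to ps
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(3):
            print(sess.run(update))          # 1.0, 2.0, 3.0
```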

The important thing is that we need to use tf.assign() to push a Variable back to the parameter server. The operation ‘tf.add’ was meant to run on task 0 of the worker in this example. But if we deploy a more complicated application across many tasks, things become weird: a pipeline operation sometimes even runs on the ‘ps’ role! The official solution to this problem is ‘tf.train.replica_device_setter()’, which automatically deploys Variables to the parameter servers and Operations (as many replicas) to the workers. What does ‘tf.train.replica_device_setter()’ actually do? Let’s look at the backbone code of its implementation:
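
What follows is a simplified paraphrase of the relevant code in tensorflow/python/training/device_setter.py (TF 1.x); it keeps only the backbone of the logic and is not the verbatim source:

```python
# Simplified paraphrase of tensorflow/python/training/device_setter.py
# (TF 1.x); only the backbone of the logic, not the verbatim source.

# Op types treated as parameter-server ops by default.
STANDARD_PS_OPS = ("Variable", "VariableV2", "VarHandleOp")

class _ReplicaDeviceChooser(object):
    """Chooses a device for each op in a replicated training setup."""

    def __init__(self, ps_tasks, ps_device, worker_device, ps_ops):
        self._ps_tasks = ps_tasks
        self._ps_device = ps_device
        self._worker_device = worker_device
        self._ps_ops = ps_ops
        self._next_task = 0

    def device_function(self, op):
        # Variable-like ops go to a ps task (round-robin over ps_tasks);
        # every other op is replicated onto the worker device.
        if self._ps_tasks and self._ps_device and op.type in self._ps_ops:
            task = self._next_task % self._ps_tasks
            self._next_task += 1
            return '%s/task:%d' % (self._ps_device, task)
        return self._worker_device

def replica_device_setter(ps_tasks=0, ps_device='/job:ps',
                          worker_device='/job:worker', ps_ops=None):
    if ps_ops is None:
        ps_ops = list(STANDARD_PS_OPS)
    chooser = _ReplicaDeviceChooser(ps_tasks, ps_device, worker_device, ps_ops)
    return chooser.device_function
```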

All the Variables will be counted as ‘ps_ops’, and the deployment strategy for Operations is replication, hence the name ‘_ReplicaDeviceChooser’.

Every ‘op’ whose type is in ‘self._ps_ops’ will be placed on the ‘ps_device’.
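
A small usage sketch (TF 1.x) makes the resulting placement visible; the cluster addresses are placeholders, and nothing needs to be running since only the graph is constructed:

```python
# Placement demo (TF 1.x): the Variable lands on /job:ps, the add op on
# the worker. Graph construction only, so no cluster has to be up.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})

with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device='/job:worker/task:0')):
    w = tf.Variable(tf.zeros([10]), name='w')
    y = tf.add(w, 1.0, name='y')

print(w.device)  # /job:ps/task:0
print(y.device)  # /job:worker/task:0
```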

Performance problem when training images on MXNet

After running my MXNet application with a snippet like this:
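
(A hypothetical reconstruction, since the exact command is not shown here; the flags follow MXNet’s example/image-classification scripts, and the network, paths, and batch size are placeholders.)

```bash
# Hypothetical reconstruction; network, paths, and batch size are placeholders.
python train_imagenet.py --network resnet --num-layers 50 \
    --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec --data-val ~/data/val.rec
```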

I found that the training speed was only 300 samples per second, and the GPU utilization looked very strange.

About two days later, I noticed that there were some messages reported by MXNet:
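
As far as I can recall, it was the record-file iterator reporting how few threads it was using for decoding, something along these lines (paraphrased from memory, not a verbatim log; the path is a placeholder):

```
ImageRecordIOParser2: ~/data/train.rec, use 4 threads for decoding..
```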

After changing my command to:
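
(Again a hypothetical reconstruction; the essential change is ‘--data-nthreads’, which controls the number of CPU threads used for image decoding and defaults to 4 in MXNet’s example scripts.)

```bash
# Hypothetical reconstruction; the relevant change is --data-nthreads.
python train_imagenet.py --network resnet --num-layers 50 \
    --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec --data-val ~/data/val.rec \
    --data-nthreads 16
```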

the speed of training rose to 690 samples per second, and the GPU utilization became much smoother, since MXNet could now use more CPUs to decode images.

The problem of the ‘bool’ type in Python 2.7’s argparse

While studying the distributed TensorFlow example, I wrote this snippet:
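
(Roughly like the following; a reconstruction following the pattern of the distributed TensorFlow examples, with the help text assumed.)

```python
# Reconstruction of the snippet (assumed), following the pattern used in
# the distributed TensorFlow examples.
import argparse

parser = argparse.ArgumentParser()
parser.register('type', 'bool', lambda v: v.lower() == 'true')
parser.add_argument(
    '--training',
    type=bool,        # note: the builtin bool, not the registered 'bool'
    default=True,
    help='Train or inference')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS)
```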

The “parser.register()” call is the TensorFlow way of registering a ‘bool’ type for the parser. But it doesn’t work! In my shell, I ran commands such as:
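
(Hypothetical invocations; the script name test.py is a placeholder.)

```bash
python test.py --training False
python test.py --training false
python test.py --training 0
```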

They all print out “Namespace(training=True)”, which means the code above cannot change the value of the argument ‘training’ (my Python version is 2.7.5).

The correct code should be:
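
That is, pass the registered name ‘bool’ (a string) as the type, so argparse looks up the converter registered earlier instead of calling the builtin bool(), which returns True for every non-empty string. A sketch of the fix, with the same placeholder flag as above:

```python
# Corrected version: type='bool' (a string) makes argparse use the
# converter registered below, instead of the builtin bool(), which
# returns True for any non-empty string such as "False".
import argparse

parser = argparse.ArgumentParser()
parser.register('type', 'bool', lambda v: v.lower() == 'true')
parser.add_argument(
    '--training',
    type='bool',
    default=True,
    help='Train or inference')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS)   # --training False now yields Namespace(training=False)
```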