Problem with using slim.batch_norm() in TensorFlow

After using resnet_v2_50 from tensorflow/models, I found that the inference results were totally incorrect, even though the training accuracy looked very good.
First, I suspected the preprocessing of the samples:
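A minimal sketch of that kind of preprocessing (the helper name and shapes are illustrative, not my original code):

```python
import tensorflow as tf

PAD = 10  # padding in pixels per side; originally I used a much larger value

def preprocess_for_train(image, out_height, out_width):
    # Pad the image on every side, then randomly crop it back down
    # to the network's input size.
    padded = tf.image.resize_image_with_crop_or_pad(
        image, out_height + 2 * PAD, out_width + 2 * PAD)
    return tf.random_crop(padded, [out_height, out_width, 3])
```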

Indeed, I had padded the images out to too large a size. But even after changing the padding size to 10, the inference accuracy was still incorrect.
Then I checked my data-importing code:
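A sketch of the slim-style input pipeline (names are placeholders):

```python
import tensorflow as tf
slim = tf.contrib.slim

# `dataset` is a placeholder for the actual slim.dataset.Dataset;
# `preprocess_for_train` is the sketch above.
provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
image, label = provider.get(['image', 'label'])
image = preprocess_for_train(image, 224, 224)
images, labels = tf.train.batch([image, label], batch_size=32)
```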

and changed my inference code to follow the same data-importing routine. But the problem still existed.

About one week passed. Finally, I found this issue on GitHub, and it explained everything: the cause was slim.batch_norm(). I added this code to my program (learned from slim.create_train_op()):
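In essence it is the standard UPDATE_OPS dependency that slim.create_train_op() sets up internally; a sketch, where `optimizer` and `total_loss` stand for my actual objects:

```python
# slim.batch_norm() adds the ops that update its moving mean/variance
# to the UPDATE_OPS collection; they must run on every training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(total_loss)
```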

The inference accuracy was still low. With no other choice, I removed all the slim.batch_norm() calls from resnet_v2.py, and this time the inference accuracy became the same as the training accuracy.
The problem looks partly solved, but I still need to find out why slim.batch_norm() doesn't work well at inference time…

Experiment with distributed TensorFlow

Here is my experimental code for distributed TensorFlow, which I learned from the example:
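In sketch form (addresses and values are illustrative): a Variable lives on the parameter server, worker task 0 computes with it, and tf.assign() pushes the result back.

```python
import tensorflow as tf

# Illustrative two-process cluster; real host:port pairs go here.
cluster = tf.train.ClusterSpec({'ps': ['localhost:2222'],
                                'worker': ['localhost:2223']})

with tf.device('/job:ps/task:0'):
    var = tf.Variable(0.0)         # stored on the parameter server

with tf.device('/job:worker/task:0'):
    result = tf.add(var, 1.0)      # computed on worker task 0
    push = tf.assign(var, result)  # pushes the new value back to ps

# The ps server must be started in a separate process for this to run.
server = tf.train.Server(cluster, job_name='worker', task_index=0)
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(push))  # 1.0
```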

The important thing is that we need to use tf.assign() to push a Variable back to the parameter server. In this example, the 'tf.add' operation was meant to run on task 0 of the worker job. But if we deploy a more complicated application with many tasks, things become weird: a pipeline operation sometimes even runs on the 'ps' role! The official solution to this problem is tf.train.replica_device_setter(), which automatically assigns Variables to parameter servers and Operations (in many replicas) to the workers. What does tf.train.replica_device_setter() actually do? Let's look at the backbone of its implementation:
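Here is a simplified paraphrase (the real code lives in tensorflow/python/training/device_setter.py and also handles device merging):

```python
# Simplified paraphrase of _ReplicaDeviceChooser, not the verbatim source.
class _ReplicaDeviceChooser(object):
    def __init__(self, ps_tasks, ps_device, worker_device, ps_ops, ps_strategy):
        self._ps_tasks = ps_tasks
        self._ps_device = ps_device          # e.g. '/job:ps'
        self._worker_device = worker_device  # e.g. '/job:worker/task:0'
        # ps_ops defaults to variable op types such as 'Variable'/'VariableV2'.
        self._ps_ops = ps_ops
        self._ps_strategy = ps_strategy      # picks a ps task (round-robin by default)

    def device_function(self, op):
        # Variable-like ops are pinned to a parameter server ...
        if self._ps_tasks and self._ps_device and op.node_def.op in self._ps_ops:
            return '%s/task:%d' % (self._ps_device, self._ps_strategy(op))
        # ... and every other op stays on the worker replica.
        return self._worker_device
```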

All Variables are counted as 'ps_ops', and the deployment strategy for Operations is replication, which is why the class is called '_ReplicaDeviceChooser'.

Every op whose type is in 'self._ps_ops' will be placed on 'ps_device'.
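In user code, the returned device function is passed straight to tf.device(); in this sketch, 'task_index' and 'cluster' are placeholders:

```python
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:%d' % task_index,
        cluster=cluster)):
    # Variables created in this scope land on the ps job automatically;
    # the other ops stay on this worker replica.
    inputs = tf.placeholder(tf.float32, [None, 784])
    weights = tf.Variable(tf.zeros([784, 10]))
    logits = tf.matmul(inputs, weights)
```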

Performance problem when training on images with MXNet

After running my MXNet application with a command like this (a reconstruction for illustration; the script name and arguments are placeholders):
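```bash
python train_imagenet.py --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec
```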

I found that the training speed was only 300 samples per second, and the GPU usage looked very strange:

About two days later, I finally noticed some messages reported by MXNet:

After changing my command to use more threads for image decoding (in MXNet's image-classification example scripts this is the --data-nthreads flag; the rest of the command below is illustrative):
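```bash
python train_imagenet.py --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec --data-nthreads 16
```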

the training speed rose to 690 samples per second, and the GPU usage became much smoother, since MXNet could now use more CPUs to decode images:

The problem with the 'bool' type in argparse under Python 2.7

To learn from the distributed TensorFlow example, I wrote a snippet along these lines (reconstructed for illustration):
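```python
import argparse

parser = argparse.ArgumentParser()
# TensorFlow's examples register a converter under the *string* name 'bool':
parser.register('type', 'bool', lambda v: v.lower() == 'true')
# Pitfall: the registered converter is only used when type is the string
# 'bool'. With the builtin bool, argparse calls bool() on the raw argument
# string, and bool() of any non-empty string is True.
parser.add_argument('--training', type=bool, default=True)
print(parser.parse_args())
```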

The parser.register() call is the TensorFlow way of registering a 'bool' type for the parser. But it didn't work! In my shell, I ran commands like these ('test.py' is a stand-in name for the snippet above):
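```bash
python test.py --training False
python test.py --training=False
python test.py --training false
```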

They all printed out "Namespace(training=True)", which means the code above couldn't change the value of the 'training' argument (my Python version is 2.7.5).

The correct code should pass a real converter function as type, for example (a sketch; my actual fix may have differed slightly):
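```python
import argparse

def str2bool(v):
    # Map the command-line string to an actual boolean.
    return v.lower() in ('true', 't', 'yes', '1')

parser = argparse.ArgumentParser()
parser.add_argument('--training', type=str2bool, default=True)
print(parser.parse_args())
```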