Experiment for distributed TensorFlow

Here is my experimental code for distributed TensorFlow, adapted from the official example.
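
A minimal sketch of what that experiment looks like (a reconstruction against the TF 1.x API; the addresses and flag names here are placeholders, not the original values):

```python
# Sketch (assumed reconstruction, TF 1.x API): one parameter server holds
# a Variable; worker task 0 computes tf.add() on it and pushes the result
# back with tf.assign(). Host:port addresses are placeholders.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--job_name', type=str, default='worker')
parser.add_argument('--task_index', type=int, default=0)
args = parser.parse_args()

cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223'],
})
server = tf.train.Server(cluster, job_name=args.job_name,
                         task_index=args.task_index)

if args.job_name == 'ps':
    server.join()
else:
    with tf.device('/job:ps/task:0'):
        var = tf.Variable(0.0, name='var')   # lives on the parameter server
    with tf.device('/job:worker/task:0'):
        added = tf.add(var, 1.0)             # runs on worker task 0
        update = tf.assign(var, added)       # pushes the new value back to ps
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(3):
            print(sess.run(update))          # 1.0, 2.0, 3.0
```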

The important thing is that we need to use tf.assign() to push a Variable back to the parameter server. The operation ‘tf.add’ was meant to run on task 0 of the worker in this example. But if we deploy a more complicated application across many tasks, things become weird: a pipeline operation sometimes even runs on the ‘ps’ role! The official solution to this problem is ‘tf.train.replica_device_setter()’, which automatically deploys Variables to the parameter servers and Operations (as many replicas) to the workers. What does ‘tf.train.replica_device_setter()’ actually do? Let’s look at the backbone code of its implementation:
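
What follows is a simplified paraphrase of the relevant code in tensorflow/python/training/device_setter.py (TF 1.x); it keeps only the backbone of the logic and is not the verbatim source:

```python
# Simplified paraphrase of tensorflow/python/training/device_setter.py
# (TF 1.x); only the backbone of the logic, not the verbatim source.

# Op types treated as parameter-server ops by default.
STANDARD_PS_OPS = ("Variable", "VariableV2", "VarHandleOp")

class _ReplicaDeviceChooser(object):
    """Chooses a device for each op in a replicated training setup."""

    def __init__(self, ps_tasks, ps_device, worker_device, ps_ops):
        self._ps_tasks = ps_tasks
        self._ps_device = ps_device
        self._worker_device = worker_device
        self._ps_ops = ps_ops
        self._next_task = 0

    def device_function(self, op):
        # Variable-like ops go to a ps task (round-robin over ps_tasks);
        # every other op is replicated onto the worker device.
        if self._ps_tasks and self._ps_device and op.type in self._ps_ops:
            task = self._next_task % self._ps_tasks
            self._next_task += 1
            return '%s/task:%d' % (self._ps_device, task)
        return self._worker_device

def replica_device_setter(ps_tasks=0, ps_device='/job:ps',
                          worker_device='/job:worker', ps_ops=None):
    if ps_ops is None:
        ps_ops = list(STANDARD_PS_OPS)
    chooser = _ReplicaDeviceChooser(ps_tasks, ps_device, worker_device, ps_ops)
    return chooser.device_function
```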

All the Variables will be counted as ‘ps_ops’, and the deployment strategy for Operations is replication, hence the name ‘_ReplicaDeviceChooser’.

Every ‘op’ whose type is in ‘self._ps_ops’ will be placed on the ‘ps_device’.
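
A small usage sketch (TF 1.x) makes the resulting placement visible; the cluster addresses are placeholders, and nothing needs to be running since only the graph is constructed:

```python
# Placement demo (TF 1.x): the Variable lands on /job:ps, the add op on
# the worker. Graph construction only, so no cluster has to be up.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224'],
})

with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device='/job:worker/task:0')):
    w = tf.Variable(tf.zeros([10]), name='w')
    y = tf.add(w, 1.0, name='y')

print(w.device)  # /job:ps/task:0
print(y.device)  # /job:worker/task:0
```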

Performance problem when training images on MXNet

After running my MXNet application with a snippet like this:
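
(A hypothetical reconstruction, since the exact command is not shown here; the flags follow MXNet’s example/image-classification scripts, and the network, paths, and batch size are placeholders.)

```bash
# Hypothetical reconstruction; network, paths, and batch size are placeholders.
python train_imagenet.py --network resnet --num-layers 50 \
    --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec --data-val ~/data/val.rec
```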

I found that the training speed was only 300 samples per second, and the GPU utilization looked very strange.

About two days later, I noticed that there were some messages reported by MXNet:
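
As far as I can recall, it was the record-file iterator reporting how few threads it was using for decoding, something along these lines (paraphrased from memory, not a verbatim log; the path is a placeholder):

```
ImageRecordIOParser2: ~/data/train.rec, use 4 threads for decoding..
```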

After changing my command to:
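
(Again a hypothetical reconstruction; the essential change is ‘--data-nthreads’, which controls the number of CPU threads used for image decoding and defaults to 4 in MXNet’s example scripts.)

```bash
# Hypothetical reconstruction; the relevant change is --data-nthreads.
python train_imagenet.py --network resnet --num-layers 50 \
    --gpus 0,1,2,3 --batch-size 256 \
    --data-train ~/data/train.rec --data-val ~/data/val.rec \
    --data-nthreads 16
```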

the speed of training rose to 690 samples per second, and the GPU utilization became much smoother, since MXNet could now use more CPUs to decode images.

The problem of the ‘bool’ type in Python 2.7’s argparse

While studying the distributed TensorFlow example, I wrote this snippet:
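
(Roughly like the following; a reconstruction following the pattern of the distributed TensorFlow examples, with the help text assumed.)

```python
# Reconstruction of the snippet (assumed), following the pattern used in
# the distributed TensorFlow examples.
import argparse

parser = argparse.ArgumentParser()
parser.register('type', 'bool', lambda v: v.lower() == 'true')
parser.add_argument(
    '--training',
    type=bool,        # note: the builtin bool, not the registered 'bool'
    default=True,
    help='Train or inference')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS)
```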

The “parser.register()” call is the TensorFlow way of registering a ‘bool’ type for the parser. But it doesn’t work! In my shell, I ran commands such as:
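
(Hypothetical invocations; the script name test.py is a placeholder.)

```bash
python test.py --training False
python test.py --training false
python test.py --training 0
```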

They all print out “Namespace(training=True)”, which means the code above cannot change the value of the argument ‘training’ (my Python version is 2.7.5).

The correct code should be:
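
That is, pass the registered name ‘bool’ (a string) as the type, so argparse looks up the converter registered earlier instead of calling the builtin bool(), which returns True for every non-empty string. A sketch of the fix, with the same placeholder flag as above:

```python
# Corrected version: type='bool' (a string) makes argparse use the
# converter registered below, instead of the builtin bool(), which
# returns True for any non-empty string such as "False".
import argparse

parser = argparse.ArgumentParser()
parser.register('type', 'bool', lambda v: v.lower() == 'true')
parser.add_argument(
    '--training',
    type='bool',
    default=True,
    help='Train or inference')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS)   # --training False now yields Namespace(training=False)
```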