Monthly Archives: December 2017

Some tips about TensorFlow

Q: How do I fix an error like this?

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node
Adam/update_embeddings/AssignSub was passed float from _recv_embeddings_0:0
incompatible with expected float_ref

A: We can’t feed a value into a variable and optimize it at the same time, so the problem only occurs when an optimizer is used. Instead, use ‘tf.assign()’ in the graph to give the tf.Variable its value.
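
As a minimal sketch of that fix: the variable is loaded through an assign op that takes its value from a placeholder, and only afterwards is the training op run. The names embeddings, new_value and load_embeddings are made up for illustration, and the import assumes a TensorFlow build that ships the tf.compat.v1 module (on plain TensorFlow 1.x, ‘import tensorflow as tf’ should behave the same way).

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: a build that provides compat.v1

tf.disable_v2_behavior()

# Hypothetical variable that the optimizer will update.
embeddings = tf.Variable(tf.zeros([4, 3]), name="embeddings")

# Feed the initial value through a placeholder and copy it with tf.assign(),
# instead of putting the variable itself into feed_dict.
new_value = tf.placeholder(tf.float32, shape=[4, 3])
load_embeddings = tf.assign(embeddings, new_value)

loss = tf.reduce_sum(tf.square(embeddings - 1.0))
train = tf.train.AdamOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Step 1: load the value into the variable.
    sess.run(load_embeddings, {new_value: np.full((4, 3), 2.0, dtype=np.float32)})
    loaded = sess.run(embeddings)
    # Step 2: optimize; nothing is fed into the variable here.
    sess.run(train)
    updated = sess.run(embeddings)
```
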
Q: How do I get a tensor by name?
A: Like this:

tensor = tf.get_default_graph().get_tensor_by_name("example:0")

Q: How do I get a variable by name?
A:

my_var = [v for v in tf.global_variables() if v.name == "myvar_23:0"][0]

How to average gradients in TensorFlow

Sometimes we need to average a list of gradients in a deep learning model, for example when training on several GPUs. Fortunately, TensorFlow breaks models down into fine-grained tensors and operations, so averaging gradients is not difficult to implement.
Let’s look at the code from GitHub:

    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()
            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
            # Calculate the gradients for the batch of data on this CIFAR tower.
            grads = opt.compute_gradients(loss)
            # Keep track of the gradients across all towers.
            tower_grads.append(grads)
    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)
......
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

Keep in mind that this code only builds a static graph: ‘grads’ holds references to tensors in the graph rather than values.

def average_gradients(tower_grads):
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # Note that each grad_and_vars looks like the following:
    #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
    grads = []
    for g, _ in grad_and_vars:
      # Add 0 dimension to the gradients to represent the tower.
      expanded_g = tf.expand_dims(g, 0)
      # Append on a 'tower' dimension which we will average over below.
      grads.append(expanded_g)
    # Average over the 'tower' dimension.
    grad = tf.concat(axis=0, values=grads)
    grad = tf.reduce_mean(grad, 0)
    # Keep in mind that the Variables are redundant because they are shared
    # across towers. So .. we will just return the first tower's pointer to
    # the Variable.
    v = grad_and_vars[0][1]
    grad_and_var = (grad, v)
    average_grads.append(grad_and_var)
  return average_grads

First, we expand the dimensions of each gradient tensor and concatenate the results along the new axis. Then reduce_mean() does the actual averaging over that axis, which is not very intuitive at first glance.
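
To make the shapes concrete, here is a small sketch of the same expand, concatenate and mean sequence in NumPy instead of TensorFlow, so the values can be printed directly. The function name average_gradients_np, the two towers and the string "variables" are made up for illustration.

```python
import numpy as np

def average_gradients_np(tower_grads):
    """Average each variable's gradient over all towers (NumPy stand-in)."""
    averaged = []
    # zip(*tower_grads) regroups the lists so that each item holds the
    # (gradient, variable) pairs of ONE variable from EVERY tower.
    for grad_and_vars in zip(*tower_grads):
        # Give each gradient a leading 'tower' axis of length 1 ...
        expanded = [np.expand_dims(g, 0) for g, _ in grad_and_vars]
        # ... join them along that axis: shape (num_towers,) + grad.shape ...
        stacked = np.concatenate(expanded, axis=0)
        # ... and average over the towers.
        mean_grad = stacked.mean(axis=0)
        # Variables are shared, so the first tower's entry is enough.
        averaged.append((mean_grad, grad_and_vars[0][1]))
    return averaged

# Two hypothetical towers, each with gradients for the variables "w" and "b".
tower0 = [(np.array([[1.0, 2.0], [3.0, 4.0]]), "w"), (np.array([1.0, 1.0]), "b")]
tower1 = [(np.array([[3.0, 4.0], [5.0, 6.0]]), "w"), (np.array([3.0, 5.0]), "b")]

result = average_gradients_np([tower0, tower1])
for grad, name in result:
    print(name, grad.tolist())
# w [[2.0, 3.0], [4.0, 5.0]]
# b [2.0, 3.0]
```
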

A basic example of using TensorFlow to regress

In deep learning theory, even a network with a single hidden layer can approximate any continuous function, given enough hidden units. To try this out, I wrote a TensorFlow example as below:

import tensorflow as tf
hidden_nodes = 1024
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=1.0, mean=1.0)
  return tf.Variable(initial)
def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.01, shape=shape)
  return tf.Variable(initial)
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    a = tf.reshape(tf.tanh(x), [1, -1])
    b = tf.reshape(tf.square(x), [1, -1])
    basic = tf.concat([a, b], 0)
    with tf.name_scope('fc1'):
      W_fc1 = weight_variable([hidden_nodes, 2])
      b_fc1 = bias_variable([1])
      linear_model = tf.nn.relu(tf.matmul(W_fc1, basic) + b_fc1)
# loss
loss = tf.reduce_sum(tf.abs(linear_model - y)) # sum of the absolute errors
# optimizer
optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)
# training data
x_train = range(0, 10)
y_train = range(0, 10)
init = tf.global_variables_initializer()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(init) # initialize the variables
for i in range(3000):
  sess.run(train, {x: x_train, y: y_train})
  # evaluate the training loss
  curr_basic, curr_w, curr_a, curr_b, curr_loss = sess.run([basic, W_fc1, a, b, loss], {x: x_train, y: y_train})
  print("loss: %s" % (curr_loss))

This code tries to regress a number from two features computed from it: its tanh value and its square.
On the first run, the loss didn’t change at all. After I changed the learning rate from 1e-3 to 1e-5, the loss slowly went down as expected. I think this is why some people call deep learning the “black magic” of machine learning.
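
How much the step size matters can be seen even without a neural network. The sketch below is not the model above; it is a toy one-weight least-squares problem on the same training data (x and y both 0..9), written in plain Python, with learning rates picked only for illustration. Here the larger rate makes the loss blow up rather than stay flat, but the lesson is the same: a rate that is too large for the problem prevents any progress.

```python
def final_loss(learning_rate, steps=20):
    """Run plain gradient descent on mean((w*x - y)^2); return the final loss."""
    xs = list(range(10))
    ys = list(range(10))  # the target is y = x, so the best weight is w = 1
    n = len(xs)
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

initial = final_loss(0.0, steps=0)  # loss at w = 0, before any update
too_large = final_loss(1e-1)        # each step overshoots, the loss grows
small = final_loss(1e-3)            # each step is small enough, the loss shrinks
print("initial: %.3g  lr=1e-1: %.3g  lr=1e-3: %.3g" % (initial, too_large, small))
```
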

Fix Resnet-101 model in example of MXNet

SSD (Single Shot MultiBox Detector) is one of the fastest methods for object detection (another detector, YOLO, is a little slower than SSD). The MXNet source code contains an example SSD implementation. I tested it with different models (inceptionv3, resnet-50, resnet-101, etc.) and found a weird phenomenon: the .params file generated for resnet-101 is smaller than the one for resnet-50.

Model	Size of .params file
resnet-50	119MB
resnet-101	69MB

Since a deeper network has more parameters, a smaller parameter file for resnet-101 seems suspicious.
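
A rough back-of-the-envelope check of that expectation, assuming 4-byte float32 parameters. The parameter counts are the approximate figures commonly quoted for the ImageNet classification versions of the two networks, not for the SSD networks built by the example, so only the ordering matters here, not the exact sizes.

```python
BYTES_PER_FLOAT32 = 4

# Assumption: approximate parameter counts of the ImageNet classifiers,
# NOT of the SSD networks produced by the MXNet example.
approx_param_counts = {"resnet-50": 25.6e6, "resnet-101": 44.5e6}

def estimated_size_mb(num_params):
    """Size of the raw float32 parameters in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_FLOAT32 / 1e6

sizes = {name: estimated_size_mb(n) for name, n in approx_param_counts.items()}
for name, size in sizes.items():
    print("%s: about %.0f MB" % (name, size))
```
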
Reviewing the code of example/ssd/symbol/symbol_factory.py:

    elif network == 'resnet50':
        num_layers = 50
        image_shape = '3,224,224'  # resnet require it as shape check
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()
    elif network == 'resnet101':
        num_layers = 101
        image_shape = '3,224,224'
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()

Why do resnet-50 and resnet-101 have the same ‘from_layers’? Let’s check these two models:

In resnet-50, SSD uses two layers (shown by the red line) to extract features: one from the output of stage 3, the other from the output of stage 4. For resnet-101 it should be the same (shown by the blue line), but the config was incorrectly copied from resnet-50. The correct ‘from_layers’ for resnet-101 is:

from_layers = ['_plus29', '_plus32', '', '', '', '']
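
These indices can be cross-checked by counting residual units. The sketch below assumes that every residual unit contributes exactly one ‘_plusN’ node, numbered from 0 in build order; that naming rule is an assumption, though it agrees with the ‘_plus12’ and ‘_plus15’ names used for resnet-50 above. The helper ssd_from_layers is hypothetical and not part of the MXNet example.

```python
# Residual units per stage in the standard ResNet-50 / ResNet-101 layouts.
UNITS_PER_STAGE = {
    "resnet50": [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
}

def ssd_from_layers(network):
    """Return the '_plusN' names at the end of stage 3 and stage 4."""
    units = UNITS_PER_STAGE[network]
    # Assumption: one '_plusN' node per residual unit, counted from 0.
    end_stage3 = sum(units[:3]) - 1
    end_stage4 = sum(units[:4]) - 1
    return ["_plus%d" % end_stage3, "_plus%d" % end_stage4]

print(ssd_from_layers("resnet50"))   # ['_plus12', '_plus15']
print(ssd_from_layers("resnet101"))  # ['_plus29', '_plus32']
```
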

This looks like a bug, so I created a pull request to fix it.

Robin on Linux
