Monthly Archives: July 2018

Do tf.random_crop() operation on GPU

When I run code like:

with tf.device('/GPU:0'):
  images = tf.random_crop(images, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS])
...

it reports:

Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

Looks operation tf.random_crop() doen’t have CUDA kernel implementation. Therefore I need to write it myself. The solution is surprisingly simple: write a function to do random_crop on one image by using tf.random_uniform() and tf.slice(), and then use tf.map_fn() to apply it on multi-images.

def my_random_crop(value, size):
    shape = tf.shape(value)
    size = tf.convert_to_tensor(size, dtype = tf.int32)
    limit = shape - size + 1
    offset = tf.random_uniform(tf.shape(shape), dtype = size.dtype, maxval = size.dtype.max) % limit
    return tf.slice(value, offset, size)
...
images = tf.map_fn(lambda img: my_random_crop(img, [IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS]), images)

It can run on GPU now.

Regularization loss in ‘slim’ library of Tensorflow

My python code using slim library to train classification model in Tensorflow:

    with tf.contrib.slim.arg_scope(mobilenet_v2.training_scope(weight_decay = 0.001)):
      logits, _ = mobilenet_v2.mobilenet(images, NUM_CLASSES)
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    global_step = tf.contrib.framework.get_or_create_global_step()
    train_op = tf.contrib.slim.learning.create_train_op(cross_entropy, opt, global_step = global_step)
...
    sess.run(train_op)

It works fine. However, no matter what value the ‘weight_decay’ is, the training accuracy of the model could reach higher than 90% easily. It seems ‘weight_decay’ just doesn’t work.
In order to find out the reason, I reviewed the code of Tensorflow for ‘tf.losses.sparse_softmax_cross_entropy()’:

# tensorflow/python/ops/losses/losses_impl.py
@tf_export("losses.sparse_softmax_cross_entropy")
def sparse_softmax_cross_entropy(
    labels, logits, weights=1.0, scope=None,
    loss_collection=ops.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS):
...
  with ops.name_scope(scope, "sparse_softmax_cross_entropy_loss",
                      (logits, labels, weights)) as scope:
    # As documented above in Args, labels contain class IDs and logits contains
    # 1 probability per class ID, so we expect rank(logits) - rank(labels) == 1;
    # therefore, expected_rank_diff=1.
    labels, logits, weights = _remove_squeezable_dimensions(
        labels, logits, weights, expected_rank_diff=1)
    losses = nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                         logits=logits,
                                                         name="xentropy")
    return compute_weighted_loss(
        losses, weights, scope, loss_collection, reduction=reduction)

The ‘losses.sparse_softmax_cross_entropy()’ simply call ‘tf.nn.sparse_softmax_cross_entropy()’. Then let’s look into the implementation of ‘compute_weighted_loss()’:

# tensorflow/python/ops/losses/losses_impl.py
@tf_export("losses.compute_weighted_loss")
def compute_weighted_loss(
    losses, weights=1.0, scope=None, loss_collection=ops.GraphKeys.LOSSES,
    reduction=Reduction.SUM_BY_NONZERO_WEIGHTS):
...
      loss = math_ops.cast(loss, input_dtype)
      util.add_loss(loss, loss_collection)
      return loss
What the secret in 'util.add_loss()'?
# tensorflow/python/ops/losses/util.py
@tf_export("losses.add_loss")
def add_loss(loss, loss_collection=ops.GraphKeys.LOSSES):
...
  if loss_collection:
    ops.add_to_collection(loss_collection, loss)

The losses of 'losses.sparse_softmax_cross_entropy()' will be added into collection of 'GraphKeys.LOSSES'. Then where dose the weight of parameters go ? Will they be added into same collection ? Let's check. All the layer written by library of 'tf.layers' or 'tf.contrib.slim' are inherited from 'class Layer' and will call 'add_loss()' when this layer call 'add_variable()'. Let's check 'add_loss()' of base class 'Layer':
@tf_export('layers.Layer')
class Layer(checkpointable.CheckpointableBase):
...
    def add_loss(self, losses, inputs=None):
        ...
        _add_elements_to_collection(losses, ops.GraphKeys.REGULARIZATION_LOSSES)

It's weird. The loss from weight of variable has not been added into 'GraphKeys.LOSSES', but 'GraphKeys.REGULARIZATION_LOSSES'. Then how could we get all the losses at training stage ? After grep 'REGULARIZATION_LOSSES' in whole codes of Tensorflow, it comes up with the 'get_total_loss()':
# tensorflow/python/ops/losses/util.py
@tf_export("losses.get_total_loss")
def get_total_loss(add_regularization_losses=True, name="total_loss"):
...
  losses = get_losses()
  if add_regularization_losses:
    losses += get_regularization_losses()
  return math_ops.add_n(losses, name=name)

That is the secret of losses in 'tf.layers' and 'tf.contrib.slim': we should use 'get_total_loss()' to fetch model loss and regularization loss together!

After changing my code:
    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    cross_entropy = tf.reduce_mean(cross_entropy)
    global_step = tf.contrib.framework.get_or_create_global_step()
    loss = tf.contrib.slim.losses.get_total_loss()
    train_op = tf.contrib.slim.learning.create_train_op(loss, opt, global_step = global_step)
...
    sess.run(train_op)

The 'weight_decay' works well now (which means training accuracy could not reach high value easily)

Using multi-GPUs for training in distributed environment of Tensorflow

I am trying to write code for training on multi-GPUs. The code is mainly from the example of ‘Distributed Tensorflow‘. I have changed the code slightly for runing on GPU:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:%d" % (FLAGS.task_index, FLAGS.task_index),
        cluster=cluster)
...

But after launch the script below:

python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
python model.py train 0.9 0.0001 0.53 worker 0 &> worker0.log &
python model.py train 0.9 0.0001 0.53 worker 1 &> worker1.log &
...

it reports:

Traceback (most recent call last):
  File "model.py", line 175, in 
    server = tf.train.Server(cluster, job_name = job_name, task_index = task_index)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816

Seems one MonitoredTrainingSession will occupy all the memory of GPUs. After search on google, I finally get a solution: ‘CUDA_VISIBLE_DEVICES’.
Firstly, change ‘replica_device_setter’:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:0" % FLAGS.task_index,
        cluster=cluster)
...

and then use this shell script to launch training processes:

CUDA_VISIBLE_DEVICES=0 python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
sleep 1
for i in `seq 0 2`; do
  dev=`expr ${i} + 1`
  CUDA_VISIBLE_DEVICES=${dev} stdbuf -o0 python model.py train 0.9 0.0001 0.53 worker ${i} &> worker_${i}.log &
  sleep 1
done

The ‘ps’ will only use GPU0, ‘worker0’ will only use GPU1, ‘worker1’ will only use GPU2 etc.

Reinforcement Learning example for tree search

I have been learning Reinforcement Learning for about two weeks. Although haven’t go through all the course of Arthur Juliani, I had been able to write a small example of Q-learning now.
This example is about using DNN for Q-value table to solve a path-finding-problem. Actually, the path is more looks like a tree:

The start point is ‘0’, and the destination (or ‘goal’) is ’12’.
The code framework of my example is mainly from Manuel Amunategui’s tutorial but replacing Q-value table with a one-layer-neural-network.

import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
import pylab as plt
MATRIX_SIZE = 15
goal = 12
points_list = [(0,1), (0,2), \
        (1,3), (1,4), (2,5), (2,6), \
        (3,7), (3,8), (4,9), (4,10), \
        (5,11), (5,12), (6,13), (6,14)]
# Build feed-forward network by using 'state' as input, 'best action' as output
state_in = tf.placeholder(tf.int32, [1])
state_oh = slim.one_hot_encoding(state_in, 15)
output = slim.fully_connected(state_oh, 15,
        biases_initializer = None, activation_fn = tf.nn.relu,
        weights_initializer = tf.ones_initializer())
outputQ = tf.reshape(output, [-1])
chosen_action = tf.argmax(outputQ, 0)
nextQ = tf.placeholder(tf.float32, [15])
loss = tf.reduce_sum(tf.square(nextQ - outputQ))
# Gradient Descent Optimizer usually have better generalization performance
optimizer = tf.train.GradientDescentOptimizer(0.1)
update = optimizer.minimize(loss)
# Build reward matrix
R = np.matrix(np.ones(shape=(MATRIX_SIZE, MATRIX_SIZE)))
# Set extremely low reward (minus) for unconnected nodes
R *= -1000
for point in points_list:
    if point[1] == goal:
        R[point] = 100
    else:
        R[point] = 0
    if point[0]== goal:
        R[point[::-1]] = 100
    else:
        R[point[::-1]] = 0
R[goal, goal] = 100
# learning parameter
gamma = 0.9
# Epsilon-Greedy Algorithm
e = 0.1
# Training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    reward_list = []
    for j in range(50):
        all_reward = 0
        for i in range(10):
            current_state = np.random.randint(0, 15)
            # Use current state to predict best action
            action, allQ = sess.run([chosen_action, outputQ],
                    feed_dict = {state_in: [current_state]})
            if np.random.rand(1) < e:
                action = np.random.randint(0, 15, dtype = np.int)
            new_state = action
            Q1 = sess.run(outputQ,
                    feed_dict = {state_in: [new_state]})
            maxQ1 = np.max(Q1)
            reward = R[current_state, action]
            targetQ = allQ
            targetQ[action] = reward + gamma * maxQ1
            # Use next state and next Q-values to train neural network
            _, read_loss = sess.run([update, loss],
                    feed_dict = {state_in: [current_state], nextQ: targetQ})
            all_reward += reward
        reward_list.append(all_reward)
    # show curve of reward in different training steps
    plt.plot(reward_list)
    plt.show()
    # Testing
    current_state = 0
    steps = [current_state]
    while current_state != goal:
        action = sess.run([chosen_action],
                feed_dict = {state_in: [current_state]})
        steps.append(action[0])
        current_state = action[0]
    print("Most efficient path:")
    print(steps)

The rewards curve in training steps:

And this example will finally report:

Most efficient path:
[0, 2, 5, 12]

which is the correct answer.

Robin on Linux

Monthly Archives: July 2018