Monthly Archives: November 2017

“Eager Mode” in Tensorflow

Although Tensorflow is the most popular Deep Learning Framework in 2016, Pytorch, a smaller new framework developed by FAIR(Facebook AI Research)， become a dark horse this year. Pytorch supports Dynamic Graph Computing, which means you can freely add or remove layers in your model at runtime. It makes developer or scientist build new models more rapidly.
To fight back Pytorch, Tensorflow team add a new mechanism named “Eager Mode”, in which we could also use Dynamic Graph Computing. The example of “Eager Mode” looks like:

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution() #Enalbe Eager Execution Mode
x = [[2.]]
m = tf.matmul(x, x)
print(m)

As above, unlike traditional Tensorflow application that use “Session.run()” to execute whole graph, developers could see values and gradients of variables in any layer at any step.
How did Tensorflow do it? Actually, the tricks behind the API is not difficult. Take the most common Operation ‘matmul’ as example:

# file: tensorflow/python/ops/math_ops.py
def matmul(a,
           b,
           transpose_a=False,
           transpose_b=False,
           adjoint_a=False,
           adjoint_b=False,
           a_is_sparse=False,
           b_is_sparse=False,
           name=None):
......
    if use_sparse_matmul:
      return sparse_matmul(
          a,
          b,
          transpose_a=transpose_a,
          transpose_b=transpose_b,
          a_is_sparse=a_is_sparse,
          b_is_sparse=b_is_sparse,
          name=name)
    else:
      return gen_math_ops._mat_mul(
          a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)

Le’t look into “gen_math_ops._mat_mul()”:

# file: bazel-genfiles/tensorflow/python/ops/gen_math_ops.py
def _mat_mul(a, b, transpose_a=False, transpose_b=False, name=None):
......
  if _ctx.in_graph_mode():
    _, _, _op = _op_def_lib._apply_op_helper(
        "MatMul", a=a, b=b, transpose_a=transpose_a, transpose_b=transpose_b,
        name=name)
    _result = _op.outputs[:]
    _inputs_flat = _op.inputs
    _attrs = ("transpose_a", _op.get_attr("transpose_a"), "transpose_b",
              _op.get_attr("transpose_b"), "T", _op.get_attr("T"))
  else:
    _attr_T, _inputs_T = _execute.args_to_matching_eager([a, b], _ctx)
    (a, b) = _inputs_T
    _inputs_flat = [a, b]
    _attrs = ("transpose_a", transpose_a, "transpose_b", transpose_b, "T",
              _attr_T)
    _result = _execute.execute(b"MatMul", 1, inputs=_inputs_flat,
                               attrs=_attrs, ctx=_ctx, name=name)
  _execute.record_gradient(
      "MatMul", _inputs_flat, _attrs, _result, name)
  _result, = _result
  return _result

As we can see, in Graph Mode, it will go to “_apply_op_helper()” to build graph (but not running it). In Eager Mode, it will execute the Operation directly.

Training DNN with less memory cost

The paper “Training Deep Nets with Sublinear Memory Cost” tells us a practical method to train DNN with far less memory cost. The mechanism behind is not difficult to understand: when training a deep network (a computing graph), we have to store temporary data in every node, which will occupy extra memory. Actually, we could remove these temporary data after computing each node, and compute them again in back-propagation period. It’s a tradeoff between computing time and computing space.
The author give us an example in MXNET. The improvement of memory-reducing seems tremendous.
Above the version 1.3, tensorflow also brought a similar module: memory optimizer. We can use it like this:

from tensorflow.core.protobuf import rewriter_config_pb2
....
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
config.graph_options.rewrite_options.memory_optimization = memory_optimization=rewriter_config_pb2.RewriterConfig.MANUAL
with tf.Session(config=config) as sess:
....

Still need to add op in Resnet:

# Add 'reshape' with a special name at the end of every residual-unit
shape = x.get_shape()
dims = []
for i in range(shape.ndims):
    dims.append(shape.dims[i].value)
x = tf.reshape(x, dims, name='robin_137')
....
def ignore(op):
    return op.type in ['Const', 'VariableV2', 'Identity', 'Assign', 'Placeholder', 'RandomShuffleQueueV2', 'QueueEnqueueV2', 'QueueDequeueManyV2']
def checkpoint(op):
    m = re.compile(".*robin_137").match(op.name)
    if m:
        return True
    return False
# Set attribute here
ops = tf.get_default_graph().get_operations()
mirrors = filter(lambda op: not ignore(op) and not checkpoint(op), ops)
for op in mirrors:
    op.node_def.attr['_recompute_hint'].i = 0

By using this method, we could increase batch-size even in deep network (Resnet-101 etc.) now.

The performance of R-CNN in mxnet

We are trying to use faster R-CNN network (also is an example in mxnet) to automatically extract bird from pictures. But it will cost 10 seconds to recognize a bird from a picture by using CPU, which is too slow to be used in product environment. To improve the performance, I download the MKL with version-2017u4 from Intel site and install it in the server. After recompile mxnet:

make clean
make USE_BLAS=openblas USE_CUDNN=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDA=1 USE_MKL2017=1 -j

it only cost 3~4 seconds to recognize bird from picture. MKL really works!
Using GPU to do inference is a another option. But a EC2 instance with a GPU device is much more expensive than a normal EC2 instance. So we will still using CPU in the near future.

Price of EC2 instance in US-West(Oregon)

	vCPU	ECU	Memory (GiB)	Instance Storage (GB)	Linux/UNIX Usage
t2.large	2	Variable	8	EBS Only	$0.0928 per Hour
g2.2xlarge	8	26	15	60 SSD	$0.65 per Hour

Robin on Linux

Monthly Archives: November 2017