Yearly Archives: 2017

Some tips about Tensorflow

Q: How to fix error report like

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input 0 of node
Adam/update_embeddings/AssignSub was passed float from _recv_embeddings_0:0
incompatible with expected float_ref

A: We can’t feed a value into a variable and optimize it in the same time (So the problem only occurs when using Optimizers). Should using ‘tf.assign()’ in graph to give value to tf.Variable
Q: How to get a tensor by name?
A: like this:

tensor = tf.get_default_graph().get_tensor_by_name("example:0")

Q: How to get variable by name?
A:

my_var = [v for v in tf.global_variables() if v.name == "myvar_23:0"][0]

How to average gradients in Tensorflow

Sometimes, we need to average an array of gradients in deep learning model. Fortunately, Tensorflow divided models into fine-grained tensors and operations, therefore it’s not difficult to implement gradients average by using it.
Let’s see the code from github：

with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()
            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
            # Calculate the gradients for the batch of data on this CIFAR tower.
            grads = opt.compute_gradients(loss)
            # Keep track of the gradients across all towers.
            tower_grads.append(grads)
    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)
......
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

We should keep in mind that these codes will only build a static graph (the ‘grads; are references rather than values).

def average_gradients(tower_grads):
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # Note that each grad_and_vars looks like the following:
    #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
    grads = []
    for g, _ in grad_and_vars:
      # Add 0 dimension to the gradients to represent the tower.
      expanded_g = tf.expand_dims(g, 0)
      # Append on a 'tower' dimension which we will average over below.
      grads.append(expanded_g)
    # Average over the 'tower' dimension.
    grad = tf.concat(axis=0, values=grads)
    grad = tf.reduce_mean(grad, 0)
    # Keep in mind that the Variables are redundant because they are shared
    # across towers. So .. we will just return the first tower's pointer to
    # the Variable.
    v = grad_and_vars[0][1]
    grad_and_var = (grad, v)
    average_grads.append(grad_and_var)
  return average_grads

First, we need to expand dimensions of tensor(gradient) and concatenate them. Then use reduce_mean() to do actually average operation (seems not intuitive).

A basic example of using Tensorflow to regress

In theory of Deep Learning, even a network with single hidden layer could represent any function of mathematics. To verify it, I write a Tensorflow example as below:

import tensorflow as tf
hidden_nodes = 1024
def weight_variable(shape):
  """weight_variable generates a weight variable of a given shape."""
  initial = tf.truncated_normal(shape, stddev=1.0, mean=1.0)
  return tf.Variable(initial)
def bias_variable(shape):
  """bias_variable generates a bias variable of a given shape."""
  initial = tf.constant(0.01, shape=shape)
  return tf.Variable(initial)
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    a = tf.reshape(tf.tanh(x), [1, -1])
    b = tf.reshape(tf.square(x), [1, -1])
    basic = tf.concat([a, b], 0)
    with tf.name_scope('fc1'):
      W_fc1 = weight_variable([hidden_nodes, 2])
      b_fc1 = bias_variable([1])
      linear_model = tf.nn.relu(tf.matmul(W_fc1, basic) + b_fc1)
# loss
loss = tf.reduce_sum(tf.abs(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)
# training data
x_train = range(0, 10)
y_train = range(0, 10)
init = tf.global_variables_initializer()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(init) # reset values to wrong
for i in range(3000):
  sess.run(train, {x: x_train, y: y_train})
  # evaluate training accuracy
  curr_basic, curr_w, curr_a, curr_b, curr_loss = sess.run([basic, W_fc1, a, b, loss], {x: x_train, y: y_train})
  print("loss: %s" % (curr_loss))

In this code, it was trying to regress to a number from its own sine-value and cosine-value.
At first running, the loss didn’t change at all. After I changed learning rate from 1e-3 to 1e-5, the loss slowly went down as normal. I think this is why someone call Deep Learning a “Black Magic” in Machine Learning area.

Fix Resnet-101 model in example of MXNET

SSD(Single Shot MultiBox Detector) is the fastest method in object-detection task (Another detector YOLO, is a little bit slower than SSD). In the source code of MXNET，there is an example for SSD implementation. I test it by using different models: inceptionv3, resnet-50, resnet-101 etc. and find a weird phenomenon: the size .params file generated by resnet-101 is smaller than resnet-50.

Model	Size of .params file
resnet-50	119MB
resnet-101	69MB

Since deeper network have larger number of parameters, resnet-101 has smaller file size for parameters seems suspicious.
Reviewing the code of example/ssd/symbol/symbol_factory.py:

    elif network == 'resnet50':
        num_layers = 50
        image_shape = '3,224,224'  # resnet require it as shape check
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()
    elif network == 'resnet101':
        num_layers = 101
        image_shape = '3,224,224'
        network = 'resnet'
        from_layers = ['_plus12', '_plus15', '', '', '', '']
        num_filters = [-1, -1, 512, 256, 256, 128]
        strides = [-1, -1, 2, 2, 2, 2]
        pads = [-1, -1, 1, 1, 1, 1]
        sizes = [[.1, .141], [.2,.272], [.37, .447], [.54, .619], [.71, .79], [.88, .961]]
        ratios = [[1,2,.5], [1,2,.5,3,1./3], [1,2,.5,3,1./3], [1,2,.5,3,1./3], \
            [1,2,.5], [1,2,.5]]
        normalizations = -1
        steps = []
        return locals()

Why resnet-50 and resnet-101 has the same ‘from_layers’ ? Let’s check these two models:

In resnet-50, the SSD use two layers (as show in red line) to extract features. One from output of stage-3, another from output of stage-4. In resnet-101, it should be the same (as show in blue line), but it incorrectly copy the config code of resnet-50. The correct ‘from_layers’ for resnet-100 is:

from_layers = ['_plus29', '_plus32', '', '', '', '']

This seems like a bug, so I create a pull request to try fixing it.

“Eager Mode” in Tensorflow

Although Tensorflow is the most popular Deep Learning Framework in 2016, Pytorch, a smaller new framework developed by FAIR(Facebook AI Research)， become a dark horse this year. Pytorch supports Dynamic Graph Computing, which means you can freely add or remove layers in your model at runtime. It makes developer or scientist build new models more rapidly.
To fight back Pytorch, Tensorflow team add a new mechanism named “Eager Mode”, in which we could also use Dynamic Graph Computing. The example of “Eager Mode” looks like:

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution() #Enalbe Eager Execution Mode
x = [[2.]]
m = tf.matmul(x, x)
print(m)

As above, unlike traditional Tensorflow application that use “Session.run()” to execute whole graph, developers could see values and gradients of variables in any layer at any step.
How did Tensorflow do it? Actually, the tricks behind the API is not difficult. Take the most common Operation ‘matmul’ as example:

# file: tensorflow/python/ops/math_ops.py
def matmul(a,
           b,
           transpose_a=False,
           transpose_b=False,
           adjoint_a=False,
           adjoint_b=False,
           a_is_sparse=False,
           b_is_sparse=False,
           name=None):
......
    if use_sparse_matmul:
      return sparse_matmul(
          a,
          b,
          transpose_a=transpose_a,
          transpose_b=transpose_b,
          a_is_sparse=a_is_sparse,
          b_is_sparse=b_is_sparse,
          name=name)
    else:
      return gen_math_ops._mat_mul(
          a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)

Le’t look into “gen_math_ops._mat_mul()”:

# file: bazel-genfiles/tensorflow/python/ops/gen_math_ops.py
def _mat_mul(a, b, transpose_a=False, transpose_b=False, name=None):
......
  if _ctx.in_graph_mode():
    _, _, _op = _op_def_lib._apply_op_helper(
        "MatMul", a=a, b=b, transpose_a=transpose_a, transpose_b=transpose_b,
        name=name)
    _result = _op.outputs[:]
    _inputs_flat = _op.inputs
    _attrs = ("transpose_a", _op.get_attr("transpose_a"), "transpose_b",
              _op.get_attr("transpose_b"), "T", _op.get_attr("T"))
  else:
    _attr_T, _inputs_T = _execute.args_to_matching_eager([a, b], _ctx)
    (a, b) = _inputs_T
    _inputs_flat = [a, b]
    _attrs = ("transpose_a", transpose_a, "transpose_b", transpose_b, "T",
              _attr_T)
    _result = _execute.execute(b"MatMul", 1, inputs=_inputs_flat,
                               attrs=_attrs, ctx=_ctx, name=name)
  _execute.record_gradient(
      "MatMul", _inputs_flat, _attrs, _result, name)
  _result, = _result
  return _result

As we can see, in Graph Mode, it will go to “_apply_op_helper()” to build graph (but not running it). In Eager Mode, it will execute the Operation directly.

Training DNN with less memory cost

The paper “Training Deep Nets with Sublinear Memory Cost” tells us a practical method to train DNN with far less memory cost. The mechanism behind is not difficult to understand: when training a deep network (a computing graph), we have to store temporary data in every node, which will occupy extra memory. Actually, we could remove these temporary data after computing each node, and compute them again in back-propagation period. It’s a tradeoff between computing time and computing space.
The author give us an example in MXNET. The improvement of memory-reducing seems tremendous.
Above the version 1.3, tensorflow also brought a similar module: memory optimizer. We can use it like this:

from tensorflow.core.protobuf import rewriter_config_pb2
....
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
config.graph_options.rewrite_options.memory_optimization = memory_optimization=rewriter_config_pb2.RewriterConfig.MANUAL
with tf.Session(config=config) as sess:
....

Still need to add op in Resnet:

# Add 'reshape' with a special name at the end of every residual-unit
shape = x.get_shape()
dims = []
for i in range(shape.ndims):
    dims.append(shape.dims[i].value)
x = tf.reshape(x, dims, name='robin_137')
....
def ignore(op):
    return op.type in ['Const', 'VariableV2', 'Identity', 'Assign', 'Placeholder', 'RandomShuffleQueueV2', 'QueueEnqueueV2', 'QueueDequeueManyV2']
def checkpoint(op):
    m = re.compile(".*robin_137").match(op.name)
    if m:
        return True
    return False
# Set attribute here
ops = tf.get_default_graph().get_operations()
mirrors = filter(lambda op: not ignore(op) and not checkpoint(op), ops)
for op in mirrors:
    op.node_def.attr['_recompute_hint'].i = 0

By using this method, we could increase batch-size even in deep network (Resnet-101 etc.) now.

The performance of R-CNN in mxnet

We are trying to use faster R-CNN network (also is an example in mxnet) to automatically extract bird from pictures. But it will cost 10 seconds to recognize a bird from a picture by using CPU, which is too slow to be used in product environment. To improve the performance, I download the MKL with version-2017u4 from Intel site and install it in the server. After recompile mxnet:

make clean
make USE_BLAS=openblas USE_CUDNN=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDA=1 USE_MKL2017=1 -j

it only cost 3~4 seconds to recognize bird from picture. MKL really works!
Using GPU to do inference is a another option. But a EC2 instance with a GPU device is much more expensive than a normal EC2 instance. So we will still using CPU in the near future.

Price of EC2 instance in US-West(Oregon)

	vCPU	ECU	Memory (GiB)	Instance Storage (GB)	Linux/UNIX Usage
t2.large	2	Variable	8	EBS Only	$0.0928 per Hour
g2.2xlarge	8	26	15	60 SSD	$0.65 per Hour

Use Mxnet To Classify Images Of Birds (Fourth Episode)

More than half a year past since previous article. In this period, Alan Mei (my old ex-colleague) collected more than 1 million pictures of Chinese Avians. And after Alexnet, VGG19, I finally chose Resnet-18 as my DNN model to classify different kinds of Chinese birds. Resnet-18 model has far less parameters of network than VGG19, but still get enough capability of representation.
Collecting more than 1 million sample pictures of birds and label them (some by the program, and some by hand) is really a tedious job. I really appreciate Alan Mei for accepting so hard a job, although he said he is an Avian fan :). And I also need to thank him for giving me a Personal Computer with GTX970 GPU. Without this GPU, I would not train my model so fast.
To make the accuracy of classifying better, I have read the book “Deep Learning” and many other papers (Not only Resnet-paper, of course). The reward of knowledge about machine learning and deep learning is abundant for me. But the most important of all is: I enjoyed the learning of new technology again.
Today, we launch this simple web: http://en.dongniao.net/ . In the Chinese language, “dongniao” means “Understanding Avians”. Hope the Avian-Fans and Depp Learning Fans will love it.

Using Spark-SQL to transfer CSV file to Parquet

After downloading data from “Food and Agriculture Organization of United Nations”, I get many CSV files. One of the file is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like:

Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1961","1961","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1962","1962","tonnes","0.000000",""
"2","Afghanistan","231","Almonds shelled","5910","Export Quantity","1963","1963","tonnes","0.000000",""
......

To load this CSV file into Spark and dump it to Parquet format, I wrote these codes:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd._
/* Area Code,Area,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag */
case class Trade(area_code:Int, area:String, item_code:Int, item:String, element_code:Int,
                 element:String, year:Int, unit:String, value:Double, flag:String)
object TradeCrops {
  def scrub(str:String):String = {
    return str.replace("\"", "")
  }
  def toInt(str:String):Int = {
    try {
      return scrub(str).toInt
    } catch {
      case e:Throwable => {
        return 0
      }
    }
  }
  def toDouble(str:String):Double = {
    try {
      return scrub(str).toDouble
    } catch {
      case e:Throwable => {
        return 0
      }
    }
  }
  def toTrade(line:String):Trade = {
    val fields = line.split("\",")
    Trade(
      toInt(fields(0)),
      scrub(fields(1)),
      toInt(fields(2)),
      scrub(fields(3)),
      toInt(fields(4)),
      scrub(fields(5)),
      toInt(fields(7)),
      scrub(fields(8)),
      toDouble(fields(9)),
      scrub(fields(10))
    )
  }
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Trade Crops Application")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder()
        .appName("Spark SQL Trade Crops")
        .getOrCreate()
    val file = sc.textFile("hdfs:///FAO/Trade_Crops.csv")
    val tradeRDD = file.filter(_.split("\",").length == 11).map(toTrade(_))
    val tradeDF = spark.createDataFrame(tradeRDD)
    tradeDF.write.parquet("hdfs:///FAO/Trade_Crops.parquet")
  }
}

The build.sbt is

lazy val root = (project in file("."))
  .settings(
    name := "FAO",
    version := "1.0",
    scalaVersion := "2.11.7",
    unmanagedJars in Compile += file("/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.6.jar"),
    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.1.1",
      "org.apache.spark" % "spark-sql_2.11" % "2.1.1",
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"
    )
  )

Always remember to add dependency for “spark-sql” or else it will report “createDataFrame() if not a member of spark”.
And finally, the submit script is:

/disk1/spark-2.1.1-bin-hadoop2.6/bin/spark-submit --class TradeCrops \
  --master yarn \
  --driver-memory 2G \
  --executor-memory 2G \
  --executor-cores 1 \
  --num-executors 64 \
  ./target/scala-2.11/FAO_2.11-1.0.jar

Importance of function’s return Type in Scala

The Scala code is below:

abstract class Human {
        def call
}
class Student(val name: String) extends Human {
        override def call = "\"" + name + "\""
}
object MyApp extends App {
        var s:Student = new Student("robin")
        println(s.call)
}

But the output of this section of code is:

()

Nothing, just a pair of parenthesis. What happen? Why doesn’t Scala use the ‘call’ function in subclass ‘Student’?
The answer is: we forget to define the return Type of function ‘call’, so its default return Type is ‘Unit’. The value of ‘Unit’ is only ‘()’. Hence no matter what value we give to ‘call’, it only return ‘()’.
We just to add return Type:

abstract class Human {
        def call: String
}
......

It print “robin” now.