## Hard training works in deep learning

This week, I was trying to train two deep-learning models. They are different from my previous training job: they are really hard to converge to a small ‘loss’.

The first model is about bird image classification. Previously we wrote a modified Resnet-50 model by using MXNet and could use it to reach 78% evaluation-accuracy. But after we rewrote the same model by using Tensorflow, it could only reach 50% evaluation-accuracy, which seems very weird. The first thing that in my mind is that it’s a regularization problem, so I randomly pad/crop and rotate the training images:

By data augmentation, the evaluation accuracy rise to about 60%, but still far from the result of MXNet.
Then I change the optimizer from AdamOptimizer to GradientDescentOptimizer, since my colleague tell me the AdamOptimizer is too powerful that it tends to cause overfit. And I also add ‘weight_decay’ for my Resnet-50 model. This time, the evaluation accuracy shrived to 76%. The affection of ‘weight_decay’ is significantly positive.

The second model is about object detection. We just use the example of Tensorflow’s model library. It includes many cutting-edge models to implement object detection. I just want to try SSD(Single Shot Detection) on MobileNetV2:

The loss is rapidly reducing from hundreds to twelve, but stay at eleven for a very long time. The loss looks like will stay here forever. Then I begin to adjust hyper-parameters. After testing several learning rates and optimizers, the results doesn’t change at all.
Eventually, I noticed that the loss doesn’t stay forever, it WILL REDUCE AT LAST. For some tasks such as classification, its loss will converge significantly. But for other tasks such as object detection, its loss will shrink at extremely slow speed. Use AdamOptimizer and small learning rate is a better choice for this type of task.

## To check abnormal loss value when training a new model

Yesterday I wrote a Tensorflow program to train CIFAR100 dataset with Resnet-50 model. But when the training begin, I saw the ‘loss’ of classification is abnormally big and didn’t reduce at all:

Firstly, I thought the code for processing dataset may be wrong. But after print out the data in console, the loading input data seems all right. Then I print all the value of tensors right after initialization of model. And these value seems correct either.
Without other choices, I began to check the initializer in Tensorflow code:

If the loss is too big, maybe I could decrease the initial value of tensors in model? Then I change ‘mean’ from ‘0’ to ‘0.1’ for ‘slim.conv2d’:

But the loss seems more crazy:

I have to change ‘mean’ and ‘stddev’ again:

This time, the loss seems correct now.

This is the first time I saw that initialized value could make the training accuracy so different.

## An example for running operation before fetching data in Tensorflow

In tensorflow, what should we do if we want run something before fetching data (such as, using queue in tensorflow)? Here is an example tested by myself:

It will print

Successfully, we add an operation before enqueue a item into queue.

## Why my model doesn’t converge?

To use Resnet-50 to run CIFAR100 dataset, I wrote a program by using Tensorflow. But when running it, the loss seems keeping in about 4.5~4.6 forever:

After changed models (from Resnet to fully-connect-net), optimizers (from AdamOptimizer to AdagradOptimizer), and even learning rate (from 1e-3 to even 1e-7), the phenomena didn’t change at all.
Finally, I checked the loss and the output vector step by step, and found that the problem is not in model but dataset code:

Every batch of data have the same pictures and same labels! Than’t why the model didn’t converge. I should have used ‘i’ instead of ‘self.pos’ as index to fetch data and labels.

So in DeepLearning area, problems comes not only from models and hyper-parameters, but also dataset, or faulty codes…

## Problem about using slim.batch_norm() of Tensorflow (second episode)

In previous article, I have found out the reason. But how to resolve it on Multi-GPU-Training is still a question. As the suggestion of this issue in github, I tried two way to fix the problem:

First, rewrite my Averaging-Gradients-Training to learn tf.slim.create_train_op():

But unfortunately, this didn’t work at all. The inference result was still a mess.

Then, another way, I use Asynchronous-Gradient-Training and tf.slim.create_train_op():

Now the inference works very well! And the training speed become a little bit faster than Averaging-Gradients-Training, for the Averaging Operation needs to wait multi gradients from multi GPUs.

## Problem about using slim.batch_norm() of Tensorflow

After using resnet_v2_50 in tensorflow/models, I found that the inference result is totally incorrect, though the training accuracy looks very well.
Firstly, I suspected the regularization of samples:

Indeed I had extended the image to a too big size. But after I changing padding size to ’10’, the inference accuracy was still incorrect.
Then I checked the code about importing data:

and changed my inference code as the data importing routines. But the problem still existed.

About one week past. Finally, I found this issue in Github. It explains all my questions: the cause is the slim.batch_norm(). After I adding these code to my program (learning from slim.create_train_op()):

The inference accuracy is — still low. Without other choice, I removed all slim.batch_norm() in resnet_v2.py, and at this time inference accuracy becomes the same with training accuracy.
Looks problem partly been solved, but I still need to find out why sli.batch_norm() doesn’t work well in inference …

## Experiment for distributed Tensorflow

Here is my experimental code for distributed Tensorflow, which is learned from the example.

The important thing is that we need to use tf.assign() to push Variable back to Parameter Server. The operation ‘tf.add’ was about to run on the task0 of worker in this example. But if we deploy more complicated application by many tasks, things became weird: a pipeline operation sometimes even runs on ‘ps’ role! The official solution to this problem is using ‘tf.train.replica_device_setter()’， which will automatically deploy Variables to parameter servers and Operations (many replicas) to many workers. What did ‘tf.train.replica_device_setter()’ do? Let’s see the backbone code of its implementation:

All the Variables will be counted as ‘ps_ops’, and the deploy strategy for Operations will be replication, for it’s called ‘_ReplicaDeviceChooser’.

All the ‘op’ in ‘self._ps_ops’ will be put into ‘ps_device’.

## The CSE (Common Subexpression Elimination) problem about running custom operation in Tensorflow

Recently, we create a new custom operation in Tensorflow:

It’s as simple as the example in Tensorflow’s document. But when we run this Op in session:

It only get image_ids from network once, and then use the result of first ‘run’ forever, without even call ‘Compute()’ function in cpp code again!

Seems Tensorflow optimized the new Op and never run it twice. My colleague give a suggestion to solve this problem by using tf.placeholder:

Looks a little tricky. The final solution is add flag in cpp code to let new Op to avoid CSE (Common Subexpression Elimination):

Attachment of the ‘CMakeLists.txt’:

## Compute gradients of different part of model in Tensorflow

In Tensorflow, we could use Optimizer to train model:

But sometimes, model need to be split to two parts and trained separately, so we need to compute gradients and apply them by two steps:

Then how could we delivery gradients from first part to second part? Here is the equation to answer:

$\frac{\partial Loss} {\partial W_{second-part}} = \frac{\partial Loss} {\partial IV} \cdot \frac{\partial IV} {\partial W_{second-part}}$

The $IV$ means ‘intermediate vector’, which is the interface vector between first-part and second-part and it is belong to both first-part and second-part. The $W_{second-part}$ is the weights of second part of model. Therefore we could use tf.gradients() to connect gradients of two parts:

## Technical Meeting with Nvidia Corporation

Last week I went to Nvidia Corporation of Santa Clara (California) with my colleagues to join a technical meeting about cutting-edge hardware and software of Deep Learning.

The new office building of NVIDIA

At first day, team leaders from Nvidia introduced their developing plan of new hardware and software. The new hardwares are about Tesla V100, NVLink, and HGX (next generation of DGX). And the softwares are about CUDA-9.2 NCCL-2.0, and TensorRT-3.0

Here are some notes about their introducing:

• The next generation of Tesla P4 GPU will have tensor-core, 16GB memory, and H264 decoder (performance as Tesla P100) for better inference performance, especially for image/video processing.
• The software support of tensor-core (mainly in Tesla V100 GPU) has been integrated in Tensorflow-1.5 version.
• The TensorRT could turn three layers of Deep Learning (Conv layer, Bias layer, Relu layer) to one CBR layer, eliminate concatenation layers, to accelerate inference computing.
• The tool ‘nvidia-smi’ could show ‘util’ of GPU. But ‘80%’ util only means this GPU run task (no matter how many CUDA-cores has been used) for 0.8 second in one second period. Therefore it’s not a accurate metrics for real GPU load. NVPROF is the much powerful and accurate tool for profiling of GPU

The TITAN V GPU

At second day, many teams from Alibaba (my company) ask Nvidia different questions. Here are some questions and answers:

Q: Some Deep Learning Compilers such as XLA (Google) and TVM(from AWS) could compile python code to GPU intermediate representation directly. How will Nvidia work with these application-oriented compiler?

A: The google XLA team will be shut off and move to optimize TPU performance only. Nvidia will still focus on library such as CUDA/cuDNN/TensorRT and will not build frameworks like Tensorflow or Mxnet.

Q: There are many new hardwares launched for Deep Learning: Google’s TPU, some ASICs developed by other companies. How will Nvidia keep cost performance over these new competitors?

A: ASICs are not programmable. If models of Deep Learning changes, the ASIC will be in trash. For example, TPU has Relu/Conv instructions, but if it comes new type of activation function, it will not work anymore. Furthermore, customers can only run TPU on Google’s cloud, which means they have to put their data on cloud, without other choices.

The DGX server

We also visited the Demo Room of Nvidia’s state-of-art hardware for auto-driving and deep learning. It was an effective meeting, and we learn a lot.

The car of auto-driving testing platform

I am standing before the NVIDIA logo