Technical Meeting with Nvidia Corporation

Last week I went to Nvidia Corporation of Santa Clara (California) with my colleagues to join a technical meeting about cutting-edge hardware and software of Deep Learning.



The new office building of NVIDIA

At first day, team leaders from Nvidia introduced their developing plan of new hardware and software. The new hardwares are about Tesla V100, NVLink, and HGX (next generation of DGX). And the softwares are about CUDA-9.2 NCCL-2.0, and TensorRT-3.0

Here are some notes about their introducing:

  • The next generation of Tesla P4 GPU will have tensor-core, 16GB memory, and H264 decoder (performance as Tesla P100) for better inference performance, especially for image/video processing.
  • The software support of tensor-core (mainly in Tesla V100 GPU) has been integrated in Tensorflow-1.5 version.
  • The TensorRT could turn three layers of Deep Learning (Conv layer, Bias layer, Relu layer) to one CBR layer, eliminate concatenation layers, to accelerate inference computing.
  • The tool ‘nvidia-smi’ could show ‘util’ of GPU. But ‘80%’ util only means this GPU run task (no matter how many CUDA-cores has been used) for 0.8 second in one second period. Therefore it’s not a accurate metrics for real GPU load. NVPROF is the much powerful and accurate tool for profiling of GPU



The TITAN V GPU

At second day, many teams from Alibaba (my company) ask Nvidia different questions. Here are some questions and answers:

Q: Some Deep Learning Compilers such as XLA (Google) and TVM(from AWS) could compile python code to GPU intermediate representation directly. How will Nvidia work with these application-oriented compiler?

A: The google XLA team will be shut off and move to optimize TPU performance only. Nvidia will still focus on library such as CUDA/cuDNN/TensorRT and will not build frameworks like Tensorflow or Mxnet.

Q: There are many new hardwares launched for Deep Learning: Google’s TPU, some ASICs developed by other companies. How will Nvidia keep cost performance over these new competitors?

A: ASICs are not programmable. If models of Deep Learning changes, the ASIC will be in trash. For example, TPU has Relu/Conv instructions, but if it comes new type of activation function, it will not work anymore. Furthermore, customers can only run TPU on Google’s cloud, which means they have to put their data on cloud, without other choices.



The DGX server

We also visited the Demo Room of Nvidia’s state-of-art hardware for auto-driving and deep learning. It was an effective meeting, and we learn a lot.



The car of auto-driving testing platform



I am standing before the NVIDIA logo

Sharing Variables in Tensorflow

This article shows how to use sharing variables in Tensroflow. But I still have a question: dose sharing variables have the same value? To answer this question, I write these code below:

Therefore, the “sharing variables” mechanism is made only for convenience of writing short code to create multi-models. For sharing same value for different variables, we still need ‘assign’ operation.

Some tips about Tensorflow

Q: How to fix error report like

A: We can’t feed a value into a variable and optimize it in the same time (So the problem only occurs when using Optimizers). Should using ‘tf.assign()’ in graph to give value to tf.Variable

Q: How to get a tensor by name?

A: like this:

Q: How to get variable by name?

A:

How to average gradients in Tensorflow

Sometimes, we need to average an array of gradients in deep learning model. Fortunately, Tensorflow divided models into fine-grained tensors and operations, therefore it’s not difficult to implement gradients average by using it.

Let’s see the code from github

We should keep in mind that these codes will only build a static graph (the ‘grads; are references rather than values).

First, we need to expand dimensions of tensor(gradient) and concatenate them. Then use reduce_mean() to do actually average operation (seems not intuitive).

A basic example of using Tensorflow to regress

In theory of Deep Learning, even a network with single hidden layer could represent any function of mathematics. To verify it, I write a Tensorflow example as below:

In this code, it was trying to regress to a number from its own sine-value and cosine-value.
At first running, the loss didn’t change at all. After I changed learning rate from 1e-3 to 1e-5, the loss slowly went down as normal. I think this is why someone call Deep Learning a “Black Magic” in Machine Learning area.

“Eager Mode” in Tensorflow

Although Tensorflow is the most popular Deep Learning Framework in 2016, Pytorch, a smaller new framework developed by FAIR(Facebook AI Research), become a dark horse this year. Pytorch supports Dynamic Graph Computing, which means you can freely add or remove layers in your model at runtime. It makes developer or scientist build new models more rapidly.
To fight back Pytorch, Tensorflow team add a new mechanism named “Eager Mode”, in which we could also use Dynamic Graph Computing. The example of “Eager Mode” looks like:

As above, unlike traditional Tensorflow application that use “Session.run()” to execute whole graph, developers could see values and gradients of variables in any layer at any step.

How did Tensorflow do it? Actually, the tricks behind the API is not difficult. Take the most common Operation ‘matmul’ as example:

Le’t look into “gen_math_ops._mat_mul()”:

As we can see, in Graph Mode, it will go to “_apply_op_helper()” to build graph (but not running it). In Eager Mode, it will execute the Operation directly.

Training DNN with less memory cost

The paper “Training Deep Nets with Sublinear Memory Cost” tells us a practical method to train DNN with far less memory cost. The mechanism behind is not difficult to understand: when training a deep network (a computing graph), we have to store temporary data in every node, which will occupy extra memory. Actually, we could remove these temporary data after computing each node, and compute them again in back-propagation period. It’s a tradeoff between computing time and computing space.

The author give us an example in MXNET. The improvement of memory-reducing seems tremendous.

Above the version 1.3, tensorflow also brought a similar module: memory optimizer. We can use it like this:

Still need to add op in Resnet:

By using this method, we could increase batch-size even in deep network (Resnet-101 etc.) now.