Last week I went to Nvidia Corporation of Santa Clara (California) with my colleagues to join a technical meeting about cutting-edge hardware and software of Deep Learning.
The new office building of NVIDIA
At first day, team leaders from Nvidia introduced their developing plan of new hardware and software. The new hardwares are about Tesla V100, NVLink, and HGX (next generation of DGX). And the softwares are about CUDA-9.2 NCCL-2.0, and TensorRT-3.0
Here are some notes about their introducing:
- The next generation of Tesla P4 GPU will have tensor-core, 16GB memory, and H264 decoder (performance as Tesla P100) for better inference performance, especially for image/video processing.
- The software support of tensor-core (mainly in Tesla V100 GPU) has been integrated in Tensorflow-1.5 version.
- The TensorRT could turn three layers of Deep Learning (Conv layer, Bias layer, Relu layer) to one CBR layer, eliminate concatenation layers, to accelerate inference computing.
- The tool ‘nvidia-smi’ could show ‘util’ of GPU. But ‘80%’ util only means this GPU run task (no matter how many CUDA-cores has been used) for 0.8 second in one second period. Therefore it’s not a accurate metrics for real GPU load. NVPROF is the much powerful and accurate tool for profiling of GPU
The TITAN V GPU
At second day, many teams from Alibaba (my company) ask Nvidia different questions. Here are some questions and answers:
Q: Some Deep Learning Compilers such as XLA (Google) and TVM(from AWS) could compile python code to GPU intermediate representation directly. How will Nvidia work with these application-oriented compiler?
A: The google XLA team will be shut off and move to optimize TPU performance only. Nvidia will still focus on library such as CUDA/cuDNN/TensorRT and will not build frameworks like Tensorflow or Mxnet.
Q: There are many new hardwares launched for Deep Learning: Google’s TPU, some ASICs developed by other companies. How will Nvidia keep cost performance over these new competitors?
A: ASICs are not programmable. If models of Deep Learning changes, the ASIC will be in trash. For example, TPU has Relu/Conv instructions, but if it comes new type of activation function, it will not work anymore. Furthermore, customers can only run TPU on Google’s cloud, which means they have to put their data on cloud, without other choices.
The DGX server
We also visited the Demo Room of Nvidia’s state-of-art hardware for auto-driving and deep learning. It was an effective meeting, and we learn a lot.
The car of auto-driving testing platform
I am standing before the NVIDIA logo