Nvidia

Install new driver for old Nvidia Tesla P100

I was trying to launch a GPU VM instance on Google Cloud. But T4, L4, and V100 all failed with “exceeding resource limit”, which probably means a lot of people in my region are using these types of GPUs.

Having no other choice, I launched a VM instance with an old Nvidia Tesla P100 (I first used one about 5 years ago). Then I needed to install its driver, but the installer reported errors:

   *** Failed CC version check. ***

     SYMLINK /tmp/selfgz26389/NVIDIA-Linux-x86_64-515.105.01/kernel/nvidia/nv-kernel.o
     SYMLINK /tmp/selfgz26389/NVIDIA-Linux-x86_64-515.105.01/kernel/nvidia-modeset/nv-modeset-kernel.o
    CONFTEST: hash__remap_4k_pfn
    CONFTEST: set_pages_uc
    CONFTEST: list_is_first
    CONFTEST: set_memory_uc
...
	cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'

At first glance, I suspected the GCC version did not suit this driver. But after downgrading GCC to gcc-10 and then gcc-9, the error was still there.

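Finally, I noticed that the driver for the Tesla P100 is actually very new (release date: 2023.3.30) and that this page mentioned “gcc-12”. That fits the error message: the '-ftrivial-auto-var-init=zero' option was only added in GCC 12, so older compilers reject it. Therefore I upgraded GCC to 12: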

sudo apt install gcc-12
sudo ln -sf /usr/bin/gcc-12 /etc/alternatives/cc

Now the driver can be installed successfully.
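
To see in advance whether the default compiler would hit this error, one option is to ask it to compile a trivial program with the flag from the log. The following is only a sketch: the helper name compiler_accepts is made up for illustration, and it assumes that 'cc' is a GCC-compatible compiler that understands '-x c', '-fsyntax-only', and reading source from stdin.

```python
import subprocess


def compiler_accepts(flag, cc="cc"):
    """Return True if `cc` accepts `flag` when checking a trivial C program.

    Hypothetical helper: assumes a GCC-compatible command line.
    """
    try:
        result = subprocess.run(
            [cc, flag, "-x", "c", "-fsyntax-only", "-"],
            input="int main(void) { return 0; }\n",
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The compiler is not installed or not on PATH.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    flag = "-ftrivial-auto-var-init=zero"
    print(flag, "accepted by cc:", compiler_accepts(flag))
```

If this prints False, the driver installer is likely to fail in the same way until 'cc' points to a newer GCC.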

Accelerate the speed of data loading in PyTorch

Last week I got a desktop computer for training deep learning models. The GPU is a GTX 1050 Ti with 4GB of memory, which is enough for basic object detection training. But the CPU is too old: when I run the training process, CPU idle drops to 0%. I needed to reduce the load on it.
I tried DALI from Nvidia. Although I admit it is powerful, I also noticed that DALI is too specialized to be used with a custom dataset. For example, if I want to use label structures more complicated than just ‘bounding box’ coordinates, I can’t find any code example in DALI that meets this requirement. Besides, the GPU memory in my computer is not big enough, so if I moved the computation from the CPU to the GPU, I would have to reduce the training batch size. That’s not a good option either.
Yesterday, in this post, I saw this suggestion:

You can use jpeg4py, library dedicated to encode big jpeg files much faster than PIL. Just read image using this library, then transform it to PIL.

After changing my code from ‘cv.imread()’ to ‘jpeg4py.JPEG().decode()’, the average training time for 1000 batches of my model improved from 700 seconds to 670 seconds. This is just the simple and useful solution I needed.
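
The swap described above can be sketched roughly as follows. This is not my exact training code: the function names are made up for illustration, and it assumes that OpenCV (imported as cv2) and jpeg4py (which needs libjpeg-turbo) are installed. One thing to watch is the channel order: cv2.imread() returns BGR, while jpeg4py decodes to RGB by default.

```python
import time


def decode_with_opencv(path):
    """Baseline: decode with OpenCV, then convert BGR to RGB."""
    import cv2  # imported lazily so each decoder only needs its own library
    return cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)


def decode_with_jpeg4py(path):
    """Decode with jpeg4py; the returned array is already RGB."""
    import jpeg4py
    return jpeg4py.JPEG(path).decode()


def average_decode_seconds(decode, paths, repeats=3):
    """Average wall-clock seconds per image for a given decode function."""
    start = time.perf_counter()
    for _ in range(repeats):
        for path in paths:
            decode(path)
    return (time.perf_counter() - start) / (repeats * len(paths))
```

Calling average_decode_seconds() once with each decoder on a list of JPEG paths from your own dataset shows whether the change is worth it before touching the training loop.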

Technical Meeting with Nvidia Corporation

Last week I went to Nvidia Corporation in Santa Clara (California) with my colleagues for a technical meeting about cutting-edge Deep Learning hardware and software.

The new office building of NVIDIA

On the first day, team leaders from Nvidia introduced their development plans for new hardware and software. The new hardware includes the Tesla V100, NVLink, and HGX (the next generation of DGX). The software includes CUDA-9.2, NCCL-2.0, and TensorRT-3.0.
Here are some notes from their presentations:

The next generation of the Tesla P4 GPU will have tensor cores, 16GB of memory, and an H264 decoder (with performance like the Tesla P100) for better inference performance, especially for image/video processing.
Software support for tensor cores (mainly in the Tesla V100 GPU) has been integrated into Tensorflow 1.5.
TensorRT can fuse three Deep Learning layers (Conv layer, Bias layer, Relu layer) into one CBR layer and eliminate concatenation layers to accelerate inference.
The ‘nvidia-smi’ tool can show the ‘util’ of a GPU. But ‘80%’ utilization only means that the GPU was running some task (no matter how many CUDA cores were used) for 0.8 seconds out of a one-second period. Therefore it’s not an accurate metric of the real GPU load. NVPROF is a much more powerful and accurate tool for GPU profiling.
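
The last note, about ‘util’, is easier to see with numbers. The sketch below uses a made-up trace: an imaginary GPU with 80 SMs, sampled 10 times in one second, where a small kernel keeps only 4 SMs busy for 8 of the samples. The function names and the trace are assumptions for illustration, not how nvidia-smi is implemented.

```python
def time_based_util(samples):
    """Fraction of samples in which at least one SM was busy.

    This ignores how much of the GPU was actually in use.
    """
    busy = sum(1 for active_sms in samples if active_sms > 0)
    return busy / len(samples)


def average_sm_occupancy(samples, total_sms):
    """Average fraction of SMs that were busy over the whole period."""
    return sum(samples) / (len(samples) * total_sms)


# Hypothetical trace: number of busy SMs at each of 10 sample points.
TOTAL_SMS = 80
trace = [4, 4, 4, 4, 4, 4, 4, 4, 0, 0]

util = time_based_util(trace)
occupancy = average_sm_occupancy(trace, TOTAL_SMS)
print(f"time-based util: {util:.0%}, average SM occupancy: {occupancy:.0%}")
```

In this trace the time-based figure is 80% while only 4% of the SMs are busy on average, which is the gap the Nvidia team was pointing at.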

The TITAN V GPU

On the second day, many teams from Alibaba (my company) asked Nvidia various questions. Here are some of the questions and answers:

Q: Some Deep Learning compilers such as XLA (from Google) and TVM (from AWS) can compile Python code to a GPU intermediate representation directly. How will Nvidia work with these application-oriented compilers?
A: The Google XLA team will be shut down and moved to optimizing TPU performance only. Nvidia will still focus on libraries such as CUDA/cuDNN/TensorRT and will not build frameworks like Tensorflow or Mxnet.

Q: Many new types of hardware have been launched for Deep Learning: Google’s TPU and ASICs developed by other companies. How will Nvidia keep its cost-performance advantage over these new competitors?
A: ASICs are not programmable. If Deep Learning models change, the ASIC will end up in the trash. For example, the TPU has Relu/Conv instructions, but if a new type of activation function comes along, it will not work anymore. Furthermore, customers can only run TPUs on Google’s cloud, which means they have no choice but to put their data on the cloud.

The DGX server

We also visited the demo room with Nvidia’s state-of-the-art hardware for autonomous driving and deep learning. It was a productive meeting, and we learned a lot.

The car used as the autonomous driving test platform

I am standing before the NVIDIA logo

A CUDA program to test performance of GPU

To test the performance of our Nvidia GPU, I wrote my first CUDA program, which multiplies two vectors of 2GB each:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <mm_malloc.h>  /* _mm_malloc() and _mm_free() */
size_t LOOP = 10;
const size_t COLUMNS = 512 * 1048576;  /* 512M floats per vector */
const size_t BUFF_LEN = 4 * COLUMNS;   /* 2GB per vector */
const unsigned int THREADS = 256;      /* threads per block; a block cannot exceed 1024 */
/* Each thread multiplies one pair of elements, indexed across all blocks. */
__global__ void VecMul(const float *A, const float *B, float *C, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    C[i] = A[i] * B[i];
  }
}
/* Abort with a readable message if a CUDA call failed. */
static void check(cudaError_t error, const char *what) {
  if (error != cudaSuccess) {
    printf("Failed to %s: %s\n", what, cudaGetErrorString(error));
    exit(1);
  }
}
/* Convert a timeval to microseconds. */
static long usec(const struct timeval *t) {
  return t->tv_usec + t->tv_sec * 1000000L;
}
void test_cuda(float *left, float *right, float *result, size_t count) {
  float *left_d, *right_d, *result_d;
  struct timeval before, after, c_before, c_after;
  /* Enough blocks to cover every element, rounding up. */
  unsigned int blocks = (unsigned int)((count + THREADS - 1) / THREADS);
  size_t i;
  check(cudaMalloc((void**) &left_d, BUFF_LEN), "malloc left_d");
  check(cudaMalloc((void**) &right_d, BUFF_LEN), "malloc right_d");
  check(cudaMalloc((void**) &result_d, BUFF_LEN), "malloc result_d");
  gettimeofday(&before, NULL);
  check(cudaMemcpy(left_d, left, BUFF_LEN, cudaMemcpyHostToDevice), "copy left");
  check(cudaMemcpy(right_d, right, BUFF_LEN, cudaMemcpyHostToDevice), "copy right");
  gettimeofday(&c_before, NULL);
  for (i = 0; i < LOOP; i++) {
    VecMul<<<blocks, THREADS>>>(left_d, right_d, result_d, count);
  }
  check(cudaGetLastError(), "launch VecMul");
  /* Kernel launches are asynchronous: wait for them before reading the clock. */
  check(cudaDeviceSynchronize(), "run VecMul");
  gettimeofday(&c_after, NULL);
  check(cudaMemcpy(result, result_d, BUFF_LEN, cudaMemcpyDeviceToHost), "copy result");
  gettimeofday(&after, NULL);
  printf("CUDA compute:\t%ld\n", usec(&c_after) - usec(&c_before));
  printf("CUDA:\t%ld\n", usec(&after) - usec(&before));
  /* Print the first and last few results as a spot check. */
  for (i = 0; i < 4; i++) {
    printf("[Sample: %f]\n", result[i]);
  }
  for (i = count - 4; i < count; i++) {
    printf("[Sample: %f]\n", result[i]);
  }
  cudaFree(left_d);
  cudaFree(right_d);
  cudaFree(result_d);
}
int main(int argc, char *argv[]) {
  float *left = (float*)_mm_malloc(BUFF_LEN, 32);
  float *right = (float*)_mm_malloc(BUFF_LEN, 32);
  float *result = (float*)_mm_malloc(BUFF_LEN, 32);
  size_t count = BUFF_LEN / sizeof(float);
  size_t i;
  if (left == NULL || right == NULL || result == NULL) {
    printf("Failed to allocate host buffers!\n");
    return 1;
  }
  if (argc > 1) {
      LOOP = atol(argv[1]);
  }
  for (i = 0; i < count; i++) {
    left[i] = 1.23456f;
    right[i] = 1.23456f;
    result[i] = 1.23456f;
  }
  test_cuda(left, right, result, count);
  /* Memory from _mm_malloc() must be released with _mm_free(). */
  _mm_free(left);
  _mm_free(right);
  _mm_free(result);
  return 0;
}

Luckily, it works :)
The cudaMemcpy() calls took about 1 second, but the multiplication of the two vectors took only 80 microseconds (even with the default LOOP of 10). Therefore I reckon a GPU is perfect for training in Machine Learning, but less promising for prediction once the model has been built, because copying the data to and from the GPU dominates the run time.
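
A quick sanity check on the 80-microsecond figure is worth doing. The arithmetic below assumes each pass reads both 2GB inputs and writes one 2GB output, and it uses 500 GB/s as an illustrative device memory bandwidth (an assumption, not a measurement of our GPU).

```python
GIB = 2**30

vector_bytes = 2 * GIB              # each vector is 2GB (BUFF_LEN)
bytes_per_pass = 3 * vector_bytes   # read A, read B, write C
loops = 10                          # default LOOP
measured_seconds = 80e-6            # reported time for the kernel loop

# Bandwidth the GPU would need if all passes really finished in 80 us.
implied_tb_per_s = bytes_per_pass * loops / measured_seconds / 1e12

# Rough expectation at an assumed (illustrative) 500 GB/s.
assumed_bandwidth = 500e9
expected_seconds = bytes_per_pass * loops / assumed_bandwidth

print(f"implied bandwidth: {implied_tb_per_s:.0f} TB/s")
print(f"expected at 500 GB/s: {expected_seconds:.2f} s")
```

The implied bandwidth comes out at hundreds of TB/s, far beyond what GPU datasheets list, so the 80 microseconds most likely reflects launch overhead rather than completed work: kernel launches are asynchronous, and the host only waits if it synchronizes, for example with cudaDeviceSynchronize(). Under the assumed bandwidth the 10 passes would take on the order of a tenth of a second, which is still well below the time spent in cudaMemcpy(), so the copies remain the larger cost.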

Note: Use cudaMalloc()/cudaMemcpy() instead of malloc()/memcpy() from the Standard C Library for the buffers passed to the kernel. The kernel needs pointers to GPU memory, or else VecMul<<<>>> will not run.

Robin on Linux
