GPU

A powerful tool to monitor details of Intel CPU

While researching PCIe 3.0 versus PCIe 4.0, I started to wonder about the real-world scenario: what is the actual bandwidth between CPU and GPU when we are training a deep learning model?

Eventually I found this tool: pcm (Intel's Performance Counter Monitor)

After building it, I ran “sudo ./bin/pcm” and got this:

I was pleased that this tool even shows the IPC (Instructions Per Cycle) and the L2/L3 hit ratios. But the metric I care about most is the PCIe bandwidth. Where is it?

I tried “sudo bin/pcm-pcie” but it said my desktop CPU (i5-12400) is not supported:

The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS                     : yes
Package thermal spec power: 65 Watt; Package minimum power: 0 Watt; Package maximum power: 0 Watt;

INFO: Linux perf interface to program uncore PMUs is present

For non-CSV mode delay < 1.0s does not make a lot of practical sense. Default delay 1s is used. Consider to use CSV mode for lower delay values
Update every 1 seconds

Detected 12th Gen Intel(R) Core(TM) i5-12400 "Intel(r) microarchitecture codename Alder Lake" stepping 5 microcode level 0x2c
Jaketown, Ivytown, Haswell, Broadwell-DE, Skylake, Icelake, Snowridge and Sapphirerapids Server CPU is required for this tool! Program aborted
Cleaning up
 Closed perf event handles
 Zeroed uncore PMU registers

Then a new idea came to mind: all my CPU does in this application is read data from files and push it to the GPU, so the memory read bandwidth is approximately the PCIe write bandwidth!

To verify this idea, I changed my model from “tf_efficientnetv2_s_in21k” to “tf_mobilenetv3_small_075” (a smaller model lets the CPU pump more data into the GPU).

As we can see, the memory READ bandwidth increased from “1.36GB” to “13.69GB”. This should be roughly equal to the PCIe bandwidth (since the data read from memory only goes to the GPU).

It seems we really do need PCIe 4.0 for deep learning 🙂
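
As a rough sanity check on that conclusion, the theoretical one-direction bandwidth of an x16 link can be worked out from the per-lane transfer rate and the 128b/130b line encoding used by both PCIe 3.0 and 4.0. The sketch below is only illustrative: the function name is made up, the x16 link width is an assumption about the GPU slot, and it treats the 13.69 figure above as GB per second.

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes=16, encoding=128 / 130):
    """Theoretical one-direction bandwidth in GB/s (hypothetical helper).

    gt_per_s: transfer rate per lane in GT/s (8 for PCIe 3.0, 16 for PCIe 4.0)
    lanes:    link width, assumed to be x16 here
    encoding: fraction of transferred bits that carry payload (128b/130b)
    """
    return gt_per_s * encoding * lanes / 8  # divide by 8: bits -> bytes


pcie3 = pcie_bandwidth_gb_s(8.0)
pcie4 = pcie_bandwidth_gb_s(16.0)
measured = 13.69  # memory READ figure from the post, assumed to be GB/s

print(f"PCIe 3.0 x16: {pcie3:.2f} GB/s")
print(f"PCIe 4.0 x16: {pcie4:.2f} GB/s")
print(f"measured / PCIe 3.0 limit: {measured / pcie3:.0%}")
```

Under these assumptions the measured read rate is already close to what a PCIe 3.0 x16 link could carry in theory, before any protocol overhead is subtracted, which is consistent with the conclusion above.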

Strange error from Nvidia’s apex library

apex is a mixed-precision training library from Nvidia. I have been using it since I got an RTX 3080 Ti GPU. A few days ago, I started to use RegNetY-32GF (previously I had only used RegNetY models smaller than 16GF). After an accidental interruption, I tried to resume the training but it reported:

Traceback (most recent call last):
  File "train.py", line 353, in <module>
    train(args, train_loader, eval_loader)
  File "train.py", line 220, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 401, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([28, 3712, 10, 10], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3712, 3712, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x55d2a620ff60
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
output: TensorDescriptor 0x55d2a6215310
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 28, 3712, 10, 10, 
    strideA = 371200, 100, 10, 1, 
weight: FilterDescriptor 0x7fd9e806f1e0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 3712, 3712, 1, 1, 
Pointer addresses: 
    input: 0x7fd73fde3a00
    output: 0x7fd746abb600
    weight: 0x7fd761b5de00

This error looks quite scary, so my first thought was that the training environment had broken! So I downloaded the newest GPU driver and pulled the latest Docker container for PyTorch. But the error persisted.

On second thought, I began to suspect that apex couldn’t handle very large models… (what was I thinking?) So I modified my code to use “torch.cuda.amp” instead of “apex.amp”, following the documentation. Fortunately, the error disappeared, but I had to use a smaller batch size. It looks like “torch.cuda.amp” doesn’t save as much GPU memory as “apex.amp”.

However, the story doesn’t end here. Just before writing this article, I simply used a smaller batch size for my old code with “torch.cuda.amp”, and it worked well…

All in all, the terrible error above was simply caused by insufficient GPU memory.
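
To see why batch size matters here, the tensor shapes printed in the repro snippet can be turned into byte counts. This is only a back-of-the-envelope sketch for that single convolution; the helper name is made up, and real training also holds gradients, optimizer state, and cuDNN workspace that are not counted here.

```python
from math import prod

BYTES_PER_HALF = 2  # torch.half is a 16-bit float


def tensor_bytes(shape, bytes_per_element=BYTES_PER_HALF):
    """Size in bytes of a dense tensor with the given shape (hypothetical helper)."""
    return prod(shape) * bytes_per_element


# Shapes copied from the repro snippet in the error message above.
weight = tensor_bytes([3712, 3712, 1, 1])
print(f"conv weight: {weight / 2**20:.1f} MiB (independent of batch size)")

# The activation tensor grows linearly with the batch dimension.
for batch in (28, 14):
    act = tensor_bytes([batch, 3712, 10, 10])
    print(f"batch {batch}: input activation {act / 2**20:.1f} MiB")
```

Every activation kept for the backward pass scales with the batch dimension in the same way, so halving the batch size roughly halves that part of the memory footprint.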

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: “In-Datacenter Performance Analysis of a Tensor Processing Unit”
Application
NN (Neural Network) training typically uses floating point (16-bit or 32-bit); a step called quantization then transforms the floating-point numbers into narrow integers, often just 8 bits, which are usually good enough for inference.
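
As an illustration of the quantization step, here is a minimal sketch of symmetric linear quantization to signed 8-bit integers. This is one common scheme, chosen here as an assumption; it is not claimed to be the exact method used for the TPU, and the function names and sample weights are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers."""
    return [x * scale for x in q]


weights = [0.82, -1.27, 0.003, 0.5]         # hypothetical trained weights
q, scale = quantize(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("restored   :", [round(r, 3) for r in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the sense in which 8 bits are "good enough" for many inference workloads.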
MLPs (Multi-Layer Perceptrons), CNNs (Convolutional Neural Networks), and RNNs (Recurrent Neural Networks): these three types of NN represent 95% of the NN inference workload in Google's datacenters. Therefore, the TPU mainly focuses on them.

As we can see, CNNs are usually compute-dense NNs, which suits the TPU better.

TPU has 25 times as many MACs (Multiply and Accumulate) and 3.5 times as much on-chip memory as the K80 GPU.
Architecture
The TPU was designed to be a coprocessor on the PCIe I/O bus, closer in spirit to an FPU (floating-point unit) than to a GPU.

The parameters of the NN model (weights) come from off-chip memory (8 GB DDR3 DRAM) into the Weight FIFO, and then flow into the MMU (Matrix Multiply Unit). The requests (samples to run inference on) come from PCIe into the Unified Buffer, and finally flow into the MMU as well.
Even the “Activation” and “Pooling” steps of a CNN have been implemented in hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.
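
To put the size of the MMU into numbers, here is a small worked calculation. The 700 MHz clock is taken from the paper and does not appear elsewhere in this post, so treat it as an outside assumption; each MAC is counted as two operations (one multiply and one add).

```python
ROWS, COLS = 256, 256
CLOCK_HZ = 700e6            # clock rate reported in the paper (assumption here)
OPS_PER_MAC = 2             # one multiply plus one add

macs = ROWS * COLS
ops_per_cycle = macs * OPS_PER_MAC
peak_tops = ops_per_cycle * CLOCK_HZ / 1e12

print(f"MAC units        : {macs}")
print(f"8-bit ops / cycle: {ops_per_cycle}")
print(f"peak throughput  : {peak_tops:.2f} TeraOps/s")
```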

According to this floor plan, we can imagine that the UB (Unified Buffer) and the MMU probably consume most of the TPU's energy.

TPU instructions follow the CISC tradition, and there are only about a dozen of them, including “Read_Host_Memory”, “Read_Weights”, “MatrixMultiply”, “Activate”, etc. Recalling how much code we need to write to implement an efficient activation function, we can imagine the speedup from issuing a single “Activate” instruction on the TPU.
The paper says the TPU is a kind of systolic array. But what is a systolic array? Here is the explanation: a systolic array is a network of processors that rhythmically compute and pass data through the system.
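
The "compute and pass data" rhythm can be made concrete with a tiny simulation. The sketch below is a simplified, output-stationary systolic array: each cell keeps one running sum while values of A move right and values of B move down, one hop per clock step. It is an illustration only and not the TPU's exact dataflow; the function name is made up.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of cells."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # running sum held by each cell
    a_reg = [[0] * m for _ in range(n)]   # value of A currently in each cell
    b_reg = [[0] * m for _ in range(n)]   # value of B currently in each cell

    for t in range(n + m + k - 2):        # enough steps to drain the array
        # Visit cells from bottom-right to top-left so that every cell reads
        # what its left/upper neighbour held during the previous step.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:                # left edge: row i of A, delayed by i
                    s = t - i
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:                # top edge: column j of B, delayed by j
                    s = t - j
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in  # one multiply-accumulate per step
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each operand is fetched from the edge of the array once and then reused by every cell it passes through, which is why this layout needs far fewer memory reads than computing every product independently.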
Performance
There are a lot of tables and diagrams that show the peak performance of the TPU. Although the TPU is fast, the result also depends on the compute density of the application. CNNs are the most compute-dense NNs, so they gain the most speed (or TeraOps per second) from the TPU:

The paper doesn’t explain why the GPU is slower than the TPU at inference. The only sentence on this topic is in “8 Discussion”: “GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”. Honestly, I don’t think this is a serious explanation.
The interesting thing is that after Google published this paper, Nvidia's CEO, Jensen Huang, wrote a blog post gently pointing out that the state-of-the-art GPU (Tesla P40) can run inference faster than the TPU. The war between the giants of deep learning is just beginning.

Google TPU from Hao(Robin) Dong

Use mxnet to classify images of birds (first episode)

Recently, I have been trying to classify images of birds using machine learning. The deep learning library I am most familiar with is mxnet, so I used its Python interface to build my Birds-Classification-System.
Since I did not have enough images for all kinds of birds, I collected just three types: “Loggerhead Shrike”, “Anhinga”, and “Eastern Meadowlark”.

Loggerhead Shrike

Anhinga

Eastern Meadowlark

After collecting more than 800 images of the three kinds of birds, I started to write my Python code by following mxnet's “Handwritten Digital Sample” step by step.
First, I used PIL (Python Imaging Library) to preprocess these images, cropping them from rectangles to squares and scaling them to 100 pixels per edge:

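from PIL import Image

edge = 100
......
def process_image(file_name):
    """Center-crop an image to a square, then shrink it to fit edge x edge."""
    img = Image.open(file_name)
    width, height = img.size
    if width > height:
        sub = width - height
        # Trim equal margins from the left and right (integer pixel offsets).
        img = img.crop((sub // 2, 0, height + sub // 2, height))
    elif height > width:
        sub = height - width
        # Trim equal margins from the top and bottom.
        img = img.crop((0, sub // 2, width, width + sub // 2))
    img.thumbnail((edge, edge), Image.NEAREST)
    return img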

Then put all images into a numpy array and label them:

import numpy as np

images = np.array([])
labels = np.array([])
nr_images = 0
......
      for file in files:
        img = process_image(bird_dir + file)
        global images, labels, nr_images
        array = np.asarray(img)
        # Keep only RGB images that came out exactly edge x edge.
        if array.shape == (edge, edge, 3):
          images = np.append(images, array)
          labels = np.append(labels, lab)
          nr_images = nr_images + 1

Now I can build the Convolutional Neural Network model easily with the powerful mxnet. The first convolution layer slides 8×8-pixel windows over each picture with a stride of 2 pixels, which brings out the small features of these birds, such as the black eye mask of the Loggerhead Shrike, the yellow neck of the Eastern Meadowlark, etc.

import mxnet as mx

def convolution_network():
    data = mx.sym.Variable('data')
    # One convolution layer: 8 filters of 8x8 pixels, moved 2 pixels at a time.
    conv1 = mx.sym.Convolution(data=data, kernel=(8, 8), stride=(2,2), num_filter=8)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    relu1 = mx.sym.Activation(data=bn1, act_type="relu")
    # Two fully connected layers; the last one has one output per bird class.
    flatten = mx.sym.Flatten(data=relu1)
    fc1 = mx.sym.FullyConnected(data=flatten, num_hidden=1000)
    relu2 = mx.sym.Activation(data=fc1, act_type="relu")
    fc3 = mx.sym.FullyConnected(data=relu2, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')
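
For a sense of the tensor sizes in this network, the standard output-size formula for a convolution can be applied to the 100-pixel input, the 8×8 kernel, and the stride of 2. The helper name below is made up, and zero padding is assumed because the listing does not set `pad`.

```python
def conv_output_size(input_size, kernel, stride, pad=0):
    """Output length along one axis of a convolution (hypothetical helper)."""
    return (input_size + 2 * pad - kernel) // stride + 1


edge = 100
num_filter = 8

out_edge = conv_output_size(edge, kernel=8, stride=2)
flattened = num_filter * out_edge * out_edge   # features fed into fc1
fc1_weights = flattened * 1000                 # fc1 has 1000 hidden units

print(f"feature map per filter: {out_edge} x {out_edge}")
print(f"flattened features    : {flattened}")
print(f"fc1 weight count      : {fc1_weights}")
```

Almost all of the model's parameters sit in the first fully connected layer, which is worth keeping in mind when tuning it on only 800 images.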

Training the data:

# The pixels were appended in height x width x channel order, so restore that
# shape first and then move the channel axis to the front, as mxnet expects.
images = np.array(images).reshape(nr_images, edge, edge, 3).transpose(0, 3, 1, 2).astype(np.float32)/255
batch_size = 200
train_iter = mx.io.NDArrayIter(images, labels, batch_size, shuffle=True)
mlp = convolution_network()
model = mx.model.FeedForward(
        ctx = mx.gpu(0),
        symbol = mlp,
        num_epoch = 40,
        learning_rate = 0.02)
model.fit(
        X = train_iter,
        batch_end_callback = mx.callback.Speedometer(batch_size, 1))

Training on the GPU is extremely fast: it took only 5 minutes to train on all 800 images, although tuning the parameters of the CNN took me more than 3 days 🙂

At first I used a fully connected neural network; it took a long time to train and was prone to overfitting. After switching to the CNN with BatchNorm() in mxnet, both the training speed and the classification quality improved significantly.
CNN (Convolutional Neural Network) really is an ace in deep learning for images!

A CUDA program to test performance of GPU

To test the performance of our Nvidia GPU, I wrote my first CUDA program, which multiplies two vectors of 2GB each:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <xmmintrin.h>
size_t LOOP = 10;
const size_t COLUMNS = 512 * 1048576;
const size_t BUFF_LEN = 4 * COLUMNS;
// Each thread multiplies one pair of elements, located by its global index.
__global__ void VecMul(const float *A, const float *B, float *C, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    C[i] = A[i] * B[i];
  }
}
float test_cuda(float *left, float *right, float *result, size_t count) {
  float total = 0.0f;
  float *left_d, *right_d, *result_d;
  struct timeval before, after, c_before, c_after;
  int i, error;
  // A block holds at most 1024 threads, so spread the vector over many blocks.
  const unsigned int threads = 256;
  const unsigned int blocks = (unsigned int)((COLUMNS + threads - 1) / threads);
  error = cudaMalloc((void**) &left_d, BUFF_LEN);
  if (error != cudaSuccess) {
    printf("Failed to malloc left_d!\n");
    exit(1);
  }
  error = cudaMalloc((void**) &right_d, BUFF_LEN);
  if (error != cudaSuccess) {
    printf("Failed to malloc right_d!\n");
    exit(1);
  }
  error = cudaMalloc((void**) &result_d, BUFF_LEN);
  if (error != cudaSuccess) {
    printf("Failed to malloc result_d!\n");
    exit(1);
  }
  gettimeofday(&before, NULL);
  cudaMemcpy(left_d, left, BUFF_LEN, cudaMemcpyHostToDevice);
  cudaMemcpy(right_d, right, BUFF_LEN, cudaMemcpyHostToDevice);
  gettimeofday(&c_before, NULL);
  for (i = 0; i < LOOP; i++) {
    VecMul<<<blocks, threads>>>(left_d, right_d, result_d, COLUMNS);
  }
  // Launches are asynchronous; wait for the kernels before stopping the timer.
  cudaDeviceSynchronize();
  gettimeofday(&c_after, NULL);
  cudaMemcpy(result, result_d, BUFF_LEN, cudaMemcpyDeviceToHost);
  gettimeofday(&after, NULL);
  printf("CUDA compute:\t%lu\n", c_after.tv_usec + c_after.tv_sec * 1000000 -
         (c_before.tv_usec + c_before.tv_sec * 1000000));
  printf("CUDA:\t%lu\n", after.tv_usec + after.tv_sec * 1000000 -
         (before.tv_usec + before.tv_sec * 1000000));
  for (i = 0; i < 4; i++) {
    printf("[Sample: %f]\n", result[i]);
  }
  for (i = COLUMNS - 4; i < COLUMNS; i++) {
    printf("[Sample: %f]\n", result[i]);
  }
  cudaFree(left_d);
  cudaFree(right_d);
  cudaFree(result_d);
  return total;
}
int main(int argc, char *argv[]) {
  float *left = (float*)_mm_malloc(BUFF_LEN, 32);
  float *right = (float*)_mm_malloc(BUFF_LEN, 32);
  float *result = (float*)_mm_malloc(BUFF_LEN, 32);
  size_t count = BUFF_LEN / sizeof(float);
  int i;
  if (argc > 1) {
      LOOP = atol(argv[1]);
  }
  for (i = 0; i < count; i++) {
    left[i] = 1.23456;
    right[i] = 1.23456;
    result[i] = 1.23456;
  }
  test_cuda(left, right, result, count);
  // Memory from _mm_malloc() must be released with _mm_free().
  _mm_free(left);
  _mm_free(right);
  _mm_free(result);
}

Luckily, it works :)
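In my run, the cudaMemcpy() calls took about 1 second, while the timed loop of kernel launches took only 80 microseconds (even with the default LOOP of 10). Bear in mind that kernel launches are asynchronous, so a figure that small mostly reflects launch overhead unless the host calls cudaDeviceSynchronize() before the timer stops. Either way, copying the data clearly dominates, so I reckon the GPU is perfect for training in machine learning, but less promising for prediction once the model has been built.

Note: Use cudaMalloc()/cudaMemcpy() instead of malloc()/memcpy() from the standard C library for the buffers passed to the kernel. VecMul<<<>>> can only work on pointers to device memory; with host pointers the program will not run it.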
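
One detail worth spelling out is how a vector this large maps onto CUDA threads. A single block can hold at most 1024 threads on current Nvidia GPUs, so the elements have to be spread over many blocks, and each thread derives its element from its block and thread indices. The arithmetic is sketched below in Python; the helper names and the choice of 256 threads per block are illustrative assumptions.

```python
MAX_THREADS_PER_BLOCK = 1024      # hardware limit on current Nvidia GPUs
COLUMNS = 512 * 1048576           # number of floats per vector, as in the program


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads) covering n elements with one thread each."""
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError("too many threads per block")
    blocks = (n + threads_per_block - 1) // threads_per_block  # round up
    return blocks, threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Element handled by a thread: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


blocks, threads = launch_config(COLUMNS)
print(f"launch <<<{blocks}, {threads}>>>")
print("last element:", global_index(blocks - 1, threads, threads - 1))
```

When the element count is not a multiple of the block size, the last block has spare threads, so the kernel needs a bounds check on the computed index.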

Robin on Linux
