Deep Learning

A strange problem in RegNetY-32G

I have been using RegNetY in DongNiao for almost two years. Previously I only used small models such as RegNetY-8G. But after getting a computer with an RTX 3080 Ti, I started to use the biggest model in the original paper: RegNetY-32G.

The RegNetY-32G model takes a long time to train, so I wanted to use mixed precision in the process. However, after switching to “float16”, the training program always crashed with an overflow error:

...
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320                      
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.012e-320                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.06e-321                                                                                                                                 
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.53e-321                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.265e-321                       
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.3e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.16e-322                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6e-322                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1e-323                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5e-324                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0
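
The scale values in this log are halved on every skipped step. As a small illustration in plain Python (this is not the training framework's own code, and `halve_until_zero` is a hypothetical helper), repeatedly halving a 64-bit float eventually reaches 5e-324, the smallest positive value a double can hold, and the next halving rounds to exactly 0.0. That matches the last lines of the log:

```python
def halve_until_zero(scale):
    """Halve `scale` repeatedly and record every value until it becomes 0.0."""
    history = [scale]
    while scale > 0.0:
        scale /= 2.0  # same reduction factor as seen in the log
        history.append(scale)
    return history


print(halve_until_zero(2e-323))  # [2e-323, 1e-323, 5e-324, 0.0]
```

Once the scale is 0.0, the scaled loss and all of its gradients are zero, so training cannot recover without restarting with a sane scale.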

At first, I suspected that the bigger model couldn’t handle a large learning rate (I had used 8.0 for a long time) with “float16” training, so I reduced the learning rate to just 1e-1. The model stopped reporting overflow errors, but the loss didn’t converge and stayed constant at about 9.

Then I had no choice but to adjust the parameters step by step to find a set of hyper-parameters that would converge. Finally, I found the reason: enabling the Squeeze-and-Excitation block in RegNetY makes the model harder to converge. The exponential operation in the Sigmoid function might be the cause, since “float16” can’t always represent exponentially growing values properly.
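
As a rough illustration of that suspicion (a sketch, not a diagnosis of this particular model): the largest finite float16 value is 65504, so exp(x) no longer fits once x exceeds about 11.09. The snippet below checks this with Python's standard `struct` module, whose `e` format packs IEEE 754 half-precision values. `fits_in_float16` is a hypothetical helper name.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value: float) -> bool:
    """Return True if `value` can be stored as a finite float16."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        # struct raises OverflowError when the value is too large for float16
        return False


# Largest exponent whose exp() still fits in float16
limit = math.log(FLOAT16_MAX)
print("exp(x) overflows float16 for x above", round(limit, 2))

for x in (10.0, 11.0, 12.0):
    print(x, fits_in_float16(math.exp(x)))
```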

The solution is simple: just disable the Squeeze-and-Excitation block in RegNetY:

    cfg.MODEL.TYPE = "regnet"
    # RegNetY-32.0GF
    cfg.REGNET.DEPTH = 20
    cfg.REGNET.SE_ON = False
    cfg.REGNET.W0 = 232
    cfg.REGNET.WA = 115.89
    cfg.REGNET.WM = 2.53
    cfg.REGNET.GROUP_W = 232
    cfg.BN.NUM_GROUPS = 4
    cfg.MODEL.NUM_CLASSES = config["num_classes"]
    net = model_builder.build_model()

I may need to use Hard Sigmoid in the Squeeze-and-Excitation block for the experiment in the future.
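
For reference, here is a minimal sketch of one common Hard Sigmoid definition, clip(x / 6 + 0.5, 0, 1), next to the regular Sigmoid. This piecewise-linear variant is an assumption on my side, and whether it actually fixes the convergence issue in RegNetY is still untested.

```python
import math


def sigmoid(x: float) -> float:
    """Regular sigmoid; needs an exponential."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear approximation; only needs scale, shift and clip."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))


for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(x, round(sigmoid(x), 4), hard_sigmoid(x))
```

The hard version saturates exactly at 0 and 1 outside [-3, 3] and never computes an exponential.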

Finding the cause of ‘NaN’ results in model training

I intended to use distillation to train my model. The plan was:

1. Train model A and model B with the same code and the same dataset
2. Predict the dataset with model A and model B, and store the average of their results
3. Use the average prediction as the target of a new training process
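
The averaging in the second step can be sketched as follows. The function name and the plain-list representation are assumptions for illustration; the real predictions would come from the two trained models.

```python
def average_predictions(pred_a, pred_b):
    """Element-wise mean of two per-class prediction vectors."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same number of classes")
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Made-up class probabilities from model A and model B for one sample
soft_target = average_predictions([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
print(soft_target)
```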

Steps 1 and 2 were successful. But when I ran the new training process, it reported the loss as “NaN” after some steps.

To find the reason, I started to print the “average prediction results” at every step. At first they looked normal, but after a while I found that some of the inputs contained “NaN”.

Why is there “NaN” in the “average prediction results”? My guess is that some samples are so rare (or unusual) that the model gives an unreliable output for them. It’s quite easy to just ignore them:

# np.isfinite() is False for NaN as well as for +/-inf
if not np.isfinite(label).all():
  # Drop the corresponding sample
  return None

Now the distillation training can continue.
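
Put together, the filtering might look like the sketch below. It uses plain Python lists and `math.isfinite` instead of NumPy to stay self-contained, and `drop_non_finite` is a hypothetical helper. Note that a NaN from either model survives averaging, which is one way a NaN could end up in the stored targets.

```python
import math


def drop_non_finite(samples):
    """Keep only (sample_id, label) pairs whose label values are all finite."""
    return [
        (sample_id, label)
        for sample_id, label in samples
        if all(math.isfinite(value) for value in label)
    ]


# Averaging a NaN with a normal number still gives NaN
averaged = (float("nan") + 0.5) / 2.0

samples = [
    ("a", [0.6, 0.4]),           # normal soft label
    ("b", [averaged, 0.5]),      # contains NaN
    ("c", [float("inf"), 0.0]),  # contains infinity
]
clean = drop_non_finite(samples)
print([sample_id for sample_id, _ in clean])  # ['a']
```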

Understanding Transformer

The Transformer neural network was first introduced in 2017, in the paper Attention Is All You Need. One year later, BERT appeared. Last year I gave a simple presentation about the Transformer and BERT at my previous company, shown below:

Transformer and BERT from Hao(Robin) Dong

A couple of days ago I started to review the Transformer paper, and found that I need to recommend the article The Illustrated Transformer again. This article really helped me understand a lot of details of the Transformer.

But one question still jumped out at me: what is the decoder in the Transformer for? How does information flow from the encoder to the decoder? After thinking for quite a while, I figured it out: the Transformer was originally designed for the machine translation task. The encoder “transforms” a sentence in the source language into a set of Keys and Values; the decoder “transforms” a word in the target language into a Query. By combining the Query with the Keys and Values, it gets a vector, which is effectively the embedding of the next word in the target language.

Here is a diagram I drew. I hope it clears up my own confusion.

“Ich bin ein guter Kerl” in German means “I am a good guy”. By encoding all the German words into a set of Keys and Values, and decoding “good” into a Query, the Transformer can finally output the embedding vector of “guy”.
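
The flow described above can be sketched as scaled dot-product attention with a single Query, softmax(q·K / sqrt(d_k))·V, in plain Python. The tiny 2-dimensional vectors are made up for illustration; in a real Transformer the Keys, Values and Query come from learned projections, and there are multiple heads and layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """One decoder Query attends over the encoder's Keys and Values."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Hypothetical encoder outputs for two source words
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
# Hypothetical decoder Query that is closer to the first Key
query = [1.0, 0.0]

output, weights = cross_attention(query, keys, values)
print(weights, output)
```

The output is a weighted mix of the Values, dominated by the Value whose Key matches the Query best.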

Use mxnet to classify images of birds (third episode)

After using a CNN in the previous article, it still couldn’t recognize the correct name of a bird if the little creature stood in a corner (instead of the center) of the picture. So I started to think about the problem: how can I make the neural network ignore the position of the bird in the picture and focus only on its presence? Eventually I remembered “max pooling”:

From: http://mxnet.io/tutorials/python/mnist.html

By choosing the maximum feature value from each 2×2 window, it amplifies the most important feature without being affected by the background. For example, if we split a picture into a 2×2 grid (4 cells) and the bird only stands in the first cell, “max pooling” will choose only the first cell for further processing. The trees, pools, leaves and other trivial details in the other three cells will be omitted.
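
To make this concrete, here is a minimal pure-Python sketch of 2×2 max pooling with stride 2 (the helper name and the toy feature maps are made up for illustration). A strong feature gives the same pooled output whether it sits at the top-left or the bottom-right of its 2×2 window:

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# The "bird" (value 9) sits at the top-left of the first window
bird_top_left = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# The same "bird" shifted by one pixel diagonally
bird_shifted = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(max_pool_2x2(bird_top_left))  # [[9, 0], [0, 0]]
print(max_pool_2x2(bird_shifted))   # [[9, 0], [0, 0]]
```
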
Then I modified the structure of the CNN again:

import mxnet as mx

def convolution_network():
    data = mx.sym.Variable('data')
    # Block 1: convolution -> batch norm -> ReLU -> 2x2 max pooling
    conv1 = mx.sym.Convolution(data=data, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn1 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn1")
    relu1 = mx.sym.Activation(data=bn1, act_type="relu")
    pool1 = mx.sym.Pooling(data=relu1, pool_type="max", kernel=(2,2), stride=(2,2))
    # Block 2: same layout as block 1
    conv2 = mx.sym.Convolution(data=pool1, kernel=(12, 12), stride=(5, 5), num_filter=128)
    bn2 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=0.9, name="bn2")
    relu2 = mx.sym.Activation(data=bn2, act_type="relu")
    pool2 = mx.sym.Pooling(data=relu2, pool_type="max", kernel=(2,2), stride=(2,2))
    # Classifier with 3 output classes
    fc3 = mx.sym.FullyConnected(data=pool2, num_hidden=3)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')

and used “0.3” as my learning rate, as “0.3” worked better against overfitting.

For one week (the Chinese New Year holiday), I studied “Neural Networks and Deep Learning”. This book is simply awesome! It explained and resolved a lot of my doubts about neural networks. In the third chapter, the author, Michael Nielsen, suggests a method for defeating overfitting that really enlightened me: artificially expanding the training data. His example is rotating an MNIST handwritten digit image by 15 degrees:

In my case, I decided to crop different parts of each bird picture if the picture is not square:

by using the Python Imaging Library (PIL):

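from PIL import Image

# NOTE: `edge` (the side length of the output thumbnail) is assumed to be
# defined elsewhere in the script.

def crop_image(origin, imgs, box):
    # Crop the given box, shrink it to at most edge x edge, and collect it
    result = origin.crop(box)
    result.thumbnail((edge, edge), Image.NEAREST)
    imgs.append(result)

def crop_and_append_image(img, imgs):
    # Use the real image size; getbbox() only covers non-zero pixels
    width, height = img.size
    if width > height:
        sub = width - height
        # Centered square crop
        crop_image(img, imgs, (sub // 2, 0, height + sub // 2, height))
        if sub >= 80:
            # Two extra crops, shifted 40 pixels to the left and to the right
            crop_image(img, imgs, (sub // 2 - 40, 0, height + sub // 2 - 40, height))
            crop_image(img, imgs, (sub // 2 + 40, 0, height + sub // 2 + 40, height))
    elif height > width:
        sub = height - width
        # Centered square crop
        crop_image(img, imgs, (0, sub // 2, width, width + sub // 2))
        if sub >= 80:
            # Two extra crops, shifted 40 pixels up and down
            crop_image(img, imgs, (0, sub // 2 - 40, width, width + sub // 2 - 40))
            crop_image(img, imgs, (0, sub // 2 + 40, width, width + sub // 2 + 40))
    else:
        # Already square: just shrink it
        img.thumbnail((edge, edge), Image.NEAREST)
        imgs.append(img)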
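
To see which regions get cropped, the box arithmetic can be isolated from the image handling. The sketch below is a hypothetical helper (not part of my training script); its `shift` and `min_slack` parameters mirror the constants 40 and 80 in the code above, and it needs no imaging library.

```python
def square_crop_boxes(width, height, shift=40, min_slack=80):
    """Return the (left, upper, right, lower) boxes that would be cropped."""
    if width == height:
        return [(0, 0, width, height)]
    sub = abs(width - height)  # extra pixels along the longer side
    side = min(width, height)  # side length of each square crop
    offsets = [sub // 2]       # the centered crop always exists
    if sub >= min_slack:
        # Enough room for two more crops shifted to either side
        offsets += [sub // 2 - shift, sub // 2 + shift]
    if width > height:
        return [(o, 0, o + side, side) for o in offsets]
    return [(0, o, side, o + side) for o in offsets]


# A 300x200 landscape picture yields three 200x200 crops
print(square_crop_boxes(300, 200))
# [(50, 0, 250, 200), (10, 0, 210, 200), (90, 0, 290, 200)]
```

So one wide picture turns into three training samples, each showing the bird at a different horizontal position.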

The effect of using “max pooling” and “expanding training data” is significant:
