After my former colleague JianMei prepared about 1 TB of bird sound recordings (every mp3 file is converted to an image using a spectrogram and split into chunks of 2.5 seconds each, so that in the end every file is a 1250×78 multi-dimensional array), I started training with almost the same code I had used for bird image classification.

The train accuracy rose very slowly, so I added a line of code to normalize every input sample:

image = (image - image.mean()) / image.std()
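As a minimal sketch of that per-sample normalization, assuming the chunks are NumPy arrays: the function name `normalize`, the `eps` guard, and the random stand-in data are my own additions for illustration, not part of the original training code.

```python
import numpy as np


def normalize(image, eps=1e-8):
    """Standardize one sample to zero mean and (roughly) unit variance."""
    image = np.asarray(image, dtype=np.float32)
    # eps is an assumed safeguard: a constant (e.g. silent) chunk has std == 0,
    # and dividing by zero would fill the sample with NaN.
    return (image - image.mean()) / (image.std() + eps)


# Synthetic stand-in for one 1250x78 spectrogram chunk (not real data).
rng = np.random.default_rng(0)
chunk = rng.uniform(0.0, 255.0, size=(1250, 78))
normalized = normalize(chunk)
```

The statistics are computed per sample rather than over the whole dataset, which matches the one-line version above.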

After that, the train accuracy rose faster, but the eval accuracy was still quite low.

To find the root of the problem, I started by training on only two classes: “Black-capped Donacobius” and “Blue-eared Barbet”.

Eval accuracy:0.965278 | Train accuracy:1.000000

The result seemed pretty good, so I increased the number of classes to 20:

Eval accuracy:0.211102 | Train accuracy:0.840278
Eval accuracy:0.203834 | Train accuracy:0.885417
Eval accuracy:0.191514 | Train accuracy:0.904247
Eval accuracy:0.193245 | Train accuracy:0.916667
Eval accuracy:0.210894 | Train accuracy:0.916667
Eval accuracy:0.190269 | Train accuracy:0.932292

Are some types of birds hard for a deep learning model to generalize to? I began to consider how to find these “hard to train” bird types: maybe start from 2 classes and increase the number of classes step by step, then draw a few curves of the train accuracy and eval accuracy…

Suddenly I realized that I had applied normalization only to the training samples, not to the evaluation samples!

What a stupid mistake. It cost me a whole stuffy afternoon for nothing. I really should remember this lesson: whatever you do to the training samples, do to the evaluation samples too, except dropout.
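One way to make this mistake harder to repeat is to route both pipelines through a single function, so that only the random parts depend on a `training` flag. The sketch below is an assumption about how that could look in NumPy. `preprocess`, `dropout`, and `prepare_batch` are hypothetical names, and dropout is applied directly to the inputs only to keep the example short; in a real network it would sit between layers.

```python
import numpy as np


def preprocess(image, eps=1e-8):
    """Deterministic step shared by training and evaluation."""
    image = np.asarray(image, dtype=np.float32)
    return (image - image.mean()) / (image.std() + eps)


def dropout(x, rate, rng):
    """Inverted dropout: zero a random subset and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def prepare_batch(images, training, rate=0.5, rng=None):
    # Normalization always runs, whatever the mode.
    batch = np.stack([preprocess(img) for img in images])
    # Dropout is the one step that is skipped at evaluation time.
    if training:
        if rng is None:
            rng = np.random.default_rng()
        batch = dropout(batch, rate, rng)
    return batch


# Synthetic stand-ins for three spectrogram chunks (not real data).
data_rng = np.random.default_rng(0)
images = [data_rng.uniform(0.0, 255.0, size=(1250, 78)) for _ in range(3)]

train_batch = prepare_batch(images, training=True, rng=np.random.default_rng(1))
eval_batch = prepare_batch(images, training=False)
```

Because the evaluation path cannot skip `preprocess` without also skipping it for training, a mismatch like the one above would show up immediately in the train accuracy as well.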