Checking an abnormal loss value when training a new model

Yesterday I wrote a TensorFlow program to train a ResNet-50 model on the CIFAR-100 dataset. But when training began, I saw that the classification ‘loss’ was abnormally large and didn’t decrease at all:

At first, I thought the dataset-processing code might be wrong. But after printing the data to the console, the loaded input data looked fine. Then I printed the values of all tensors right after the model was initialized, and those values looked correct too.
With no other options left, I began to check the initializer in the TensorFlow code:

If the loss is too big, maybe I could adjust the initial values of the tensors in the model? So I changed ‘mean’ from ‘0’ to ‘0.1’ for ‘slim.conv2d’:

But the loss became even worse:

I had to change ‘mean’ and ‘stddev’ again:

This time, the loss looked correct.

This was the first time I had seen initial values make such a difference to training.
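To illustrate why the initializer's ‘mean’ can matter so much, here is a minimal sketch in plain Python. It is not the original TensorFlow code: a small fully connected ReLU stack stands in for ResNet-50, and the function name, width, depth, and the specific mean/stddev values are all assumptions chosen for illustration. It only shows that a non-zero weight mean can make activations (and therefore the initial loss) grow layer by layer.

```python
import random


def mean_abs_output(mean, stddev, width=64, depth=4, seed=0):
    """Push one random input through `depth` dense+ReLU layers whose weights
    are drawn from a Gaussian(mean, stddev); return the mean |activation|."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(width)]  # toy inputs in [0, 1)
    for _ in range(depth):
        y = []
        for _ in range(width):
            s = sum(rng.gauss(mean, stddev) * v for v in x)
            y.append(max(s, 0.0))  # ReLU
        x = y
    return sum(abs(v) for v in x) / width


# Hypothetical settings, not the values used in the original experiment.
zero_mean = mean_abs_output(mean=0.0, stddev=0.01)
shifted_mean = mean_abs_output(mean=0.1, stddev=0.01)
print("mean=0.0:", zero_mean)
print("mean=0.1:", shifted_mean)
```

With a positive mean, every weight pushes the sum in the same direction, so each layer multiplies the activation scale instead of keeping it roughly constant.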

An example of running an operation before fetching data in TensorFlow

In TensorFlow, what should we do if we want to run something before fetching data (for example, when using a queue)? Here is an example I tested myself:

It will print

We have successfully added an operation that runs before an item is enqueued.

Why doesn’t my model converge?

I wrote a TensorFlow program to train ResNet-50 on the CIFAR-100 dataset. But when I ran it, the loss stayed at about 4.5~4.6 forever:

After I changed the model (from ResNet to a fully connected net), the optimizer (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 all the way down to 1e-7), the behavior didn’t change at all.
Finally, I checked the loss and the output vector step by step, and found that the problem was not in the model but in the dataset code:

Every batch of data had the same pictures and the same labels! That’s why the model didn’t converge. I should have used ‘i’ instead of ‘self.pos’ as the index to fetch data and labels.
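The kind of bug described above can be sketched as follows. This is a hypothetical reconstruction, not the original dataset code: the class name `ToyDataset`, the `next_batch` method, and the `buggy` switch are assumptions made for illustration. Only the names ‘i’ and ‘self.pos’ come from the description.

```python
class ToyDataset:
    """Hypothetical stand-in for the dataset class described in the post."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels
        self.pos = 0  # start index of the next batch

    def next_batch(self, batch_size, buggy=False):
        batch_images, batch_labels = [], []
        for i in range(self.pos, self.pos + batch_size):
            # Bug: indexing with self.pos repeats one sample batch_size times.
            # Fix: index with the loop variable i.
            idx = (self.pos if buggy else i) % len(self.images)
            batch_images.append(self.images[idx])
            batch_labels.append(self.labels[idx])
        self.pos = (self.pos + batch_size) % len(self.images)
        return batch_images, batch_labels


images = ["img%d" % n for n in range(8)]
labels = list(range(8))

buggy_images, buggy_labels = ToyDataset(images, labels).next_batch(4, buggy=True)
fixed_images, fixed_labels = ToyDataset(images, labels).next_batch(4)
print(buggy_labels)  # [0, 0, 0, 0]
print(fixed_labels)  # [0, 1, 2, 3]
```

Printing one batch of labels like this before training is a cheap check that would catch this class of bug early.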

So in deep learning, problems come not only from models and hyper-parameters, but also from the dataset, or from faulty code…
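One sanity check worth noting: a loss stuck near 4.6 on a 100-class problem is what softmax cross-entropy gives when the model predicts every class with equal probability, which suggests it is learning nothing from the data. A small check of that arithmetic, assuming the loss is standard cross-entropy with the natural logarithm:

```python
import math

num_classes = 100  # CIFAR-100 has 100 classes
# Cross-entropy when the predicted probability of the true class is 1/100.
uniform_loss = -math.log(1.0 / num_classes)
print(round(uniform_loss, 3))  # 4.605
```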

A problem with using slim.batch_norm() in TensorFlow (part two)

In the previous article, I found the cause of the problem. But how to resolve it for multi-GPU training was still an open question. Following the suggestion in this GitHub issue, I tried two ways to fix the problem:

First, I rewrote my averaged-gradients training code, modeling it on slim.learning.create_train_op():

Unfortunately, this didn’t work at all. The inference results were still a mess.

Then I tried another way, combining asynchronous-gradient training with slim.learning.create_train_op():

Now inference works very well! And training is a little faster than with averaged-gradients training, because the averaging operation has to wait for the gradients from all GPUs.
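To make the waiting cost concrete, here is a minimal plain-Python sketch of the averaging step. It is not the original training code: the function name `average_gradients` and the use of plain floats as gradients are assumptions for illustration. The point is that the average for each variable cannot be computed until every GPU has delivered its gradient, whereas asynchronous training applies each GPU's gradients as soon as they are ready.

```python
def average_gradients(tower_grads):
    """tower_grads[g][v] is the gradient of variable v computed on GPU g.

    Returns one averaged gradient per variable. All towers must have
    finished before this can run, which is the synchronization cost.
    """
    num_towers = len(tower_grads)
    return [sum(per_var) / num_towers for per_var in zip(*tower_grads)]


# Two hypothetical GPUs, two variables each.
tower_grads = [[1.0, 2.0], [3.0, 4.0]]
averaged = average_gradients(tower_grads)
print(averaged)  # [2.0, 3.0]
```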