Why doesn’t my model converge?

To train ResNet-50 on the CIFAR-100 dataset, I wrote a program using TensorFlow. But when I ran it, the loss seemed to stay at about 4.5~4.6 forever:

step: 199, loss: 4.61291, accuracy: 0
step: 200, loss: 4.60952, accuracy: 0
step: 201, loss: 4.60763, accuracy: 0
step: 202, loss: 4.62495, accuracy: 0
step: 203, loss: 4.62312, accuracy: 0
step: 204, loss: 4.60703, accuracy: 0
step: 205, loss: 4.60947, accuracy: 0
step: 206, loss: 4.59816, accuracy: 0
step: 207, loss: 4.62643, accuracy: 0
step: 208, loss: 4.59422, accuracy: 0
...
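
A side note on these numbers: if the loss here is a softmax cross-entropy with the natural logarithm (an assumption, since the loss code is not shown), then a model that spreads its probability evenly over the 100 classes scores ln(100), which is right where the log above sits. A minimal check:

```python
import math

num_classes = 100  # CIFAR-100 has 100 fine-grained classes

# Cross-entropy of a uniform prediction: -log(1 / num_classes)
chance_loss = -math.log(1.0 / num_classes)
print(round(chance_loss, 3))  # 4.605
```

In other words, a loss stuck near 4.6 suggests the network is doing no better than guessing.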

After changing the model (from ResNet to a fully connected network), the optimizer (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 all the way down to 1e-7), the behavior didn’t change at all.
Finally, I checked the loss and the output vector step by step, and found that the problem was not in the model but in the dataset code:

    def next_batch(self, batch_size = 64):
        images = []
        labels = []
        for i in range(self.pos, self.pos + batch_size):
            image = self.data['data'][self.pos]  # bug: indexes with self.pos, not i
            image = image.reshape(3, 32, 32)
            image = image.transpose(1, 2, 0)
            image = image.astype(np.float32) / 255.0
            images.append(image)
            label = self.data['fine_labels'][self.pos]  # bug: same index for every sample
            labels.append(label)
        if (self.pos + batch_size) >= CIFAR100_TRAIN_SAMPLES:
            self.pos = 0
        else:
            self.pos = self.pos + batch_size
        return [images, labels]

Every batch contains the same picture and the same label, repeated batch_size times! That’s why the model didn’t converge. I should have used ‘i’ instead of ‘self.pos’ as the index to fetch images and labels.
So in deep learning, problems come not only from models and hyper-parameters, but also from the dataset, or from faulty code…
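
The fix can be sketched as follows. This is a minimal, self-contained version, not the exact program: the `Cifar100Batcher` class, its `num_samples` parameter, and the synthetic `fake` dataset are hypothetical stand-ins so the snippet runs on its own. Clamping the last batch is also an addition here, because once the loop indexes with ‘i’, a final batch that runs past the end of the data would otherwise raise an IndexError.

```python
import numpy as np

CIFAR100_TRAIN_SAMPLES = 50000  # size of the CIFAR-100 training split


class Cifar100Batcher:
    """Hypothetical wrapper; `data` is a dict with 'data' and 'fine_labels'."""

    def __init__(self, data, num_samples=CIFAR100_TRAIN_SAMPLES):
        self.data = data
        self.num_samples = num_samples
        self.pos = 0

    def next_batch(self, batch_size=64):
        images = []
        labels = []
        # Clamp the end so the last batch does not run past the dataset
        # (an addition in this sketch, not part of the original code).
        end = min(self.pos + batch_size, self.num_samples)
        for i in range(self.pos, end):
            image = self.data['data'][i]        # index with i, not self.pos
            image = image.reshape(3, 32, 32)    # flat row -> (channels, height, width)
            image = image.transpose(1, 2, 0)    # -> (height, width, channels)
            image = image.astype(np.float32) / 255.0
            images.append(image)
            labels.append(self.data['fine_labels'][i])
        # Wrap around once the whole dataset has been consumed.
        self.pos = 0 if end >= self.num_samples else end
        return [images, labels]


# Small synthetic stand-in for the real CIFAR-100 dict (assumption).
rng = np.random.default_rng(0)
fake = {
    'data': rng.integers(0, 256, size=(10, 3072), dtype=np.uint8),
    'fine_labels': list(range(10)),
}
batcher = Cifar100Batcher(fake, num_samples=10)
images, labels = batcher.next_batch(batch_size=4)
print(labels)  # [0, 1, 2, 3]
```

With the index fixed, each batch holds batch_size different samples, and consecutive calls walk through the dataset instead of repeating one picture.
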
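
One cheap way to catch this kind of dataset bug early is to count how many distinct samples a batch really contains. The helper below is a hypothetical sketch (the name `distinct_samples` is made up for illustration), assuming a batch is a list of NumPy arrays like the one returned by next_batch:

```python
import numpy as np


def distinct_samples(images):
    """Count distinct images in a batch by comparing their raw bytes."""
    return len({np.asarray(img).tobytes() for img in images})


# Buggy-style batch: the same image repeated 4 times.
same = [np.zeros((32, 32, 3), dtype=np.float32)] * 4
# Healthy batch: 4 different images.
different = [np.full((32, 32, 3), k, dtype=np.float32) for k in range(4)]

print(distinct_samples(same))       # 1
print(distinct_samples(different))  # 4
```

A count of 1 for a batch of 64 would have pointed straight at the data pipeline before any time was spent on models or learning rates.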

Robin on Linux
