## Using keras.layers.Embedding instead of a Python dictionary

At first, I used a function to transform words into word embeddings:

But I noticed that it uses a lot of CPU while GPU usage stays low. The reason is simple: looking words up in a dictionary with single-threaded Python is inefficient. We should use the Embedding layer in Keras to put the whole word-embedding table into GPU memory.
The code is not difficult to understand:
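
As a minimal sketch of the idea (NumPy standing in for the GPU, not the original Keras code): the words are mapped to integer ids once, and the vectors are then fetched with a single array gather from one table, which is the kind of lookup an Embedding layer performs on the device. The toy vocabulary, the random table, and the `encode` and `lookup` helpers are all assumptions for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; a real table would come from pretrained vectors.
vocab = ["<unk>", "cat", "dog", "fish"]
word_to_id = {word: i for i, word in enumerate(vocab)}

# One row per word, 8 dimensions per vector (random values for the demo).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(vocab), 8)).astype(np.float32)


def encode(words):
    """Map words to integer ids on the CPU; unknown words map to id 0."""
    return np.array([word_to_id.get(w, 0) for w in words], dtype=np.int32)


def lookup(ids):
    """Gather the rows for all ids in one indexing operation."""
    return embedding_table[ids]


ids = encode(["dog", "bird", "cat"])
vectors = lookup(ids)
```

In Keras the same table would typically be handed to `keras.layers.Embedding` as its weights, so only the small integer ids have to travel from the CPU to the GPU.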

This time, the program runs twice as fast as before. Using GPU memory (GDDR) to look up word embeddings is the right way.

## A few other lessons from Kaggle’s competition ‘Human Protein Atlas Image Classification’

Practice makes progress. Therefore I joined Kaggle’s new competition ‘Human Protein Atlas Image Classification’ right after the previous one.
I used to think I could get a higher ranking in an image processing competition. But actually, I didn’t even make it into the top half of the rankings. After almost three months of trial and error, here are my reflections:

1. To solve the unbalanced data problem, we need to use ‘focal loss’ instead of the normal cross entropy loss. I should have looked at other experts’ kernels earlier, so that I could adopt new techniques as soon as possible.

2. To augment images, ‘lower resolution’ may be a better way than ‘mix up’.

3. Try SGD with cosine decay, not only RMSProp.

4. MobileNet may overfit more severely than ResNet.

5. If dropout and weight decay still can’t regularize the model well enough, what should we do? (An open question; feature engineering may be the answer.)

6. Use a more powerful DNN framework, such as Keras, so I can spend more time on the model itself.
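
For item 1, the binary form of focal loss can be sketched in plain Python. The formula is the one from the focal loss paper (Lin et al.), $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$; the function names and the defaults $\gamma = 2$, $\alpha = 0.25$ are choices made for this illustration and are not taken from the competition code.

```python
import math


def binary_cross_entropy(p, y):
    """Cross entropy for one sample; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Share of the cross-entropy loss that survives for an easy positive (p = 0.95)
# and for a hard positive (p = 0.10).
easy_ratio = binary_focal_loss(0.95, 1) / binary_cross_entropy(0.95, 1)
hard_ratio = binary_focal_loss(0.10, 1) / binary_cross_entropy(0.10, 1)
```

The easy sample keeps a far smaller share of its cross-entropy loss than the hard one, so the many well-classified samples of the frequent classes no longer dominate the gradient.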

## Some errors in dataset pipeline of Tensorflow

To extend the image dataset with mixup, I used this snippet to mix two images:

But after generating images with this snippet, training reported errors:

The size of each image is 512x512x4 = 1048576 bytes, so I couldn’t understand why there was an image with a size of 8388608 bytes.
At first I suspected the dataset flow of TensorFlow. But after changing the code of the dataset pipeline, I found that the problem was not in TensorFlow.
I reviewed my image-generating code again and again and added some debug stubs. Finally, I found the problem: it’s not TensorFlow’s fault, but mine.
By using

The type of ‘new_image’ is ‘float64’, not ‘uint8’ like ‘major_image’ and ‘minor_image’! A ‘float64’ uses 8 bytes to store one element, and 1048576 x 8 = 8388608, which explains the ‘8388608’ in the error message.
To mix up images correctly, the code should be:
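
A minimal sketch of both the problem and a fix, assuming the images are NumPy `uint8` arrays of shape 512x512x4. The names `major_image`, `minor_image` and `new_image` follow the text above; the 0.75/0.25 weights, the constant pixel values, and the `mix_images` helper are made up for illustration and are not the original snippet.

```python
import numpy as np

# Two dummy 512x512x4 images, 1 byte per element.
major_image = np.full((512, 512, 4), 200, dtype=np.uint8)
minor_image = np.full((512, 512, 4), 100, dtype=np.uint8)

# Multiplying uint8 data by a Python float promotes the result to float64,
# so this array takes 8 bytes per element.
new_image = major_image * 0.75 + minor_image * 0.25


def mix_images(major, minor, major_weight=0.75):
    """Blend two uint8 images and cast the result back to uint8."""
    mixed = major * major_weight + minor * (1.0 - major_weight)
    # Round and clip before casting so every value stays inside 0..255.
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


fixed_image = mix_images(major_image, minor_image)
```

Checking `new_image.dtype` and `new_image.nbytes` right after the blend is a cheap way to catch this kind of silent type promotion before the data reaches the training pipeline.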

## Books I read in 2018

In 2018, I continued to learn more about machine learning and deep learning. “Deep Learning” suits me well, and “Hands-On Machine Learning with Scikit-Learn and TensorFlow” is a wonderful supplement for programming practice. I also learned some basics of reinforcement learning.

To teach my daughters programming, I read some books about Arduino. In the process of learning Arduino, I became more and more interested in electronics myself! After reading more technical documents about electronic components (diode, transistor, capacitor, relay, thyristor etc.) and microcontrollers (ATmega from Atmel, MSP430 from Texas Instruments, STM8 from ST and so on), a new area opened up to me.

History books are always my favorite. The most astonishing history book I read in 2018 is “The Last Panther”. It tells an extremely cruel but real story from WWII.

Kazuo Inamori is a famous entrepreneur in Japan. I read some of his books at the end of this year. Surprisingly, they really inspired me and even changed some of my thinking. I really want to thank him for his teaching.

## Writing text to a file with buffering disabled in Python3

In the Python2 era, we could use this code to write a file without buffering:

But in Python3, buffering can only be disabled for binary files:

One way to get text written to the file immediately is to call ‘flush()’ after every write:

Adding ‘flush()’ everywhere is a terrible experience for a programmer who needs to migrate code from Python2 to Python3. I really want to know: what was on the Python3 developers’ minds?
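
The behaviour can be checked with a short script. It is only a sketch: the temporary file and the variable names are made up for the demonstration. Besides calling `flush()`, it also tries `buffering=1`, which the `open()` documentation describes as line buffering for text mode, so each completed line is written out without an explicit `flush()`.

```python
import os
import tempfile


def read_back(path):
    """Read the file through a second handle to see what has reached it."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")

    # 1. Python3 refuses unbuffered text mode with a ValueError.
    try:
        open(path, "w", buffering=0)
        unbuffered_text_error = None
    except ValueError as exc:
        unbuffered_text_error = str(exc)

    # 2. Unbuffered binary mode works: encode the text by hand.
    with open(path, "wb", buffering=0) as f:
        f.write("first\n".encode("utf-8"))
        seen_binary = read_back(path)

    # 3. Line-buffered text mode: data is written out at each newline.
    with open(path, "w", buffering=1) as f:
        f.write("second\n")
        seen_line_buffered = read_back(path)

    # 4. Default buffering plus an explicit flush().
    with open(path, "w") as f:
        f.write("third")
        f.flush()
        seen_flush = read_back(path)
```

When the Python2 code wrote through `print`, `print(..., file=f, flush=True)` is another option that avoids a separate `flush()` call.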

## A successful rescue for a remote server

After installing CUDA-9.2 on a remote server, I found that the system couldn’t load nvidia.ko (the kernel module). dmesg showed:

The reason is that the kernel running on my system was built with the CONFIG_CC_STACKPROTECTOR compiler option turned on. Therefore I changed the default entry of grub2 and rebooted the server to boot into a kernel without this option.
But unfortunately, the server never started up again. All my code and data (including my colleague’s code and data) were on this server, so we got a little nervous.

Since the server is in a remote datacenter, we couldn’t just plug in a keyboard and a screen to debug. Thus I used the out-of-band management system to reboot the server into diskless mode. After entering this mode, I mounted the disk that holds the ‘/boot’ directory:

and manually changed ‘/boot/grub2/grubenv’ like this (‘saved_entry’ was 2 before):

Then I rebooted the server again. This time it started up smoothly, and all our code and data were intact.

## Some tips about python this week

### List of lists in Python
I created a list of lists by using the multiplication operator:

It’s weird that adding one item to the first list has a side effect on the second list! It seems ‘* 2’ makes two references to one list.
How to avoid this then? The answer is to use the normal syntax:
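
A small sketch of both behaviours (the variable names are made up for illustration):

```python
# '* 2' copies the reference, so both slots point at the same inner list.
shared = [[]] * 2
shared[0].append(1)

# A literal creates two separate inner lists.
literal = [[], []]
literal[0].append(1)

# A list comprehension also creates a new inner list for every slot,
# which is convenient when the length is not fixed in advance.
independent = [[] for _ in range(2)]
independent[0].append(1)

same_object = shared[0] is shared[1]                       # True
different_objects = independent[0] is not independent[1]   # True
```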

### A difference between Python2 and Python3
There are many trivial differences between Python2 and Python3, and I happened to find one this week:

In Python3, the values() of a dictionary is not of type ‘list’ but ‘dict_values’. To stay compatible with Python2, we need to add

to old Python2 code.
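
A minimal sketch of the difference under Python3, using a made-up dictionary. Wrapping the call in `list()` is the usual way to get the Python2 behaviour back, though it is an assumption that this matches the original fix:

```python
prices = {"apple": 3, "pear": 5}

values_view = prices.values()        # a dict_values view in Python3, not a list
values_list = list(prices.values())  # an independent list, as Python2 returned

# The view follows later changes to the dictionary; the list does not.
prices["plum"] = 7
```

Code that indexes the result (`values[0]`), sorts it in place, or concatenates it with another list needs the `list()` form, because a `dict_values` view supports none of these.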

## Compare implementation of tf.AdamOptimizer to its paper

When I reviewed the implementation of the Adam optimizer in TensorFlow yesterday, I noticed that its code is different from the formulas I saw in the Adam paper. TensorFlow’s formulas for Adam are:

But the algorithm in the paper is:

Then quickly I found these words in the document of tf.AdamOptimizer:

Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

And this time I did find the ‘Algo 2’ in the paper:

But how does ‘Algo 1’ transform to ‘Algo 2’? Let me try to derive it from ‘Algo 1’:

$\theta_t \gets \theta_{t-1} - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
$\implies \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{1}{\sqrt{\hat{v}_t} + \epsilon} \quad \text{ (put } \hat{m}_t = \frac{m_t}{1 - \beta_1^t} \text{ in) }$
$\implies \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}} \quad \text{ (put } \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \text{ in and ignore } \epsilon \text{) }$
$\implies \theta_t \gets \theta_{t-1} - \alpha_t \cdot \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}} \quad \text{ (let } \alpha_t = \alpha \cdot \frac{\sqrt{1-\beta_2^t}}{1 - \beta_1^t} \text{ and add a new } \hat{\epsilon} \text{ to avoid division by zero) }$
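
The result can be checked numerically. The sketch below is plain Python for a single scalar parameter, not TensorFlow’s implementation; the function names and the toy gradient are made up. Working the algebra through without dropping $\epsilon$ gives $\hat{\epsilon} = \epsilon \sqrt{1-\beta_2^t}$, and with that choice both update rules should agree up to rounding.

```python
import math


def moments(m, v, grad, beta1, beta2):
    """Update the biased first and second moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    return m, v


def step_algo1(theta, m, v, t, alpha, beta1, beta2, eps):
    """'Algo 1': bias-correct m and v first, then update the parameter."""
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps)


def step_algo2(theta, m, v, t, alpha, beta1, beta2, eps_hat):
    """'Algo 2': fold both corrections into the step size alpha_t."""
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    return theta - alpha_t * m / (math.sqrt(v) + eps_hat)


def compare(steps=100, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise f(x) = x^2 with both forms and return the two final parameters."""
    theta1 = theta2 = 1.0
    m1 = v1 = m2 = v2 = 0.0
    for t in range(1, steps + 1):
        m1, v1 = moments(m1, v1, 2.0 * theta1, beta1, beta2)
        theta1 = step_algo1(theta1, m1, v1, t, alpha, beta1, beta2, eps)

        m2, v2 = moments(m2, v2, 2.0 * theta2, beta1, beta2)
        eps_hat = eps * math.sqrt(1.0 - beta2 ** t)  # exact counterpart of eps
        theta2 = step_algo2(theta2, m2, v2, t, alpha, beta1, beta2, eps_hat)
    return theta1, theta2


theta_algo1, theta_algo2 = compare()
```

With a constant $\hat{\epsilon}$ instead, as in the formulation the TensorFlow note refers to, the two forms differ only through the tiny term in the denominator.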

## The bug about using hooks and MirroredStrategy in tf.estimator.Estimator

When I was using MirroredStrategy in my tf.estimator.Estimator:

Unable to find any answers on Google, I had to look into the code of ‘estimator.py’ in TensorFlow. Fortunately, the code defect is obvious:

The Estimator class doesn’t have any private attribute named ‘_distribution’; it only has ‘_train_distribution’ and ‘_eval_distribution’. So the fix is just to change ‘self._distribution.unwrap(per_device_hook)[0]’ to ‘self._train_distribution.unwrap(per_device_hook)[0]’.

I have submitted a pull request to TensorFlow to fix this bug in branch 1.11.

## Some lessons from Kaggle’s competition

About two months ago, I joined the ‘RSNA Pneumonia Detection’ competition on Kaggle. It ended yesterday, but I still have many experiences and lessons to reflect on.

1. Augmentation is extremely crucial. After using tf.image.sample_distorted_bounding_box() in my program, the mAP (mean Average Precision) on the evaluation dataset rose to a perfect value. Then I realised that I should have used a radical augmentation method in the first place. After all, for machine learning jobs such as image detection and image classification, the number of samples is only in the tens of thousands, which is quite small for extracting dense features. Thus we need to use a more powerful augmentation strategy or tool (albumentations may be a good choice).

2. SGD is good for generalisation. Previously I used Adam to get outstanding training accuracy. But I soon found it useless, since the evaluation accuracy was poor. Because the samples are too few, I can’t use my evaluation dataset (a 10% cut from the original data) to reliably predict the score on the competition leaderboard. Having no other choice, I had to use only SGD to train my model in the last stage.

3. Use more visual monitoring tools. At first my model had high training accuracy and low evaluation accuracy, but after I added too many regularisation methods (such as dropout and weight decay), both training accuracy and evaluation accuracy shrank to a very low value. The key to regularising a DNN is “keep the fitting capability on the training set, and then try to raise the evaluation accuracy”. If I had monitored both training and evaluation accuracy in real time, I would not have been trapped in this dilemma.

4. Think more, experiment less. Spend more time understanding and checking the source code and the mechanism of the model, instead of only adjusting hyper-parameters in vain.

There are still some questions I can’t answer at present:

1. Why can’t even ResNet-50 raise my training mAP to 0.5?

2. Why can’t a perfect mAP value on my evaluation dataset earn a good score on this competition’s leaderboard?

I will keep looking for the answers.