Wrong ‘struct timeval’ for setsockopt()

What if we deliberately fill in ‘struct timeval’ the wrong way, like this, to set the receive timeout to 3 seconds:

Then setsockopt() fails and returns -1.

Let’s look at the Linux kernel code behind the system call sys_setsockopt():

sock_setsockopt() invokes sock_set_timeout(), which looks like:

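That’s it. If ‘tv.tv_usec’ is negative, or is greater than or equal to USEC_PER_SEC (which equals 1000000), sock_set_timeout() returns -EDOM and setsockopt() fails. The seconds belong in ‘tv.tv_sec’; ‘tv.tv_usec’ may only hold the sub-second remainder.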
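The failure can be reproduced from user space. Below is a minimal sketch, not the original snippet from this post: the helper name set_recv_timeout() is made up for illustration, and it simply forwards the two fields to setsockopt() and reports the errno.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not from the original post): set SO_RCVTIMEO on fd.
 * Returns 0 on success, or a negative errno value when setsockopt() fails.
 */
int set_recv_timeout(int fd, long sec, long usec)
{
    struct timeval tv;

    tv.tv_sec = sec;    /* whole seconds */
    tv.tv_usec = usec;  /* must stay in the range 0 .. 999999 */

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
        return -errno;
    return 0;
}
```

With this helper, set_recv_timeout(fd, 0, 3000000) is the wrong way and comes back as -EDOM, while set_recv_timeout(fd, 3, 0) is the right way to ask for 3 seconds.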

Avoid “page allocation failure” in the Linux kernel on big-memory servers

After putting load on a Key-Value cluster, I found many errors in dmesg:

The “page allocation failure” error is hard to understand at first, because our servers have a very large memory capacity. Looking at the output of the “free” command, I noticed that a large amount of memory was being used to cache files. Maybe the “free” memory was too small, so the kernel could not get enough pages when it needed many at once.
But how can we reserve more “free” memory in the Linux kernel? According to this article, we can modify “/proc/sys/vm/min_free_kbytes” to raise the watermarks of Linux memory management, and the kernel will then try hard to keep that much memory free:


After changing “/proc/sys/vm/min_free_kbytes” to 1G, the errors became rare but still occurred. Then I changed it to 4G, and this time there were no more errors in dmesg.
In conclusion, the default value of “min_free_kbytes” in the kernel is too small for this kind of load; we had better raise “min_free_kbytes” on machines with a lot of memory.
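Since the file takes its value in kilobytes, “4G” means writing 4194304. The sketch below shows one way to write and read back such a value from C; the helper names and the KBYTES_PER_GIB macro are made up for illustration, and writing the real file needs root.

```c
#include <stdio.h>

/* 1 GiB expressed in kilobytes, the unit min_free_kbytes uses. */
#define KBYTES_PER_GIB (1024L * 1024L)

/* Hypothetical helper: write a decimal value to a sysctl-style file.
 * Returns 0 on success, -1 on error. */
int write_kbytes(const char *path, long kbytes)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", kbytes) > 0;
    if (fclose(f) != 0)   /* the kernel may reject the value at flush time */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the value back. Returns it, or -1 on error. */
long read_kbytes(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A call such as write_kbytes("/proc/sys/vm/min_free_kbytes", 4 * KBYTES_PER_GIB) would apply the 4G setting until the next reboot; to keep it across reboots, the usual place is a vm.min_free_kbytes line in /etc/sysctl.conf.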

Too many “ext4-dio-unwrit” processes in the system

After putting load on an application that writes a tremendous amount of data into an ext4 file system, we saw many “ext4-dio-unwrit” kernel threads in “top”. Many people say this is normal, so I checked the ext4 source code in the 2.6.32 Linux kernel.

Writing a file out starts in the write-back kernel thread, which calls generic_writepages() and then ext4_write_page():

Let’s look at ext4_add_complete_io():

It puts “io_end” on the work queue “dio_unwritten_wq”, which lives in ext4_sb_info. But where does “dio_unwritten_wq” come from? In fs/ext4/super.c:

Oh, it is the “ext4-dio-unwritten” kernel thread! (“top” shows the name cut down to “ext4-dio-unwrit” because a task name in the kernel holds at most 15 characters.) So the problem is solved: the application dirties pages in the file system cache, the write-back kernel thread writes these dirty pages into the specific file system (ext4 in this article), and finally ext4 puts the I/O on the work queue “ext4-dio-unwritten”, where it waits to have its unwritten extents converted into written extents.
Therefore, if there are no unwritten extents in ext4 (for example, when we only use the write() system call to append to a normal file), the “ext4-dio-unwritten” kernel threads will exist but use no CPU.
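Why does “top” show “ext4-dio-unwrit” when the work queue is called “ext4-dio-unwritten”? The sketch below is a user-space model, not kernel code. It assumes what the 2.6.32 workqueue code appears to do: format “name/cpu” into the task’s name buffer of TASK_COMM_LEN (16) bytes. The function name worker_comm() is made up for illustration.

```c
#include <stdio.h>

/* Size of the kernel's per-task name buffer, including the trailing NUL. */
#define TASK_COMM_LEN 16

/*
 * Model of naming a per-CPU workqueue thread: "<wq name>/<cpu>" is
 * formatted into a TASK_COMM_LEN-byte buffer, so anything past 15
 * characters is silently dropped.
 */
void worker_comm(char comm[TASK_COMM_LEN], const char *wq_name, int cpu)
{
    snprintf(comm, TASK_COMM_LEN, "%s/%d", wq_name, cpu);
}
```

For the work queue name “ext4-dio-unwritten” only the first 15 characters survive, and the “/cpu” suffix is lost completely, so the threads for every CPU end up with the same visible name.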

Upgrade to kernel-4.4.1 on CentOS 7

After I compiled and installed kernel-4.4.1 (from kernel.org) on my CentOS 7 machine, I rebooted it. But it could not boot up correctly.

After extracting the contents of the initramfs and checking them, I found that the ‘mpt2sas’ kernel driver had not been added to the initramfs, so the /boot partition could not be loaded.

This problem seems to be common. Because changing the dracut source code or configuration file on all servers is not viable, I chose to add a command to my kernel rpm spec file:

This will add drivers to the corresponding initramfs file.

But the kernel still could not boot up. This time, I found that the kernel command line in GRUB2 looked like:

It looks like we should change it to a UUID. So I added another command to the kernel rpm spec file:

This gets the UUID of the boot disk from /proc/cmdline and passes it to the GRUB2 configuration file.
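The extraction step itself is plain string handling. The sketch below shows the idea in C rather than in the shell command a spec file would use. It assumes the running kernel was booted with a parameter of the form root=UUID=<value>, and the function name root_uuid_from_cmdline() is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of "root=UUID=<value>" from a kernel
 * command line (the content of /proc/cmdline) into out.
 * Returns 0 on success, -1 if the parameter is missing or out is too small.
 */
int root_uuid_from_cmdline(const char *cmdline, char *out, size_t outlen)
{
    static const char key[] = "root=UUID=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = cmdline;

    while ((p = strstr(p, key)) != NULL) {
        /* Accept the key only at the start of a parameter. */
        if (p == cmdline || p[-1] == ' ') {
            const char *val = p + keylen;
            size_t len = strcspn(val, " \t\n");  /* value ends at whitespace */

            if (len == 0 || len >= outlen)
                return -1;
            memcpy(out, val, len);
            out[len] = '\0';
            return 0;
        }
        p += keylen;
    }
    return -1;
}
```

If the command line names the root device some other way (for example root=/dev/...), the function returns -1 and there is no UUID to hand over.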

Now kernel-4.4.1 boots up correctly on CentOS 7.

“kmem_cache_create: duplicate cache XXX”

In my kernel module, firstly I wrote:

On CentOS 7 this module works fine, but after porting it to CentOS 6, the kernel module reports:

The key to this problem is in the implementation of kmem_cache_create() in the 2.6.32 Linux kernel (for CentOS 6):

After creating a new pool, it only keeps a pointer to ‘name’; it does not strdup() a copy. But the ‘name’ in my kernel module is a temporary variable (on the stack), so when a later call compares its name against that stale pointer, the kernel considers the name “duplicated”.
The correct code should look like:

But why did the old kernel module not report an error on CentOS 7? Because on CentOS 7 the default memory allocator is SLUB, while on CentOS 6 it is SLAB, and the two have totally different implementations.
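The effect of keeping only the pointer can be shown in user space. The sketch below is a toy model, not kernel code. The struct, the function names and the “pool_N” names are all made up for illustration, and the only thing it shares with SLAB in 2.6.32 is the assumption that the name pointer is stored without copying and later compared with strcmp().

```c
#include <stdio.h>
#include <string.h>

#define MAX_CACHES 8

/* Toy stand-in for the allocator's cache list: keeps the caller's
 * pointer, not a copy of the string. */
struct cache_list {
    const char *names[MAX_CACHES];
    int count;
};

/* Returns 0 on success, -1 if the name matches a registered one
 * (or the list is full). */
int toy_cache_create(struct cache_list *list, const char *name)
{
    for (int i = 0; i < list->count; i++)
        if (strcmp(list->names[i], name) == 0)
            return -1;                  /* "duplicate cache" */
    if (list->count == MAX_CACHES)
        return -1;
    list->names[list->count++] = name;  /* pointer stored, no strdup() */
    return 0;
}

/* Buggy pattern: both names are built in the same stack buffer. */
int create_two_with_stack_buffer(struct cache_list *list)
{
    char name[32];

    snprintf(name, sizeof(name), "pool_%d", 0);
    if (toy_cache_create(list, name) != 0)
        return -1;
    /* Rewriting the buffer also rewrites the name already registered. */
    snprintf(name, sizeof(name), "pool_%d", 1);
    return toy_cache_create(list, name);  /* compares "pool_1" with itself */
}

/* Fixed pattern: every name lives in its own static storage. */
int create_two_with_static_names(struct cache_list *list)
{
    static const char *const names[] = { "pool_0", "pool_1" };

    if (toy_cache_create(list, names[0]) != 0)
        return -1;
    return toy_cache_create(list, names[1]);
}
```

In this model create_two_with_stack_buffer() fails on the second pool even though the two names were meant to differ, and the list must not be used after it returns because it still points into a dead stack frame. create_two_with_static_names() succeeds.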

Run docker on centos6

Docker uses device mapper thin provisioning as its default storage, so if we want to run docker on CentOS 6, we should update the kernel first. I used Linux kernel 4.11 and noticed that these kernel options should be set:

After building the kernel and rebooting into it, I still couldn’t launch the docker service, and finally found the solution:

Solve a USB network card problem

I have been doing some source code porting work on the Linux kernel recently.
After I rebooted my server (ubuntu 14.04) into the new kernel version 3.19.8, it could no longer be reached over ssh, only over the serial port.
The server uses a USB network card, so at first I suspected that some kernel driver for the USB NIC was missing from the .config file. Therefore I booted back into the old kernel and tried to find some useful information in ‘dmesg’:

eth3 uses an MII port, so when I grepped for “link up …lpa” in the 3.19.8 kernel source code, I found that it must be printed by this code in drivers/net/mii.c:

The only place where the “ASIX AX88772B” driver calls mii_check_media() is in drivers/net/usb/asix_devices.c:

So far, the reason seems to be that the system does not “reset” the USB network card after booting up. But why does it forget to reset the USB NIC only in the 3.19.8 kernel? After checking /etc/network/interfaces:

the answer is: the device name of the USB NIC has been changed to “eth5” by udevd in the 3.19.8 kernel (the new kernel recognises additional network ports, so eth0/eth1/eth2/eth3/eth4 are all already taken), so the network scripts can’t bring it up.

Fix the name of the USB NIC by adding the content below to /etc/udev/rules.d/70-persistent-net.rules:

and add configuration to /etc/network/interfaces to bring “usbnet0” up automatically:

The size of pipe in linux

We use a pipe in our program and hit a new problem: it fails when we try to write 16MB of data into the pipe in one go. It looks like a pipe has a limited size. But what exactly is that size? After searching the web, the answers are inconsistent: some say it’s 16KB and others say it’s 64KB. So I had to read the kernel code myself to find the correct answer.
Since all the servers in my company use ali_kernel, which is based on the 2.6.32 CentOS kernel, I looked up the original routines:

It looks like all write operations on the pipe are handled by “write_pipefifo_fops”. Let’s go in:

Clearly, pipe_write() is responsible for writing. Keep going.

As shown above, the kernel allocates a new page when a write comes in and the pipe does not have enough space. Every time it adds a page it increases ‘pipe->nrbufs’, and once ‘nrbufs’ reaches PIPE_BUFFERS the routine blocks, which means the write() system call waits. ‘PIPE_BUFFERS’ is set to 16 and a page in the Linux kernel is 4KB (on x86), so a pipe in ali_kernel can hold 64KB (16 * 4KB) of data at one time.
This has changed since kernel version 2.6.35, which added a new proc entry, ‘/proc/sys/fs/pipe-max-size’.
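The limit can also be checked empirically, without reading kernel code. The sketch below uses only calls that already exist on 2.6.32 (pipe(), fcntl() with O_NONBLOCK, write()), and the function name measure_pipe_capacity() is made up for illustration. It fills an empty pipe that nobody reads from and counts the bytes accepted before write() would block.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill a fresh, non-blocking pipe until write()
 * would block. Returns the number of bytes accepted, or -1 on error.
 */
long measure_pipe_capacity(void)
{
    int fds[2];
    char chunk[4096] = {0};
    long total = 0;
    int flags;

    if (pipe(fds) == -1)
        return -1;

    /* Make the write end non-blocking so a full pipe yields EAGAIN. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;               /* interrupted, try again */
        if (n == -1 && errno != EAGAIN)
            total = -1;             /* a real error, not "pipe full" */
        break;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

On a kernel with the fixed 16-page limit described above, the result should be the 64KB computed earlier. On kernels that have ‘/proc/sys/fs/pipe-max-size’, the number may differ, because the capacity of a pipe can be changed there.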