China Linux Storage & Filesystem 2014 workshop (second day)

The first topic on the second day of CLSF 2014 was about NFS, led by Tao Peng from PrimaryData. The NFS protocol has been updated to 4.2, and the main job of the NFS community now is implementing features on the server side (such as "server side copy") that have long existed in local file systems.


Then Liang Xie, a distributed-software developer at Xiaomi, introduced the basic infrastructure of Xiaomi Cloud Storage and reported some severe problems in its IO stack. The first problem is that heavy write pressure causes long latency on ext4, while the latency becomes much shorter if the local file system is replaced by xfs.

Zheng Liu (from Alibaba): The journal implementation in xfs is better than ext4's. When a large amount of write operations hits ext4, it has to checkpoint the journal and flush the disk, which may take a long time. I think you could try the 'no journal' mode of ext4, which was developed by the Google guys.

Another problem is that Xiaomi wants to use the deadline IO scheduler, but cgroup-based IO control does not work with 'deadline'.

Coly Li (from Alibaba): I suggest you try tpps, a simple and effective IO scheduler in ali_kernel.

The next topic, about ext4, was held by Zheng Liu. This year ext4 added no major new features (maybe that's why it is so stable). Google's Chrome OS wants to store per-user data such as cookies securely, so an encryption feature needs to be added to ext4. We asked why Chrome OS does not simply use eCryptfs on top of ext4; Zheng Liu's answer was that the requirement came from Google itself, so nobody outside knows why. Ext4 also added a new option "sparse_super2" (to store backup superblocks only at the beginning and the end of the file system) and "packed_meta_blocks" (to squeeze all ext4 metadata into the beginning of the disk, mainly for SMR disks).


The last topic was about OSv, the most popular and cutting-edge OS at this conference. OSv is an operating system designed for virtual machines and cloud environments. It strips down the IO and network stacks, which makes it very fast and efficient. The JVM and several scripting languages (such as Python, Ruby, and node.js) can already run on OSv, so I think it could win a large part of the cloud market, since it can run hadoop/spark and many front-end web applications.


China Linux Storage & Filesystem 2014 workshop (first day)

CLSF (China Linux Storage & File System Workshop) is an effort to bring local Linux kernel hackers together to share and exchange ideas. CLSF is an invitation-only workshop; in order to keep communication effective, only a small group of people is invited. Most of the invitees are active upstream Linux kernel developers from China who focus on the I/O and storage subsystems.

CLSF 2014 was held in the office of Xiaomi, a famous consumer electronics company in China. Participants mainly came from Huawei, Fujitsu, Intel, Alibaba and other companies.

The first topic, led by Jiufei Xue from Huawei, was about ocfs2. Huawei is building its private cloud product on ocfs2, so over the last two years the kernel developers at Huawei have contributed many bug-fix patches and new features to the ocfs2 community. This year they added range locks to ocfs2, so users can lock not only a whole file but also a specific range of a file, which improves performance in a cluster when many clients read and write the same files at the same time.


F2FS is a new filesystem in the kernel, targeted at devices like SD cards. It stores metadata at the beginning of the device (random read/write performance at the beginning of an SD card is very good) and uses an indirect-block layout just like ext2. The reason for not using extents (as ext4 does) is to make garbage collection of NAND blocks easier. F2FS also merges many 'sync' operations into one, mainly for speed (a 'sync' operation on a mobile phone's SD card is very slow).

Zefan Li (from Huawei): If one day the SD card adds an FTL layer to its controller and works just like an enterprise SSD, will the advantage of F2FS disappear?
Ying Huang (from Intel): Let's look at it another way. If F2FS does the whole job of the SSD firmware, we could use a kernel file system on NAND flash directly, which would save a lot of money 🙂


Bo Liu from Oracle held the topic about btrfs. New features for btrfs decreased this year, and the main work has been fixing bugs. Switching btrfs to the standard kernel worker threads caused a serious bug that could lose user data; Bo Liu spent a long time reproducing the bug and, fortunately, fixed it at last.


Coly Li (from Alibaba): I attended the Linux Plumbers Conference this year. In a presentation about docker, the speaker complained that btrfs is the most unstable filesystem in the Linux kernel. The developers in the btrfs community may have to spend more time on fixing bugs.

The last topic of the first day was held by an engineer from Memblaze (a Chinese counterpart of Fusion-io). They face many problems in building an All-Flash-Array (AFA). In an AFA, the Linux system in the box has become the bottleneck of the whole IO path: too many interrupts cost too much CPU time; the socket and TCP/IP stack implementation in Linux is too inefficient; context switches between processes make IOPS unstable; even the filesystem itself spends too much time looking up files.


Besides these problems, they also put forward a new viewpoint: SSDs will become a perfect device for storing cold data, as NVMe and PCM will become cheap very quickly. This point triggered a long and heated discussion. If PCM comes into the market, the block layer and the filesystems in the Linux kernel may become obsolete, and almost everyone sitting in the meeting room would be out of work :). So maybe the time for a big change in the storage area is coming soon.


The size of a pipe in Linux

We use pipes in our program and ran into a new problem: it fails when we try to write 16MB of data into a pipe at one time. It looks like a pipe has a limited size, but what exactly is that size? After searching the web, the answers were inconsistent: some say it is 16KB and others say it is 64KB. Therefore I had to read the kernel code myself to find the correct answer.
Since all the servers in my company are running ali_kernel, which is based on the 2.6.32 CentOS kernel, I started from the original code path:

It looks like all write operations on a pipe are handled through "write_pipefifo_fops". Let's look inside:

Clearly, pipe_write() is responsible for writing. Keep going.

As shown above, the kernel allocates a page when a new write arrives and the pipe does not have enough space. Every time it adds a page it increases 'pipe->nrbufs', and once 'nrbufs' reaches PIPE_BUFFERS the routine blocks, which means the write() system call has to wait. PIPE_BUFFERS is set to 16, and a page in the Linux kernel is 4KB, so a pipe in ali_kernel can hold 64KB (16 * 4KB) of data at one time.
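To double-check this conclusion from user space, here is a small test program of my own (not part of the kernel code walked through above) that fills a non-blocking pipe until write() returns EAGAIN; on a 2.6.32-based kernel such as ali_kernel it should report 65536 bytes:

```c
/* Measure how much data fits in a pipe before write() would block. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char page[4096] = {0};
    long total = 0;

    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }
    /* Make the write end non-blocking so write() returns EAGAIN
     * instead of sleeping once the pipe is full. */
    fcntl(fds[1], F_SETFL, O_NONBLOCK);

    for (;;) {
        ssize_t n = write(fds[1], page, sizeof(page));
        if (n < 0) {
            if (errno == EAGAIN)
                break;          /* the pipe is full */
            perror("write");
            return 1;
        }
        total += n;
    }
    printf("pipe capacity: %ld bytes\n", total);
    return 0;
}
```

Note that newer kernels may report a different default, so the 64KB figure should not be hard-coded.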
This has changed since kernel version 2.6.35, which added a new proc entry, '/proc/sys/fs/pipe-max-size'.
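On those newer kernels the capacity of an individual pipe can also be queried and resized through fcntl(); a minimal sketch, assuming Linux ≥ 2.6.35 and glibc with _GNU_SOURCE defined:

```c
/* Query (and optionally enlarge) a pipe's capacity on kernels >= 2.6.35. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }
    printf("default capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    /* Ask for 1MB; the kernel rounds the request up to a power-of-two
     * number of pages and caps unprivileged users at
     * /proc/sys/fs/pipe-max-size. */
    if (fcntl(fds[1], F_SETPIPE_SZ, 1 << 20) < 0)
        perror("F_SETPIPE_SZ");
    printf("new capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
    return 0;
}
```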

Problems about using zookeeper

Problem 1:

The zookeeper cluster had been running well for half a year. But today, after I reconfigured it and ran the start command,

it failed to start up and reported

The key point is the last message, "Invalid config" (the log4j line is just a warning); therefore I reviewed zoo.cfg many times but found no mistake at all.
After checking all the configuration, I eventually found the problem: the "myid" file was missing. After adding the "myid" file, zookeeper started up correctly.

It seems the error log of zookeeper is misleading: it says the config file is invalid, but the real reason is that a config file is missing.

Problem 2:

To tolerate the failure of at most four servers, we assumed that a five-server zookeeper cluster would be enough. After studying Paxos for a while, a question occurred to me: the majority of a five-server cluster is three servers, so how could zookeeper elect a new leader if more than two servers are down? I ran the test and found that zookeeper does indeed fail to work once more than two servers are shut down.
The correct size of a zookeeper cluster that can tolerate the failure of four servers is nine, because after four servers are shut down, the five survivors still form a majority of the nine-server cluster.
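The general rule behind this (my own illustration, not something presented above): zookeeper needs a strict majority of the configured ensemble to elect a leader, so surviving f failed servers requires n >= 2f + 1 servers. A tiny sketch of the arithmetic:

```c
/* Minimal ensemble size needed to survive f server failures:
 * a strict majority must remain alive, i.e. n - f > n / 2,
 * which gives n >= 2f + 1. */
#include <stdio.h>

static int min_ensemble(int f)
{
    return 2 * f + 1;
}

int main(void)
{
    int f;
    for (f = 1; f <= 4; f++)
        printf("tolerate %d failures -> need %d servers (quorum %d)\n",
               f, min_ensemble(f), min_ensemble(f) / 2 + 1);
    return 0;
}
```

For f = 4 this prints an ensemble of 9 with a quorum of 5, which matches the nine-server answer above.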