Yearly Archives: 2014

Books I read in year 2014

Buried Cities: Pompeii, Olympia, Mycenae

I have watched “Fantastic Mr Fox” four years ago, but just read its novel this year. It’s really a interesting story even I am not a child anymore. The three bad farmers want to kill Mr Fox and his whole family, but Mr Fox just steal their goods without hurting anybody, which is a good idea to me 🙂
“Interstellar” is no doubt the most amazing science fiction this year. This is a story about a father travel a very very long time and space to rescue his children, and also, his descendants. The most attractive part of this novel for me is not those fashion or professional words like “wormhole”, “black hole” etc, but the struggling of the lone father (or the human being). It also give me much power and courage to scrabble out of the traps in my own life.

A fight to save a old server ….

Coly, who is my old colleague, bought a server computer in 2010. This server, which worth about 30,000 Yuan(RMB) 4 years ago, is composed by a ASUS KGPE-D16 motherboard, two AMD-6172 CPUs, and a 670W power supply. Recently, we need many servers to build some software, so I beg Coly to bring his 4 years old server to our small computer room in our office, and he promised.
After installing Linux on this server, I found out that only one CPU could boot up and run, the other one can’t be recognized by operation system. I asked Coly why, and he answered me maybe one CPU was broken.
I thought a mother-board with two sockets can only run one CPU was a great waste; therefore I bought a new CPU and setup it on the mother-board. But after that, the whole server couldn’t boot up totally. Having tried many ways but all failed, I turn to ask Coly to rescue his own old server.
At last, Coly come here and open the lid of the server case, and becomes astonishing very soon.
Coly: What on earth have you done to my server! Have you killed a man who sales silicone grease? You paste too much silicone grease on the CPU, even the edge of the sockets also be contaminated!
Me: Sorry, I just want to make sure the CPU will not be too hot.
Coly: Yeah, they will never be hot now —— because they can’t even run.
Then, Coly kneels on the floor and begin to clean all the silicon grease on the mother-board, which is a really hard work.

But, after clearing, the server still could not boot up.
Coly: Maybe the mother-board has broken, or maybe power supply got a shot-circuit. How much is a KGPE-D16 motherboard in TAOBAO (a very famous e-commerce website in China)? It cost me 8,000 Yuan 4 years ago.
Me: About 800 Yuan.
Coly: ….Mother board becomes cheap so fast. Then, how much is a 670W power supply in TAOBAO? It was 800 Yuan before.
Me: En, about 600 Yuan.
Coly: Ah-ha, looks power supply is a hedging tool for preserving monetary value. I reckon I will buy a large mount of power supply instead of stocks or gold to preserve my money .
In my opinion, a power supply is more stable than a mother board, so I bought a new mother board and luckily boot up the server with two CPU (24 cores). At last, no servers or machines are wasted, in my hand 🙂

Running Fedora 21 on qemu-system-aarch64

Foundation Model V8 is a emulator for ARM64 architecture, but it has many limits, such as slow CPU, small memory (about 2GB), etc. Therefore I want to run fedora on qemu emulator for aarch64. There are already many docs to introduce how to run Foundation Model V8 on Fedora 21 on the web, but none of them could point out the way to run qemu-aarch64 on Fedora 21. Spent about a week, I finally find out the path.
Step 1 Build the edk2 UEFI
Followed this article，you should download the linaro toolchain and edit the “Conf/target.txt”. Then follow these steps to build a file in “Build/ArmVirtualizationQemu-AARCH64/DEBUG_GCC48/FV/QEMU_EFI.fd”
Step 2 Build the qemu-aarch64
I recommend the latest stable QEMU-2.2.0

wget http://wiki.qemu-project.org/download/qemu-2.2.0.tar.bz2
tar jxf qemu-2.2.0.tar.bz2
cd qemu-2.2.0
./configure --target-list=aarch64-softmmu
make

Now, you have file in “qemu-2.2.0/aarch64-softmmu/qemu-system-aarch64”
Step 3 Download the Fedora 21 image for Foundation Model V8

wget https://dmarlin.fedorapeople.org/fedora-arm/aarch64/F21-20140829-foundation-v8.tar.xz
tar Jxf F21-20140829-foundation-v8.tar.xz

Now you have “F21-20140829-foundation-v8/F21-20140829-foundation-v8.img”
Step 4 Run the qemu

qemu-2.2.0/aarch64-softmmu/qemu-system-aarch64 -M virt -cpu cortex-a57 \
        -smp 1 -m 4096 \
        -bios QEMU_EFI.fd \
        -drive if=none,file=F21-20140829-foundation-v8.img,id=hd0,cache=unsafe -device virtio-blk-device,drive=hd0 \
        -netdev user,id=unet,hostfwd=tcp::5555-:22 -device virtio-net-device,netdev=unet -nographic

The Qemu will boot into the UEFI, you could select into the “shell”, in the UEFI shell:

shell>fs0:
fs0:>EFI\fedora\grubaa64.efi

Then you can enter into a grub provided by Fedora 21 and boot into it automaticly.
Step 5 Add the UEFI entry for fedora permanently (optional)

cat QEMU_EFI.fd /dev/zero | dd iflag=fullblock bs=1M count=64 of=flash0.img
dd if=/dev/zero of=flash1.img bs=1M count=64

Launch qemu-system-aarch64 by change “-bios xxx” to “-pflash flash0.img -pflash flash1.img” and after enter the UEFI shell:

shell>bcfg boot add 0 EFI\fedora\grubaa64.efi "fedora"
shell>reset

The UEFI boot entry will write into flash*.img file and the Qemu could boot into Fedora 21 automaticly now.

Run aarch64 binary on x86_64 machines

If we use qemu-arm64-system directly, it will cost too much time on IO and systemcalls.So I try to use SuSE’s userpsace mode qemu, which only reinterprets the arm64 instructions to x86_64 but processes all systemcalls to local host.
This installation manual for user-mode qemu-arm64 has tested on debian-7.7.0
Step 1, make directory for chroot.

mkdir ~/myarm

Step 2, build qemu of SuSE:

cd ~/
git clone https://github.com/susematz/qemu/tree/aarch64-1.6 qemu-arm64
cd ~/qemu-arm64
apt-get build-dep qemu
./configure --target-list=arm64-linux-user  --static --disable-werror

Add this patch to SuSE’s qemu before make it:

--- config-host.mak~    2013-10-30 16:09:57.440144183 +0000
+++ config-host.mak     2013-10-30 16:36:42.656104017 +0000
@@ -120,7 +120,7 @@
 LDFLAGS=-Wl,--warn-common -m64 -static -g
 LIBTOOLFLAGS= -Wc,-fstack-protector-all
 LIBS+=-lrt -pthread -lgthread-2.0 -lglib-2.0 -lrt -lpcre    -liscsi
-LIBS_TOOLS+=-Wl,-z,relro -lssh2 -lgcrypt   -lvdeplug -luuid  -laio
+LIBS_TOOLS+=-Wl,-z,relro -lssh2 -lgcrypt   -lvdeplug -luuid  -laio -lgpg-error
 EXESUF=
 LIBS_QGA+=-lrt -pthread -lgthread-2.0 -lglib-2.0 -lrt -lpcre
 POD2MAN=pod2man --utf8

Make the qemu and install it:

make -j8

Step 3, download arm64 disk image of Fedora 19, and extract it.

cd ~/
wget "http://dmarlin.fedorapeople.org/fedora-arm/aarch64/F19-aarch64-efi.tar.xz"
xd -d F19-aarch64-efi.tar.xz
tar xf F19-aarch64-efi.tar
virt-tar-out -a ../F19-aarch64-efi/aarch64-efi.img / - | sudo tar xf -

Copy the qemu-arm64

cd ~/qemu-arm64
cp arm64-linux-user/qemu-arm64 ~/myarm/usr/bin/qemu-arm64-static

Step 4, configure the binfmts-misc

sudo apt-get install binfmt-support qemu

and put the content below into /usr/share/binfmts/qemu-arm64 in normal system(not the chroot)

package qemu-user-static
interpreter /usr/bin/qemu-arm64-static
flags: OC
offset 0
magic \x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7
mask \xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff

Then run

sudo update-binfmts --import qemu-arm64

Step 5, you could chroot into the user mode qemu now:

cd ~/myarm
sudo mount -o bind /dev ./dev
sudo mount -o bind /dev/pts ./dev/pts
sudo mount -o bind /proc ./proc
sudo mount -o bind /sys ./sys
sudo chroot .

Now, If you typein “arch”, it will report “aarch64”
Addtionally, add below contents into “aarch64” environtment’s /etc/yum.repos.d/stage4.repo:

[stage4]
name=Fedora fc19-stage4 - aarch64
failovermethod=priority
baseurl=http://arm.koji.fedoraproject.org/aarch64/stage4/
enabled=1
#metadata_expire=7d
gpgcheck=0

You could use yum to install rpm of aarch64 now. If it report DNS unavailable, you could append

152.19.134.191  dmarlin.fedorapeople.org
209.132.181.31  arm.koji.fedoraproject.org

into /etc/hosts
I use this user-mode of arm64 Qemu to build linux kernel, but the speed of compiling is not very perfrect. After all, the user-mode Qemu only accelerate the speed of systemcall, not the reinterpreter of instructions. But in my opinion, it is still very useful because it could use host’s resource of network and storage efficiently (Don’t you think it looks like linux-container or lxc/docker ?). If this news is true, I think facebook could use cheap x86_64 machines to test their IOS apps instead of costly Mac minis 🙂
Reflinks:
How to run aarch64 binaries on an x86-64 host using qemu userspace emulation
Arm64Qemu

China Linux Storage & Filesystem 2014 workshop (PPT or PDF for download)

ext4: link
f2fs: link
ocfs2: link
ubifs: link
btrfs: link

China Linux Storage & Filesystem 2014 workshop (second day)

The first topic in second day of CLSF 2014 is about NFS, which lead by Tao Peng from PrimaryData. The protocol of NFS is updated to 4.2 and the mainly jobs of NFS community is implement many features (such as “server side copy”) which had been used on local file system onto server side.

Then the distributed software developer in Xiaomi —— Liang Xie introduce the basic infrastructure of Xiaomi Cloud Storage and report some severe problems about IO stack. The first problem is that heavy write pressure will cause long latency on ext4 file system, and the latency will be short if the local file system is replaced by xfs.
Zheng Liu (from Alibaba): The implement of journal in xfs is better than ext4. When large mount of write operations come to ext4, it have to write checkpoint in journal and flush disk, which may take a long time. I think you could try ‘no journal’ mode which developed by google guys in ext4.

#note: the way to use no journal mode in ext4
mkfs.ext4 -O ^has_journal, .... /dev/sdx

Another problem is Xiaomi want to use io-scheduler of deadline but they can’t use cgroup by ‘deadline’.
Coly Li (from Alibaba): I suggest that you could try tpps which is a simple and effective io-scheduler in ali_kernel.
Next topic about ext4 is hold by Zheng Liu. In this year, ext4 has add no new features (may be that’s why it is so stable). In google’s chrome OS, they want to store something like cookie for every user, so it need to add encryption feature in ext4. We ask why chrome os not using encryptfs on ext4. The answer of Zheng Liu is: the requirement is came from google itself, so no one knows why. Ext4 also add a new option “sparse_super2” (to store super block only in the beginning and the end of ext4 groups) and “packed_meta_block” (to squeeze all meta data of ext4 into the beginning of disk, mainly for the SMR disk).

The last topic is about OSv, the most popular and cutting-edge OS in this conference. OSv is a operating system based on virtual machine and cloud environment. It reduce the IO and network stack which makes it very fast and effective. The jvm and some scripts language (such as python, ruby, node.js) could already run on OSv, therefore I consider that it has wined a large part of cloud market for it could run hadoop/spark and many front-end web application.
OSv

China Linux Storage & Filesystem 2014 workshop (first day)

CLSF (China Linux Storage & File System Workshop) is an effort to make local Linux kernel hackers get together to share and exchange ideas. CLSF is an invitation only workshop, in order to have effective communication, only a small group of people will be invited. Most of the invitees are active upstream Linux kernel developers locally from China, focus on I/O and storage sub-systems.
CLSF 2014 was hold in office of XiaoMi which is a famous consumer electronics company in china. Participators are mainly from Huawei, Fujitsu, Intel, Alibaba and other companies.
The first topic lead by Jiufei Xue from Huawei is about ocfs2. Huawei was building their private cloud product on ocfs2, so in recent two years the kernel developers in Huawei commited many fix patches and new features into ocfs2 community. In this year, they add range lock into ocfs2, so users could not only lock the whole file but a specific range of one file, which will promote the performance in cluster when many clients read/write files at the same time.

F2FS is a new filesystem in kernel. It is based on devices like SD-card. It stores meta-data in the beginning of device (The random read/write performance in the beginning of SD-card is very well) and use indirect-block layout just like ext2. The reason of not using extent (like ext4) is for its convenience of garbage collection of NAND block. F2FS also merges many ‘sync’ operations into one, mainly for the speed (The ‘sync’ operation on SD-card in mobile-phone is very slow).
Zeifan Li (from Huawei): If one day, the SD-card adds FTL layer into its controller and let SD-card works just like enterprise SSD. Will the advantage of F2FS be disappear?
Ying Huang (from Intel): Let’s look at this another way. If F2FS do the whole work of SSD firmware, we could use kernel file system on NAND flash directly, which will save a lot of money 🙂
f2f2

Bo Liu from Oracle hold the topic about btrfs. The new features for btrfs decreased in this year, and the main work is fixing bugs. Using the standard kernel worker thread in btrfs cause a serious bug which would lost user’s data. Bo Liu spends a long time to repeat the bug, and fortunately, fix it at last.
btrfs

Coly (from Alibaba): I have joined the linux plumber conference in this year. On a presentation about docker, the speaker complained that btrfs is the most unstable filesystem in linux kernel. The developers in btrfs community may have to spend more time on fixing bugs.
The last topic in first day is hold by an engineer from Memblaze (a china-version fusion-io company). They face many problmes in building All-Flash-Array (called AFA). In AFA, the linux system in box has become bottleneck in the whole IO path: too much interrupts cost too much CPU and time; the implemention of socket and TCP/IP stack in linux is too insufficient; context switch for processes also make IOPS unstable; even filesystem itself spend too much time in searching files.
memblaze

Besides these problems, they also speak out a new viewpoint: SSD will become a perfect device to store cold data as NVME and PCM will become heap very quickly. For this point, we began a long and heated discussion. If PCM comes into the market, the Block-layer and filesystem in linux kernel will become obsolete, and almost all the guys sit in the meeting-room will be out of work :). So, maybe the time has come closely for the big chang in storage area.
clsf 2014

The size of pipe in linux

We use pipe in our program and face a new problem: it fail when we try to write 16MB data into a pipe in one time. Looks pipe has a limited size. But what exactly the size is? After searching on the web, the answers are not inconsistent, some say it’s 16KB and others say it’s 64KB. Therefore I have to watch kernel code by myself to find the correct answer.
Since all the servers in my company is using ali_kernel, which is based on 2.6.32 centos kernel, I find the original routine of codes:

sys_pipe() --> sys_pipe2() --> do_pipe_flags() --> create_write_pipe():
struct file *create_write_pipe(int flags)
{
......
        path.dentry->d_flags &= ~DCACHE_UNHASHED;
        d_instantiate(path.dentry, inode);
        err = -ENFILE;
        f = alloc_file(&path, FMODE_WRITE, &write_pipefifo_fops);
        if (!f)
                goto err_dentry;
        f->f_mapping = inode->i_mapping;
......

Looks all the operations to the pipe about write are managed by “write_pipefifio_fops”. Let’s get in:

const struct file_operations write_pipefifo_fops = {
        .llseek         = no_llseek,
        .read           = bad_pipe_r,
        .write          = do_sync_write,
        .aio_write      = pipe_write,
        .poll           = pipe_poll,
        .unlocked_ioctl = pipe_ioctl,
        .open           = pipe_write_open,
        .release        = pipe_write_release,
        .fasync         = pipe_write_fasync,
};

Clearly, pipe_write() is responsed for writting. Keep going.

static ssize_t
pipe_write(struct kiocb *iocb, const struct iovec *_iov,
            unsigned long nr_segs, loff_t ppos)
{
......
        for (;;) {
                int bufs;
                if (!pipe->readers) {
                        send_sig(SIGPIPE, current, 0);
                        if (!ret)
                                ret = -EPIPE;
                        break;
                }
                bufs = pipe->nrbufs;
                if (bufs < PIPE_BUFFERS) {
                        int newbuf = (pipe->curbuf + bufs) & (PIPE_BUFFERS-1);
                        struct pipe_buffer *buf = pipe->bufs + newbuf;
                        struct page *page = pipe->tmp_page;
                        char *src;
                        int error, atomic = 1;
                        if (!page) {
                                page = alloc_page(GFP_HIGHUSER);
                                if (unlikely(!page)) {
                                        ret = ret ? : -ENOMEM;
                                        break;
                                }
                                pipe->tmp_page = page;
                        }
......
                        pipe->nrbufs = ++bufs;
                        pipe->tmp_page = NULL;
                        total_len -= chars;
                        if (!total_len)
                                break;
                }
......
                pipe_wait(pipe);
......

As above, kernel will allocate a page if new operation of write comes and pipe has not enough space. Every time it add a page, it increase the ‘pipe->nrbufs’, and if the ‘nrbufs’ is great than PIPE_BUFFERS, the routine will be blocked, which means the system-call of write() will be waiting. The ‘PIPE_BUFFERS’ is setted to 16, and a page in linux kernel is 4KB, so a pipe in ali_kernel can store 64KB (16 * 4KB) data at one time.
This condition has changed since kernel version of 3.6.35, which add a new proc entry in ‘/proc/sys/fs/pipe-max-size’.

Problems about using zookeeper

Problem 1:
The zookeeper cluster is running well for half year a year. But today, after I re-configurate it and run command

zkServer.sh start-foreground

It failed to startup and report

log4j:WARN No appenders could be found for logger (org.apache.zookeeper.server.quorum.QuorumPeerConfig).
log4j:WARN Please initialize the log4j system properly.
Invalid config, exiting abnormally

The point is the last term “Invalid config”(log4j is just warning); therefore I reviewed zoo.cfg many times but finding no mistake utterly.
After checking all configurations, I eventually find out the problem: the file “myid” missed. After adding the “myid” file, zookeeper startup correctly.

echo [hostname or ip] > /var/log/zookeeper/myid (The path is 'dataDir' in zoo.cfg)

It seems the error log of zookeeper is misleading——it says the config file is invalid but the true reason is missing of a config file.
Problem 2:
For tolerating failure of four servers at most, we assumed that a five-servers zookeeper cluster will be enough. After learning of Paxos for a while, a problem occurs on me: the majority of five-servers-cluster is three-servers, how could zookeeper works to elect a new leader if more than two servers are down? So I do the test and find out that the zookeeper do fail to work if more than two servers are shutdown.
The correct number of zookeeper cluster which could tolerate failure of four servers is nine; because after four servers shutdown, the five survivors is also the majority of nine-server-cluster.