Yearly Archives: 2015

Books I read in year 2015

The first book is about network hardware, like router, switcher. As a coder, I usually use servers on cloud, therefore haven’t see the real high performance routers (I have sought bare server, 1Gb switcher). This book open my eyes.
The second book is about how to build Datacenter. It’s really a work for architecture, not IT guys.
About two years ago, I worked with Mysql team in my company as a kernel developer. We have used PCIE-card of NAND and flashcache as our solution for Mysql to process hight throughput pressure. But util this year, I have read over the architecture of InnoDB Engine which is the most powerful and effective engine in Mysql. Actually, it’s not so difficult to have a overview of the InnoDB Engine in a book. But, it is still very hard to understand the code of it 🙂
I haven’t go to cinema to watch “The Martian” because I have read it in my Kindle on my commute everyday. It is really a sci-fi story for Geeks who like do research on Computer,Chemistry,Physics,etc. The only question I want to ask the author is:” How could you invent so much troubles on Mars to torture Mark Watney?”

Performance test for unikernels (Rumpkernel and OSv)

Unikernels are specialised, single-address-space machine images constructed by using library operating systems. The concept of Unikernel is very old (Since 1980s in embeded system), but become more and more popular in this cloud computing age for its portability and security.
In recent days, I tested two famous unikernel production: Rumpkernel and OSv by running redis in them.
1. Run redis in Rumpkernel (KVM)
Firstly, build rumpkernel and its environment as “https://github.com/rumpkernel/wiki/wiki/Tutorial%3A-Serve-a-static-website-as-a-Unikernel”, then

git clone https://github.com/rumpkernel/rumprun-packages.git
cd rumprun-packages/redis
make # we get "bin/redis-server" now
rumprun-bake hw_virtio ./redis.bin bin/redis-server # we get "redis.bin" now, so try to run it (make sure you have configured tap)
rumprun kvm -i -M 4096 \
        -I if,vioif,'-net tap,script=no,ifname=tap0'\
        -W if,inet,static,10.0.120.101/24 \
        -b images/data.iso,/data \
        -- ./redis.bin /data/conf/redis.conf

2. Run redis in OSv (KVM)
Firsty, build OSv by the tutorial of “https://github.com/cloudius-systems/osv/”, and the virbr0 network (as qemu/kvm usually do), then

vim apps/redis-memonly/GET # change the redis version to 3.0.2 (as same as redis in rumpkernel)
./scripts/build image=redis-memonly -j20
# run it (use only one cpu core, as rumpkernel)
sudo ./scripts/run.py  -nv -c 1

3. Run redis on host (centos 7 on bare hardware)

wget "http://download.redis.io/releases/redis-3.0.2.tar.gz"
tar xzf redis-3.0.2.tar.gz
cd redis-3.0.2
./configure
make -j20
./src/redis-server

4. Use benchmark tool to test it
I choose memtier_benchmark as the benchmark tool.
5. The test result

Request per second
unikernels

Latency (unit for Y axle: micro second)
unikernels

Looks the performance of OSv is better than Rumpkernel. But still, they all much slower than running on bare hardware. The bottle-neck in this test case is network, so may be we should find a way to bypass tap or bridge.

How to disable the creation of the rescue initramfs on CentOS 7

I was trying to install new linux kernel (4.4-rc5) in my CentOS 7 server. But when I run “sudo make install” it report

$sudo make install
sh ./arch/x86/boot/install.sh 4.4.0-rc5 arch/x86/boot/bzImage \
        System.map "/boot"
gzip: stdout: No space left on device
dracut: creation of /boot/initramfs-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee.img failed
cp: error writing '/boot/vmlinuz-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee': No space left on device
cp: failed to extend '/boot/vmlinuz-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee': No space left on device
grubby: error writing /boot/grub2/grub.cfg-: No space left on device
grubby: error writing /boot/grub2/grub.cfg-: No space left on device

The initramfs-0-rescue-XXX file occupied too much space in boot device. Then I found this article. But after adding dracut_rescue_image=”no” into /etc/dracut.conf, the problem was still exist.
Finally, I use

$grep dracut_rescue /usr/lib/dracut/* -rn
/usr/lib/dracut/dracut.conf.d/02-rescue.conf:1:dracut_rescue_image="yes"<

Therefore, the worked configuration item for dracut is in /usr/lib/dracut/dracut.conf.d/02-rescue.conf instead of /etc/dracut.conf on centos 7. The final solution is

$vim /usr/lib/dracut/dracut.conf.d/02-rescue.conf
#change "yes" to "no"
dracut_rescue_image="no"

“kmem_cache_create: duplicate cache XXX”

In my kernel module, firstly I wrote:

int alloc_device(const char *name, int number)
{
    char name[64];
    snprintf(name, sizeof(name), "worker%d", number);
    request_cache = kmem_cache_create(name, SECTOR_SIZE, 0, NULL, NULL);
    ......
}

In centos 7, this module works fine, but after port to centos 6, this kernel module reports:

kmem_cache_create: duplicate cache worker0
......

The key to this problme is in the implementation of kmem_cache_create() in 2.6.32 linux kernel (for centos 6):

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
        unsigned long flags, void (*ctor)(void *))
{
        ......
        cachep->ctor = ctor;
        cachep->name = name;
......

After creating a new pool, it only point to ‘name’, not strdup() a new one. But the ‘name’ in my kernel module is a temporary variable (in stack), so it considers the name is “duplicated”.
The correct code should like:

static char *names[64];
/* before calling alloc_device() */
names = kcalloc(NR_OF_DEVICE, 64, GFP_KERNEL);
......
int alloc_device(const char *name, int number)
{
    snprintf(names[number], 64, "worker%d", number);
    request_cache = kmem_cache_create(names[number], SECTOR_SIZE, 0, NULL, NULL);
    ......
}

But why the old kernel module did not report error in centos 7? Because in centos 7 the default memory allocator is SLUB, and in centos 6 it is SLAB. They have totally different implementation.

“Cache flush bypassed!” from fio

fio is a effective tool to test IO-performance of a block device (also file system).
Today, my colleague tell me fio has report “Cache flush bypassed!” which means all IO have bypass the device cache. But I can’t agree because the cache of a RAID card usually could only be change by specific tool (such as MegaCLi), but not a test tool.
By review the code of fio:

static int __file_invalidate_cache(struct thread_data *td, struct fio_file *f,
                                   unsigned long long off,
                                   unsigned long long len)
{
......
        } else if (f->filetype == FIO_TYPE_BD) {
                int retry_count = 0;
                ret = blockdev_invalidate_cache(f);
                while (ret < 0 && errno == EAGAIN && retry_count++ < 25) {
                        /*
                         * Linux multipath devices reject ioctl while
                         * the maps are being updated. That window can
                         * last tens of milliseconds; we'll try up to
                         * a quarter of a second.
                         */
                        usleep(10000);
                        ret = blockdev_invalidate_cache(f);
                }
                if (ret < 0 && errno == EACCES && geteuid()) {
                        if (!root_warn) {
                                log_err("fio: only root may flush block "
                                        "devices. Cache flush bypassed!\n");
                                root_warn = 1;
                        }
                        ret = 0;
                }

and the implementation for blockdev_invalidate_cache() is:

static inline int blockdev_invalidate_cache(struct fio_file *f)
{
        return ioctl(f->fd, BLKFLSBUF);
}

Therefore, the "Cache flush bypassed!" is not mean all IO will bypass the buffer of device, but actually means: "fio can't flush the cache of device, so let's ignore it".
If you want to disable the DRAM cache on the RAID card, the correct way is set cache policy of RAID card to "Write Through":

sudo /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -WT -Immediate -Lall -aALL

Use recv() instead of read() for socket

When we use read() to read data from a socket like:

ret = read(fd, buffer, sizeof(struct msg));

the read() may return ‘ret’ which is small than sizeof(struct msg) even the socket is not O_NONBLOCKING.
The correct way is:

ret = recv(fd, buffer, sizeof(struct msg), MSG_WAITALL);

Then, recv() will wait util all sizeof(struct msg) be read out.

China Linux Storage & Filesystem 2015 workshop (second day)

Zheng Liu from Alibaba lead the topic about ext4. The most important change in EXT-series filesystem this year is: ext3 has gone, people could only use ext3 by mount ext4 with special arguments in latest kernel (actually, in CentOS 7.0). Encrypt feature has complete in ext4.
Robin Dong (Yes, it’s me) from Alibaba give a presentation about cold storage (Slide is here). We develop distributed storage system based on a small open-source software called “sheepdog“, and modified it heavily to improve data recovery performance and make sure it could run in low-end but high-density storage servers.

Discussion in tea break

Yanhai Zhu from Alibaba (We have done so much works on storage) lead a topic about cache in virtual machines environment. Alibaba choose Bcache as code base to develop a new cache software.
Robin: Why Bcache? Why not flashcache?
Yanhai: I started my work on flashcache first, but flashcache is not profit to the product environment. First, flashcache is unfriendly to sequential-write. Second, it use hash data structure to distributed IO requests at beginning, which will split the cache data in multi-tenant environment. Bcache use B-tree instead of hash-table to store data, it’s better for our requirements.
They use radical write-back strategy on cache. It works very well because the cache sequentialize the write IOs and make backend easy to absorb the pressure peak.
The last topic is lead by Zhongjie Wu from Memblaze, a famous startup company in China on flash storage technology. It’s about NVDIMM, the most hot hardware technology in recent years. A NVDIMM is not expensive, it is only a DDR DIMM with a capacitance. Memblaze has develop a new 1U storage server with a NVDIMM and many flash cards. It contain their own developed OS and could use Fabric-Channel/Ethernet to connect to client. The main purpose of NVDIMM is to reduce latency, and they use write-back strategy(Surely).
The big problem they face with NVDIMM is CPU can’t flush data in its L1 cache to NVDIMM when whole server powers down. To solve this problem, Memblaze use write-combining in CPU multi-cores, it hurts the performance a little but avoid the data missing finally.

clsf2015

All the staff in this CLSF 2015

Articles from other attenders:
https://blogs.oracle.com/linuxkernel/entry/china_linux_storage_and_file1

China Linux Storage & Filesystem 2015 workshop (first day)

The first topic is lead by Haomai Wang from XSKY. He first introduce some basic concepts about Ceph and I did catch this opportunity to ask some questions.
Robin(from Alibaba): Dose Ceph cache all meta-data information (may called “cluster map”) on monitor-nodes so client could fetch data by just one jump in network?
Haomai: Yes. One jump, and comes to the OSD.
Robin: If I use Cephfs, is it still one jump?
Haomai: Still one jump. Although we add MDS in Cephfs, the MDS does not store data or meta-data of filesystem but only to store the context of distributed lock.
Ceph also support samba2.0/3.0 now. In linux, it is recommend to use iSCSI to access ceph storage cluster because it will have to update kernel in clients if we use rbd/libceph kernel modules. Ceph use pipeline model in message processing therefore it is good to Hard Disk but not SSD. In the future, developers will use async-framework (such as Seastar) to refactor the ceph.
Robin: If I use three replications in ceph, will the client write three copies concurrently?
Haomai: No. Firstly, the IO will come to the primary OSD, and then the primary OSD will issue two other replicated IOs to other two OSDs, waiting until the two IOs back, and return to client “the IO is success”.
Robin: Ok, now we still have two jumps….Is it difficult to change OSD to write at the same time so we can make the latency of ceph low?
Haomai: That will not be easy. Ceph use primary OSD to make sure the consistent of writing transaction.
The future developing plan for ceph is de-duplication on pool level. Coly Li(from Suse) said that de-duplication is better to be made on business level instead of block level because the duplicated information has be split in block level. But the developers in ceph community looks still want make ceph to be omnipotent.

Discussion about ceph

Jiaju Zhang from Redhat lead the topic about use cases of ceph in enterprises. Ceph has become the most famous open source storage software around the world and also be used in Redhat/Intel/Sandisk(Low-end Storage Array)/Samsung/Suse.
Next topic is about ScyllaDB and Seastar. Asias He from OSv lead this topic. ScyllaDB is a distributed Key/Value store engine which is written in C++14 code and completely compatible to Cassandra. It could also run CQL (Cassandra Query Language). In the graph, ScyllaDB is 40 times more faster than Cassandra. The asynchronous developing framework in ScyllaDB is called Seastar.
Robin: What’s the magic in ScyllaDB?
Asias: We shard requests to every CPU core, and run with no locks/no threads. Data is zero-copy and use bi-direction queue to transfer messages between cores. The test result is base on kernel TCP/IP network stack but we will use our own network stack in our future.
Yanhai Zhu(from Alibaba): I think the test you do is not fair enough: ScyllaDB is designed to be run in multi-cores but Cassandra is not. You guys should run 24 Cassandra instances to compare with ScyllaDB, not just one.
Asias: May be you are right. But ScyllaDB use message queues to transfer messages between CPU cores, so it avoid atomic-operation and lock-operation cost. And, Cassandra is written by Java, which means the performance will be low when the JVM do garbage- collection. ScyllaDB is written completely by c++ so its performance is much steady.
Last topic today is lead by Xu Wang, the CTO of Hyper ( A startup company in china, works on how to run container like VM).
Hyper means “hypervisor” addd “docker image”. Customers could run docker image on Xen/KVM/Virtualbox now.

clsf2015

The guy on the right side is Xu Wang

Run docker on centos6

Docker use thin-provision of device mapper as its default storage, therefore if we wan’t run docker on centos6, we should update kernel first. I use linux kernel 4.11 and notice these kernel options should be set:

CONFIG_DM_THIN_PROVISIONING=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_MASQUERADE_IPV4=m
CONFIG_MEMCG=y
CONFIG_IP_NF_TARGET_MASQUERADE

After build and reboot the kernel, I still can’t launch docker service, and finally find out the solution:

sudo route del -net 172.16.0.0 netmask 255.240.0.0

A example of Mesos Python Framework to calculate Pi

I have written an example of Mesos Framework by python. It simply calculate “Pi” by using Mento-Carlo algorithm
The whole source code is at https://github.com/RobinDong/mesos-python-examples/tree/master/calculate_pi
At beginning, I use python threading in “launchTask()”:

class PiExecutor(mesos.interface.Executor):
    def launchTask(self, driver, task):
        def run_thread():
        ......
        thread = threading.Thread(target=run_thread)
        thread.start()

But I found out that the executors only spend a small part of CPU resource in slave machines (about 100%~150%, which is too low in a 8-cores server). First, I thought it may be limited by cgroup, but later, the answer is revealed: multi-threaded python application is actually single thread because of GIL.
With no choice, I have to change my code: the thread will launch a new process, then the new process will calculate the result and finally return the result to thread. It works fine in my Mesos cluster.