C++/Java developers needed

I have worked at Alibaba Group for more than 9 years. Recently I have been working at Alimama, a subsidiary of Alibaba Group and the biggest advertisement publishing company in China. At present, we need C++/Java developers to build new back-end foundation services for our new business.

[Job Description]

Role: C++/Java Developer for storage system or high performance computing

Location: Beijing

Your responsibilities:

1. Building and optimizing the distributed key-value storage system
2. Building and optimizing the distributed computing engine for the Linear Regression algorithm
3. Building and maintaining the backend services for the Advertisement Publishing System

Skills & experience required:

1. Familiar with storage systems or high-performance computing systems
2. Strong background with Redis/RocksDB/Hadoop/GlusterFS
3. Proficient in at least one of C/C++/Java/Scala
4. More than 3 years of experience as a developer on storage systems or HPC
5. Passionate about new technologies and eager to continuously push the boundaries

Anyone who is interested in the job above can send an email to me at haodong@alibaba-inc.com

The CPU usage of soft-irqs is counted against a process

Redis usually runs as a single-process daemon, a perfect example of the UNIX philosophy — one process does one thing. But modern servers have many CPUs (cores), so we need to launch many Redis processes to provide the service. After running multiple redis-server processes on a server, I noticed in the “top” command that there was always one redis-server daemon costing more CPU than the others:




I used perf to collect function samples in the different processes and noticed that some “softirq” functions had been called. Then I remembered that I hadn’t balanced the soft-irqs of the network card across the CPU cores. After running my script to balance the soft-irqs:
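A minimal sketch of such a balancing script (the NIC name and core count below are assumptions to adjust, and it must run as root):

```shell
#!/bin/sh
# Spread the NIC's IRQs across CPU cores round-robin by writing a
# one-CPU mask to each IRQ's smp_affinity file.
# NIC name and core count are assumptions -- adjust for the real machine.
NIC=eth0
CORES=8
i=0
for irq in $(grep "$NIC" /proc/interrupts | cut -d: -f1 | tr -d ' '); do
  cpu=$((i % CORES))
  mask=$(printf '%x' $((1 << cpu)))        # e.g. CPU 3 -> mask 8
  echo "$mask" > "/proc/irq/$irq/smp_affinity"
  echo "IRQ $irq -> CPU $cpu (mask $mask)"
  i=$((i + 1))
done
```

The irqbalance daemon can do a similar job automatically; pinning by hand just makes the distribution deterministic.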

The view of the “top” command looks much better now:



But I still had a question: why does the “top” command count the CPU usage of system soft-irqs against an innocent process? The answer is here: soft-irqs run in process context, so the kernel certainly needs to find a “scapegoat” process to charge the CPU usage to.

Performance bottleneck in Jedis

I wrote a test program that uses Jedis to read/write a Redis Cluster. It creates 32 threads, and every thread initializes an instance of JedisCluster(). But it cost more than half a minute to create all 32 JedisCluster instances.
By tracing the problem, I found out that the bottleneck is in setNodeIfNotExist():

In the method setNodeIfNotExist() of the class JedisClusterInfoCache, “new JedisPool()” costs a lot of time because it uses Apache Commons Pool, and commons-pool registers an MBean with the JMX MBean server. The JMX registration operation is the bottleneck.

The first solution for this problem is to disable JMX when calling JedisCluster():
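A sketch of that first solution, assuming Jedis 2.x, where JedisPoolConfig extends commons-pool2’s GenericObjectPoolConfig (the node address and timeout here are placeholders):

```scala
import java.util.Collections
import redis.clients.jedis.{HostAndPort, JedisCluster, JedisPoolConfig}

object NoJmxCluster {
  def build(): JedisCluster = {
    val poolConfig = new JedisPoolConfig()
    // Skip the JMX MBean registration that makes every "new JedisPool()" slow.
    poolConfig.setJmxEnabled(false)
    val nodes = Collections.singleton(new HostAndPort("127.0.0.1", 7000))
    new JedisCluster(nodes, 2000 /* timeout in ms */, poolConfig)
  }
}
```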

The second solution is to “create one JedisCluster() instance for all threads”. After I committed a patch for Jedis to disable JMX by default, Marcos Nils reminded me that JedisCluster() is thread-safe, since it uses commons-pool to manage the connection resources.

Using perf to see the details of stack frames

I used perf to profile the redis daemon:
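For example (a sketch — the sampling frequency and duration are arbitrary choices, and a single redis-server process is assumed):

```shell
# Sample the running redis-server for 30 seconds, then open the
# flat report (no call-graph yet).
perf record -F 99 -p "$(pidof redis-server)" -- sleep 30
perf report
```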

The report shows:

I could only see that the function “ziplistFind” is very hot in Redis, but I couldn’t see how the code path reaches “ziplistFind”. By searching around I found this article. So I used perf as:
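What was missing is call-graph recording; a sketch, using DWARF-based unwinding on the assumption that the optimized redis build omits frame pointers:

```shell
# -g records call chains, so the report shows the paths that reach
# hot functions such as ziplistFind instead of a flat list.
perf record -g --call-graph dwarf -p "$(pidof redis-server)" -- sleep 30
perf report --stdio
```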

The report now indicates more details of the program:

Import data to Redis from JSON format

Redis uses an rdb file to make its data persistent. Yesterday, I used redis-rdb-tools to dump the data from an rdb file to JSON format. After that, I wrote Scala code to read the data from the JSON file and put it into Redis.

First, I found that the JSON file was almost 50% bigger than the rdb file. After checking the whole JSON file, I made sure that the root cause was not the redundant symbols in JSON, such as braces and brackets, but the “Unicode transformation” in the JSON format, especially “\u0001” for the ASCII byte 0x01. Therefore I wrote code to replace it:
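A minimal sketch of that replacement (the JsonFix and unescape names are mine; the point is turning the six-character escape sequence back into a single byte):

```scala
// Turn the literal six-character JSON escape "\u0001" back into the
// raw 0x01 byte, shrinking the string by five bytes per occurrence.
object JsonFix {
  def unescape(line: String): String =
    line.replace("\\u0001", "\u0001")
}
```

Only \u0001 is handled here; a general-purpose tool would decode every \uXXXX escape.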

Then the size of JSON file became normal.

There was still another problem. To read the JSON file line by line, I use code from http://naildrivin5.com/blog/2010/01/26/reading-a-file-in-scala-ruby-java.html:

But this code loads all the data from the file before running “foreach”. As my file is bigger than 100 GB, it costs too much time and RAM ….
The better way is to use Java IO streams:
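A sketch of the streaming version, written in Scala on top of java.io (the LineReader name is my own):

```scala
import java.io.{BufferedReader, FileReader}

// Stream a file line by line; only the current line is held in memory,
// so a 100 GB file can be processed with constant RAM.
object LineReader {
  def eachLine(path: String)(f: String => Unit): Unit = {
    val reader = new BufferedReader(new FileReader(path))
    try {
      var line = reader.readLine()
      while (line != null) {
        f(line)
        line = reader.readLine()
      }
    } finally reader.close()
  }
}
```

Usage: `LineReader.eachLine("dump.json") { line => /* put it into Redis */ }`.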

This really does read the file “line by line”.

From Scala Array[String] / Seq[String] to Java varargs

While testing the performance of Redis these days, I needed to use the mset() interface of Jedis (a Java Redis client). But the prototype of mset() in Jedis is:

At first I wrote my Scala code like:

But it reported compile errors:

After searching many documents about Scala/Java on Google, I finally found the answer: http://docs.scala-lang.org/style/types.html. So, let’s write the code this way:
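A sketch of the fix — `jedis` stands for a connected client, and the same `: _*` ascription is demonstrated below against a stdlib Java varargs method:

```scala
// Jedis declares mset with Java varargs:
//   public String mset(final String... keysvalues)
// A Scala Array[String] is not passed as varargs automatically;
// the `: _*` type ascription tells the compiler to expand it.
object MsetVarargs {
  def main(args: Array[String]): Unit = {
    val kvs: Array[String] = Array("key1", "value1", "key2", "value2")
    // jedis.mset(kvs)      // compile error: type mismatch
    // jedis.mset(kvs: _*)  // OK: expanded into String varargs
    // The same ascription works for any Java varargs method:
    println(String.join(",", kvs: _*))
  }
}
```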

Now the Scala Array[String] is converted to Java varargs. This also works for Seq[String].

Performance test for unikernels (Rumpkernel and OSv)

Unikernels are specialised, single-address-space machine images constructed by using library operating systems. The concept of the unikernel is very old (it dates back to embedded systems in the 1980s), but it has become more and more popular in this cloud computing age for its portability and security.
In recent days, I tested two famous unikernel products, Rumpkernel and OSv, by running Redis in them.

1. Run redis in Rumpkernel (KVM)
First, build rumpkernel and its environment as described in “https://github.com/rumpkernel/wiki/wiki/Tutorial%3A-Serve-a-static-website-as-a-Unikernel”, then

2. Run redis in OSv (KVM)
First, build OSv by following the tutorial at “https://github.com/cloudius-systems/osv/”, and set up the virbr0 network (as qemu/kvm usually does), then

3. Run redis on the host (CentOS 7 on bare hardware)

4. Use a benchmark tool to test them
I chose memtier_benchmark as the benchmark tool.
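A sketch of a memtier_benchmark run — the target address and all load parameters below are assumptions, not the values of the original test:

```shell
# Drive a mixed 1:1 SET/GET load against the redis under test.
memtier_benchmark -s 192.168.122.10 -p 6379 -P redis \
  -t 4 -c 50 -n 100000 --ratio=1:1 -d 32
```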

5. The test result


Requests per second:


Latency (Y-axis unit: microseconds):

It looks like the performance of OSv is better than Rumpkernel’s. But still, both are much slower than running on bare hardware. The bottleneck in this test case is the network, so maybe we should find a way to bypass the tap or bridge.