Monthly Archives: March 2016

The CPU usage of soft-irq will be counted into a process.

Redis usually runs as a single-process daemon, a perfect example of the UNIX philosophy: one process does one thing. But current servers have many CPU cores, so we need to launch many Redis processes to provide service. After running multiple redis-server processes on one server, I found that one redis-server daemon always cost more CPU than the others in the output of the “top” command.

I used perf to collect function samples in the different processes and noticed that some “softirq” functions had been called. Then I remembered that I hadn’t balanced the soft-irqs of the network card across different CPU cores. So I ran my script to balance the soft-irqs:

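# Pin each matching NIC queue interrupt to its own CPU core (run as root).
irqs=$(grep 'eth.*Tx' /proc/interrupts | awk -F: '{print $1}')
core=1                                # CPU bitmask in decimal: 1, 2, 4, ...
for i in $irqs; do
    affinity=$(printf '%x' "$core")   # smp_affinity expects a hex bitmask
    echo "$affinity" > "/proc/irq/${i}/smp_affinity"
    core=$((core * 2))                # shift the mask to the next core
done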
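The value written to smp_affinity is a hexadecimal CPU bitmask, and on machines with more than 32 cores the kernel expects it as comma-separated 32-bit groups, most significant group first. The following is only a sketch of how that mask could be computed for a single core; the helper name affinity_mask is an assumption of mine, not part of the script above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Build the hex bitmask that selects one CPU core, in the format accepted by
// /proc/irq/<n>/smp_affinity: comma-separated 32-bit groups, most
// significant group first. affinity_mask is a hypothetical helper name.
std::string affinity_mask(unsigned core) {
    assert(core < 4096);                   // arbitrary sanity limit for this sketch
    unsigned group = core / 32;            // which 32-bit group holds the bit
    char buf[16];
    std::snprintf(buf, sizeof buf, "%x", 1u << (core % 32));
    std::string mask = buf;
    for (unsigned g = 0; g < group; ++g)   // every lower group is all zeros
        mask += ",00000000";
    return mask;
}
```

For example, core 4 gives the mask 10 and core 32 gives 1,00000000.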

After that, the “top” output looks much better.

But I still have a question: why does the “top” command count the CPU usage of system soft-irqs against an innocent process? The answer is here: a soft-irq has no process of its own and usually runs on top of whichever process was interrupted on that CPU, so the kernel needs a “scapegoat” process to charge the CPU usage to.
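One way to check whether the soft-irq load is really spread out is to compare the softirq column of the per-CPU lines in /proc/stat before and after balancing. Below is a minimal sketch of parsing that column; the function name softirq_ticks is hypothetical, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Return the softirq tick count from one "cpuN ..." line of /proc/stat.
// Field order after the label: user nice system idle iowait irq softirq ...
// Returns -1 if the line has fewer than seven numeric fields.
long long softirq_ticks(const std::string &line) {
    std::istringstream in(line);
    std::string label;
    in >> label;
    assert(label.compare(0, 3, "cpu") == 0);   // this sketch expects a cpu line
    long long value = -1;
    for (int field = 0; field < 7; ++field) {
        if (!(in >> value)) return -1;         // line too short or malformed
    }
    return value;                              // seventh field is softirq
}
```

Reading /proc/stat line by line and calling this on each line that starts with "cpu" followed by a digit gives one counter per core; a core whose counter grows much faster than the rest is handling most of the soft-irqs.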

Avoid “page allocation failure” for linux kernel in big memory server

After adding pressure to a Key-Value cluster, I found many errors in dmesg:

[551336.912108]  [] ? dequeue_task+0x8e/0xb0
[551336.912114]  [] ? ext4_get_block+0x0/0x120 [ext4]
[551336.912118]  [] ? __do_fault+0xd0/0x530
[551336.912122]  [] ? copy_user_generic+0xe/0x20
[551336.912124]  [] ? handle_pte_fault+0x9c/0xba0
[551336.912131]  [] ? rwsem_down_failed_common+0x95/0x1e0
[551336.912134]  [] ? rwsem_down_read_failed+0x26/0x30
[551336.912137]  [] ? handle_mm_fault+0x23a/0x310
[551336.912142]  [] ? call_rwsem_down_read_failed+0x14/0x30
[551336.912145]  [] ? __do_page_fault+0x139/0x480
[551336.912149]  [] ? finish_task_switch+0x4f/0xe0
[551336.912152]  [] ? do_page_fault+0x3e/0xb0
[551336.912156]  [] ? page_fault+0x25/0x30
[552116.858565] swapper: page allocation failure. order:1, mode:0x20
[552116.858569] Pid: 0, comm: swapper Tainted: G           --------------- H  #1
[552116.858571] Call Trace:
[552116.858573]    [] ? __alloc_pages_nodemask+0x76a/0x8f0
[552116.858588]  [] ? dev_hard_start_xmit+0x303/0x570
[552116.858593]  [] ? kmem_getpages+0x62/0x170
[552116.858596]  [] ? fallback_alloc+0x1be/0x270
[552116.858599]  [] ? cache_grow+0x2d1/0x320
[552116.858602]  [] ? ____cache_alloc_node+0x99/0x160
[552116.858605]  [] ? kmem_cache_alloc+0x11b/0x1b0
[552116.858610]  [] ? sk_prot_alloc+0x48/0x1d0
[552116.858615]  [] ? sk_clone+0x22/0x2c0
[552116.858619]  [] ? inet_csk_clone+0x16/0xd0
[552116.858624]  [] ? tcp_create_openreq_child+0x60/0x490
[552116.858627]  [] ? tcp_v4_syn_recv_sock+0x6a/0x310
[552116.858630]  [] ? tcp_check_req+0x249/0x4d0
[552116.858633]  [] ? tcp_v4_do_rcv+0x398/0x470
[552116.858636]  [] ? tcp_v4_rcv+0x52a/0x8d0
[552116.858644]  [] ? bond_start_xmit+0xbb/0x5d0 [bonding]
[552116.858648]  [] ? ip_local_deliver_finish+0xdd/0x2d0
[552116.858651]  [] ? ip_local_deliver+0x98/0xa0
[552116.858653]  [] ? ip_rcv_finish+0x12d/0x440
[552116.858656]  [] ? ip_rcv+0x285/0x370
[552116.858659]  [] ? __netif_receive_skb+0x4bb/0x780
[552116.858662]  [] ? tcp4_gro_receive+0x5a/0xd0
......

It’s hard to understand the “page allocation failure” error because the memory capacity of our servers is very big. The failing request (order:1) asks for two physically contiguous pages, and the call trace shows it was made from the network receive path, where the kernel cannot sleep to reclaim cache. Looking at the output of the “free” command, I noticed that a large amount of memory was used to cache files. Maybe the “free” memory was too small, so the kernel could not get enough pages when it needed many at once.
But how do we reserve more “free” memory in the linux kernel? According to this article, we could modify “/proc/sys/vm/min_free_kbytes” to adjust the watermark of linux memory management, and the kernel will then try hard to keep enough memory “free”.

After changing “/proc/sys/vm/min_free_kbytes” to 1G, the errors became rare but still existed. Then I changed it to 4G, and this time there weren’t any errors in dmesg.
In conclusion, the default value of “min_free_kbytes” in the kernel is too small for this workload, so we’d better raise “min_free_kbytes” on machines with big memory.
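Note that min_free_kbytes is expressed in kilobytes, so “1G” and “4G” have to be converted before being written. A small sketch of that arithmetic, plus a helper to read the current setting, might look like this; both function names are my own and only illustrative.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Convert a size in GiB to the kilobyte value min_free_kbytes expects.
// gib_to_kbytes is a hypothetical helper name.
long long gib_to_kbytes(long long gib) {
    assert(gib >= 0);
    return gib * 1024 * 1024;
}

// Read the current setting from the given file; returns -1 if the file
// cannot be opened or does not contain a number.
long long read_min_free_kbytes(
        const std::string &path = "/proc/sys/vm/min_free_kbytes") {
    std::ifstream in(path.c_str());
    long long kb = -1;
    if (!(in >> kb)) return -1;
    return kb;
}
```

With this, 1G corresponds to 1048576 and 4G to 4194304. Writing the value back needs root and persists only until reboot unless it is also set through sysctl configuration.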

Too many “ext4-dio-unwrit” processes in system

After adding pressure to an application that writes a tremendous amount of data into an ext4 file system, we saw many “ext4-dio-unwrit” kernel threads in the “top” screen. Many people say this is normal, so I checked the ext4 source code in the 2.6.32 linux kernel.
Writing a file in the kernel begins with the write-back kernel thread, which calls generic_writepages() and then ext4_write_page():

ext4_write_page()
    --> ext4_set_bh_endio()
        --> ext4_end_io_buffer_write()
            --> ext4_add_complete_io()

Let’s look at ext4_add_complete_io():

/* Add the io_end to per-inode completed end_io list. */
void ext4_add_complete_io(ext4_io_end_t *io_end)
{
    struct ext4_inode_info *ei = EXT4_I(io_end->inode);
    struct workqueue_struct *wq;
    unsigned long flags;
    BUG_ON(!(io_end->flag & DIO_AIO_UNWRITTEN));
    wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
    spin_lock_irqsave(&ei->i_completed_io_lock, flags);
    if (list_empty(&ei->i_aio_dio_complete_list)) {
        io_end->flag |= DIO_AIO_QUEUED;
        queue_work(wq, &io_end->work);
    }
    list_add_tail(&io_end->list, &ei->i_aio_dio_complete_list);
    spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
}

It puts “io_end” into the work queue “dio_unwritten_wq”, which lives in ext4_sb_info. But where does “dio_unwritten_wq” come from? From fs/ext4/super.c:

static int ext4_fill_super(struct super_block *sb, void *data, int silent)
{
......
    EXT4_SB(sb)->dio_unwritten_wq = create_workqueue("ext4-dio-unwritten");
    if (!EXT4_SB(sb)->dio_unwritten_wq) {
        printk(KERN_ERR "EXT4-fs: failed to create DIO workqueue\n");
        goto failed_mount_wq;
    }
......

Oh, it is the “ext4-dio-unwritten” kernel thread! (“top” shows it as “ext4-dio-unwrit” because the kernel truncates thread names to 15 characters.) So the problem is solved: the application dirties pages in the file system cache, then the write-back kernel thread writes these dirty pages into the specific file system (ext4 in this article), and finally ext4 puts the io into the work queue “ext4-dio-unwritten”, where it waits to have its unwritten extents converted into written extents.
Therefore, if we don’t have unwritten extents in ext4 (for example, when just using the write() system call to append to a normal file), the “ext4-dio-unwritten” kernel threads will exist but will not use any CPU.
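The queueing rule in ext4_add_complete_io() is worth a second look: work is queued only when the per-inode list was empty, so several completed io_ends can share a single work item. The following is only a userspace model of that rule, not kernel code; InodeModel, its members, and the assumption that the worker drains the whole list in one pass are mine.

```cpp
#include <cassert>
#include <list>

// Userspace model (hypothetical names) of the per-inode completed-IO list.
struct InodeModel {
    std::list<int> completed;   // stands in for i_aio_dio_complete_list
    int work_queued;            // how many times queue_work() would be called

    InodeModel() : work_queued(0) {}

    // Mirrors the rule in ext4_add_complete_io(): queue work only if the
    // list was empty, then append the io_end to the list.
    void add_complete_io(int io_end_id) {
        if (completed.empty())
            ++work_queued;
        completed.push_back(io_end_id);
    }

    // Assumed worker behaviour: handle every pending entry in one pass and
    // return how many entries were handled.
    int drain() {
        int handled = static_cast<int>(completed.size());
        completed.clear();
        return handled;
    }
};
```

In this model, three completions arriving before the worker runs cause one queue_work() call, and the next completion after a drain causes another.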

Why you should update your gcc (and c++ library)

Consider the code below:

#include <cstddef>
#include <string>
#include <ext/hash_map>
using std::string;
struct string_hashfunc {
    size_t operator()(const string *a) const
    {
        size_t hash = 0;
        if(a == NULL) return 0;
        for (string::const_iterator it = a->begin(); it != a->end(); it++) {
            hash = hash * 31 + (*it);
        }
        return hash;
    }
};
typedef __gnu_cxx::hash_map<string *, string *, string_hashfunc> StringHash;
int main(void) {
    StringHash sh;
    string *a = new string("earth");
    string *b = new string("sky");
    sh[a] = b;
    for (StringHash::iterator it = sh.begin(); it != sh.end(); it++) {
        delete it->first;
        delete it->second;
    }
}

It compiles on CentOS 5 (gcc-4.1.2), but dumps core at runtime.

g++ test.cpp -o test -g2 -O0
./test
Segmentation fault

The gdb backtrace shows that the crash is in string_hashfunc::operator():

(gdb) bt
#0  0x00007ffff7b7c1e3 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::end() const () from /usr/lib64/libstdc++.so.6
#1  0x0000000000400fd0 in string_hashfunc::operator() (this=0x7fffffffe861, a=0x606620) at test.cpp:12
......

Let’s see the source code of “ext/hash_map” in /usr/include/c++/4.1.2/ext/hashtable.h:

  template<class _Val, class _Key, class _HF, class _ExK, class _EqK, class _All>
    _Hashtable_iterator<_Val, _Key, _HF, _ExK, _EqK, _All>&
    _Hashtable_iterator<_Val, _Key, _HF, _ExK, _EqK, _All>::
    operator++()
    {
      const _Node* __old = _M_cur;
      _M_cur = _M_cur->_M_next;
      if (!_M_cur)
        {
          size_type __bucket = _M_ht->_M_bkt_num(__old->_M_val);
          while (!_M_cur && ++__bucket < _M_ht->_M_buckets.size())
            _M_cur = _M_ht->_M_buckets[__bucket];
        }
      return *this;
    }

And in the implementation of _M_bkt_num():

      size_type
      _M_bkt_num_key(const key_type& __key) const
      { return _M_bkt_num_key(__key, _M_buckets.size()); }
      size_type
      _M_bkt_num(const value_type& __obj) const
      { return _M_bkt_num_key(_M_get_key(__obj)); }
      size_type
      _M_bkt_num_key(const key_type& __key, size_t __n) const
      { return _M_hash(__key) % __n; }

It uses _M_hash() to compute the bucket number of the key, and _M_hash() is actually string_hashfunc::operator(). The reason is clear now: when the iterator advances past the end of a bucket chain, it calls operator++() --> _M_bkt_num() --> _M_bkt_num_key() --> _M_hash() --> string_hashfunc::operator(), which dereferences the key that has already been freed by “delete it->first”.
What about a newer g++ and C++ library? Let’s write the same program on CentOS 7 (gcc-4.8.5), changing “ext/hash_map” to “unordered_map” (from the C++11 standard):

#include <cstddef>
#include <string>
#include <unordered_map>
using std::string;
struct string_hashfunc {
    size_t operator()(const string *a) const
    {
        size_t hash = 0;
        if(a == NULL) return 0;
        for (string::const_iterator it = a->begin(); it != a->end(); it++) {
            hash = hash * 31 + (*it);
        }
        return hash;
    }
};
typedef std::unordered_map<string *, string *, string_hashfunc> StringHash;
int main(void) {
    StringHash sh;
    string *a = new string("earth");
    string *b = new string("sky");
    sh[a] = b;
    for (StringHash::iterator it = sh.begin(); it != sh.end(); it++) {
        delete it->first;
        delete it->second;
    }
}

Then build it:

g++ test.cpp -o test -g2 -O0 -std=c++11
./test

Everything works normally because the new C++ library implementation follows the “_M_nxt” pointer to reach the next hash node instead of calling the hash function (see /usr/include/c++/4.8.5/bits/hashtable_policy.h).
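Even with the newer library, deleting a key that the container still holds depends on implementation details. A more defensive pattern, sketched below under my own assumptions, is to copy the pointers out, erase the node while the key is still alive, and only then free the strings. The names string_ptr_hash and free_all are hypothetical, and this block redefines StringHash so that it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hash a string pointer by the contents of the string it points to.
struct string_ptr_hash {
    std::size_t operator()(const std::string *s) const {
        return s ? std::hash<std::string>()(*s) : 0;
    }
};
typedef std::unordered_map<std::string *, std::string *, string_ptr_hash>
    StringHash;

// Free every key and value, erasing each node before its key is deleted.
// Returns the number of pairs that were freed.
std::size_t free_all(StringHash &sh) {
    std::size_t freed = 0;
    for (StringHash::iterator it = sh.begin(); it != sh.end(); ) {
        std::string *key = it->first;
        std::string *value = it->second;
        it = sh.erase(it);   // erase may hash the key, so it must still be alive
        delete key;
        delete value;
        ++freed;
    }
    return freed;
}
```

Because the key is never touched by the container after it has been deleted, this ordering does not rely on whether the iterator or erase() calls the hash function.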

Why you should update your gcc

Consider this c++ code:

#include <iostream>
#include <string>
#include <vector>
using std::vector;
using std::string;
typedef vector<string> StringVec;
void print(const StringVec &string_vec) {
    for (StringVec::iterator it = string_vec.begin(); it != string_vec.end(); it++) {
        std::cout << *it << std::endl;
    }
}
int main(void) {
    vector<string> string_vec;
    string_vec.push_back("hello");
    string_vec.push_back("world");
    print(string_vec);
}

I first compiled it on CentOS-5, where the gcc version is 4.1.2, and it reported:

test.cpp: In function ‘void print(const StringVec&)’:
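test.cpp:11: error: conversion from ‘__gnu_cxx::__normal_iterator<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >’ to non-scalar type ‘__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >’ requested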

Can anyone find the problem at first glance in this messy report? C++ template error messages have been terribly difficult to understand ever since I first used the language 9 years ago.
Then I tried to compile the source on CentOS-6 with gcc-4.4.6:

test.cpp: In function ‘void print(const StringVec&)’:
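test.cpp:11: error: conversion from ‘__gnu_cxx::__normal_iterator<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >’ to non-scalar type ‘__gnu_cxx::__normal_iterator<std::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >’ requested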

It looks almost the same. How about CentOS-7 with gcc-4.8.5?

test.cpp: In function ‘void print(const StringVec&)’:
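test.cpp:11:52: error: conversion from ‘std::vector<std::basic_string<char> >::const_iterator {aka __gnu_cxx::__normal_iterator<const std::basic_string<char>*, std::vector<std::basic_string<char> > >}’ to non-scalar type ‘std::vector<std::basic_string<char> >::iterator {aka __gnu_cxx::__normal_iterator<std::basic_string<char>*, std::vector<std::basic_string<char> > >}’ requested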
     for (StringVec::iterator it = string_vec.begin(); it != string_vec.end(); it++) {
                                                    ^

Aha, much better, as it tells us the exact position of the problem: because “string_vec” is a const reference, “string_vec.begin()” returns a “const_iterator”, which cannot be converted to an “iterator”.
To save time debugging C++ template code and enjoy life, please update your gcc.
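For completeness, here is one way the print function could be fixed: iterate with a const_iterator. This is a minimal sketch; the extra std::ostream parameter is my own addition so that the output can be captured, and is not part of the original program.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>   // only needed by the usage example below
#include <string>
#include <vector>

typedef std::vector<std::string> StringVec;

// A const vector only hands out const_iterator, so the loop variable must
// be a const_iterator as well. The stream parameter is an assumption added
// for this sketch; it defaults to std::cout.
void print(const StringVec &string_vec, std::ostream &out = std::cout) {
    for (StringVec::const_iterator it = string_vec.begin();
         it != string_vec.end(); ++it) {
        out << *it << '\n';
    }
}
```

Calling print(string_vec) writes each element on its own line to standard output; passing a std::ostringstream as the second argument collects the same text in memory instead.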

Robin on Linux
