Wrong ‘struct timeval’ for setsockopt()

What if we deliberately use ‘struct timeval’ in this incorrect way to set the receive timeout to 3 seconds:
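A minimal sketch of the mistake (the original snippet is not preserved here; the AF_INET socket and error handling are just placeholders):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    int main(void)
    {
        int sockfd = socket(AF_INET, SOCK_STREAM, 0);
        struct timeval tv;

        /* WRONG: "3 seconds" written only in microseconds */
        tv.tv_sec  = 0;
        tv.tv_usec = 3 * 1000000;   /* 3000000 is out of the valid range */

        /* the correct way would be: tv.tv_sec = 3; tv.tv_usec = 0; */

        if (setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
            perror("setsockopt");   /* fails with errno == EDOM on Linux */

        return 0;
    }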

setsockopt() will fail and return -1.

Let’s look at the Linux kernel code for the system call sys_setsockopt():
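An abridged sketch of the syscall entry, based on net/socket.c in the 3.x/4.x kernels (the LSM hook and some error handling are omitted, and the exact code differs between versions):

    SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
            char __user *, optval, int, optlen)
    {
        int err, fput_needed;
        struct socket *sock;

        if (optlen < 0)
            return -EINVAL;

        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (sock != NULL) {
            /* ... security hook omitted ... */
            if (level == SOL_SOCKET)
                err = sock_setsockopt(sock, level, optname, optval, optlen);
            else
                err = sock->ops->setsockopt(sock, level, optname, optval, optlen);
            fput_light(sock->file, fput_needed);
        }
        return err;
    }

For SOL_SOCKET options such as SO_RCVTIMEO, the call goes to sock_setsockopt().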

sock_setsockopt() will invoke sock_set_timeout(), and sock_set_timeout() looks like this:
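Here is the relevant part, abridged from net/core/sock.c (again, the exact code differs a little between kernel versions):

    static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
    {
        struct timeval tv;

        if (optlen < sizeof(tv))
            return -EINVAL;
        if (copy_from_user(&tv, optval, sizeof(tv)))
            return -EFAULT;
        if (tv.tv_usec < 0 || tv.tv_usec >= USEC_PER_SEC)
            return -EDOM;   /* <-- our "wrong" timeval hits this check */

        /* ... otherwise tv is converted to jiffies and stored in *timeo_p ... */
        return 0;
    }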

That’s it. If ‘tv.tv_usec’ is greater than or equal to USEC_PER_SEC (which equals 1000000), sock_set_timeout() returns -EDOM and setsockopt() fails.

DCTC 2016 conference

Yesterday I attended DCTC (Data Center Technology Conference) 2016 in Beijing. Although it is called “Data Center Technology”, most of the topics were about storage, because the conference was hosted by Memblaze, a well-known flash-storage startup in China.
Xuebing Yin, the CEO of Memblaze, gave the first talk:




2016 is an important year for flash storage, because SSD revenue exceeded hard-disk revenue for the first time. As we can see, data centers will become all-silicon in the near future (the hard disk is the only non-silicon component left in servers). As SSDs and their interfaces (from SATA to PCIe) get faster and faster, a lot of old software becomes the performance bottleneck: MySQL 5.6 cannot use up the performance of a high-speed SSD, but MySQL 5.7 can.

Janene Ellefson from the NVMe organization explained why we need a standard for high-speed data transfer.




With up to 64K queues and 64K commands per queue, NVMe is definitely the most powerful protocol for modern (and future) I/O devices.

Xin Wu from GBase introduced the problems they face when using SSDs for databases.




GBase is a series of database products for OLTP/OLAP and global data storage. SSDs benefit the OLTP applications, but for OLAP applications SSDs are too expensive, because a hard-disk array can provide the same bandwidth. Maybe that is why AWS released a new type of EBS a few months ago.

Coly Li (yes, my old friend ^_^) from SUSE Labs showed us the improvements in Linux software RAID in recent years. Many years ago, Linux software RAID was only used with low-speed hard disks, so the cost of an inefficient software implementation was not significant. But recently the wide use of SSDs has exposed many bottlenecks in software RAID, and open-source developers have committed many patches to improve its performance. Many of those patches came from Shaohua Li, an experienced kernel developer (he worked first for Intel, then Fusion-io, and now Facebook).



During the tea break, I visited the Memblaze exhibition.




This is a 1U server built by SuperMicro; it uses eight NVMe SSDs through SFF-8639 interfaces in the front. Even with MySQL running at full speed (4600% CPU usage), the air blowing out of the back was still not hot. It looks like the SuperMicro server is very efficient, and cool 🙂

Some tips about Hive

Here are some tips about Hive that I found while learning it:

1. When I started “bin/hive” for the first time, it reported errors like these:

The solution is simple:
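Assuming the errors were the usual first-run complaint about an uninitialized Derby metastore (the exact messages are not reproduced here), one common fix is to initialize the metastore schema with Hive’s schematool before starting the CLI:

    bin/schematool -dbType derby -initSchema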

Actually, we had better use MySQL instead of Derby as the metastore database in a multi-user environment.

2. Control the number of mappers for SQL jobs. If a SQL job uses too many mappers, the process context switches (including frequently launching and stopping JVMs) cost extra CPU. We could use a setting like the sketch below to change the number of mappers for all the SQL jobs.
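The exact property from the original post is missing; one common knob, assuming Hadoop 2.x with Hive’s default CombineHiveInputFormat, is the input split size, which decides how much data each mapper gets (the values below are only examples):

    -- a bigger split packs more data into each mapper, so fewer mappers run
    SET mapreduce.input.fileinputformat.split.maxsize=268435456;
    SET mapreduce.input.fileinputformat.split.minsize=268435456;

Putting such settings into hive-site.xml (or a per-user .hiverc file) applies them to every session instead of a single one.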

3. After I imported 1 TB of data into an ORC-format table, the size of the table was just 250 GB. But after I imported the same 1 TB of data into a Parquet-format table, the size was 900 GB. It looks like Apache ORC has a more effective compression scheme for our kind of data.
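For reference, a hypothetical sketch of such a comparison (the real table schema is not shown in the post; “logs_staging” stands for the source table holding the 1 TB of data):

    CREATE TABLE logs_orc     (id BIGINT, msg STRING) STORED AS ORC;
    CREATE TABLE logs_parquet (id BIGINT, msg STRING) STORED AS PARQUET;

    -- load the same source data into both, then compare their sizes on HDFS,
    -- e.g. with: hadoop fs -du -s -h /user/hive/warehouse/logs_orc
    INSERT INTO TABLE logs_orc     SELECT * FROM logs_staging;
    INSERT INTO TABLE logs_parquet SELECT * FROM logs_staging;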

4. Use partitions carefully.

Suppose we have a table named “users” which is partitioned by the field “ca”:
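A hypothetical definition (the real columns are not shown in the post; a single “name” column is used here):

    CREATE TABLE users (name STRING)
    PARTITIONED BY (ca STRING);

    -- put one record into the ca='China' partition (INSERT ... VALUES needs Hive 0.14+)
    INSERT INTO TABLE users PARTITION (ca='China') VALUES ('Alice');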

Now there is a record in the HDFS directory “/user/hive/warehouse/users/ca=China/”.
In the book <>, it says we could copy the data of a partition directory to AWS S3 and then point the partition at it. But what if I point the partition at a new, empty HDFS directory? Let’s try:
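A sketch of the experiment (the namenode address and the empty directory path are placeholders for your own cluster):

    -- point the partition at a new, empty HDFS directory
    ALTER TABLE users PARTITION (ca='China')
    SET LOCATION 'hdfs://namenode:8020/tmp/empty_dir';

    -- no rows come back now
    SELECT * FROM users WHERE ca='China';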

Because the partition now points to an empty directory, the SELECT cannot find any records. That is what “schema on read” means.

5. Debug.
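Assuming the missing command here was the standard debug switch, we can start Hive with the root logger set to DEBUG on the console:

    bin/hive --hiveconf hive.root.logger=DEBUG,console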

This will print a lot of debug information, which helps to find the cause of problems such as: