Book notes about “Amazon Redshift Database Developer Guide”

Although I have been familiar with cloud computing for many years, I haven’t looked inside many of the services provided by Amazon Web Services, because my company (Alibaba) has its own cloud platform, Aliyun, and we are only allowed to use home-made cloud products, such as ECS (like EC2 in AWS), RDS (like RDS in AWS), and ODPS (like EMR in AWS).

These days I have been reading some sections of “Amazon Redshift Database Developer Guide” on my Kindle during my commute.

Amazon Redshift is built on PostgreSQL, which is not very popular in China but quite famous in Japan and the USA. The book says that primary keys and foreign keys are informational only: the constraints are not enforced at all. I guess that since Redshift distributes the rows of every table across different servers in the cluster, enforcing such constraints would be almost impossible.

Redshift uses columnar storage because it is a perfect fit for OLAP (OnLine Analytical Processing), where users tend to retrieve or load tremendous numbers of records. Column-oriented storage is also well suited to compression and can save a colossal amount of disk space.
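The compression benefit can be demonstrated with a toy experiment (plain Python with zlib; the table and the two serialization layouts are made up for illustration):

```python
import zlib

# Toy table: 100,000 rows of (user_id, country, status).
# Column values repeat heavily, which is typical of analytical data.
rows = [(i, "CN" if i % 2 else "US", "active") for i in range(100_000)]

# Row-oriented layout: values of different columns are interleaved.
row_bytes = "".join(f"{i},{c},{s};" for i, c, s in rows).encode()

# Column-oriented layout: each column's values are stored contiguously.
col_bytes = "|".join([
    ",".join(str(i) for i, _, _ in rows),
    ",".join(c for _, c, _ in rows),
    ",".join(s for _, _, s in rows),
]).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))
print(f"row layout: {row_compressed} bytes, columnar layout: {col_compressed} bytes")
```

The columnar layout should compress noticeably better here, because each column is one long run of similar values, while the row layout keeps breaking those runs up with values from the other columns.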

The interesting thing is that the architectures of Amazon Redshift and Greenplum look very similar: both distribute rows across nodes, and both use PostgreSQL as the back-end engine. Greenplum was open-sourced recently, which makes it much easier for common users to build a private OLAP platform. This leads to a new question for me: if users can easily build a private cloud on their own bare-metal servers (with software such as OpenStack, OpenShift, Mesos, and Greenplum), is it still necessary to build their services and store their data in a public cloud? Or will the only value of the public cloud be maintaining and managing large amounts of bare-metal servers?

Use the “Any” type carefully in Scala

Consider the code below:
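The original snippet did not survive in my notes; a minimal Scala 2 reconstruction that shows the behavior described below (the body of ‘increment’ is an assumption) could be:

```scala
object Main {
  // The parameter type is Any, so it accepts (almost) anything
  def increment(x: Any): Unit = println(x)

  def main(args: Array[String]): Unit = {
    increment("hello")        // prints hello
    increment("world", 6, 7)  // expected a compile error, but it compiles
  }
}
```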

I expected to see a compiler error for the second call to ‘increment’ in the first place. But there wasn’t one: the compiler reported OK, and the program printed the three arguments packed together as one tuple.

The compiler recognized the three arguments “world”, 6, 7 as a single tuple (“world”, 6, 7), which is still a valid ‘Any’. So the correct argument type for the function ‘increment’ should be ‘String’:
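Again the original code is missing; the fixed version would look something like this:

```scala
object Main {
  // With the concrete parameter type String, auto-tupling no longer applies
  def increment(x: String): Unit = println(x)

  def main(args: Array[String]): Unit = {
    increment("hello")
    // increment("world", 6, 7) // now a compile error: too many arguments
  }
}
```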

Usage limitations of HDFS’s C API

I had to change a program written in C from writing local files to writing to HDFS. After studying the C API examples of libhdfs, I completed the modification from open()/write()/read() to hdfsOpenFile()/hdfsWrite()/hdfsRead() and so on. But when I ran the new program, many problems occurred. The first was: after fork(), I could not open HDFS files anymore. This problem looks quite common in the community and has no solution yet (most likely because libhdfs drives an embedded JVM through JNI, and the JVM’s state does not survive fork()).
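For reference, the shape of the conversion was roughly like the sketch below (the path and the data are made up; it needs a running Hadoop cluster and the proper CLASSPATH to actually execute):

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include "hdfs.h"   /* C API header shipped with libhdfs */

int main(void) {
    /* replaces open(2)/write(2)/close(2) with their HDFS equivalents */
    hdfsFS fs = hdfsConnect("default", 0);   /* namenode from the Hadoop config */
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    hdfsFile f = hdfsOpenFile(fs, "/tmp/my.db", O_WRONLY, 0, 0, 0);
    if (!f) { fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

    const char *buf = "hello hdfs";
    if (hdfsWrite(fs, f, buf, (tSize)strlen(buf)) < 0) {
        fprintf(stderr, "hdfsWrite failed\n");
        return 1;
    }

    hdfsFlush(fs, f);      /* flush the client-side buffer */
    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return 0;
}
```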
So I had to try the fuse-dfs tool instead. Following the steps in this article, I successfully built and ran fuse-dfs:

But something weird happened:

After fsync(), the size of the file “my.db” was still zero according to the “ls” command on the mountpoint “/data”! This caused the program to report an error and stop processing.
The reason is that fuse-dfs does not implement the fsync() hook of FUSE. After I added an implementation of fsync backed by hdfsHSync(), it works now. But the performance is too bad: about 10~20MB/s over the network.
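My change was essentially a FUSE fsync handler that delegates to hdfsHSync(). A sketch is below; the way the handles are recovered (the open file from fi->fh, the filesystem from the context’s private_data) follows the usual FUSE pattern and is an assumption, not the exact fuse-dfs internals:

```c
#include <errno.h>
#include <stdint.h>
#include <fuse.h>
#include "hdfs.h"

/* Hypothetical fsync handler: push the buffered data out to the datanodes
 * with hdfsHSync() so that the new file size becomes visible to readers. */
static int dfs_fsync(const char *path, int datasync, struct fuse_file_info *fi)
{
    hdfsFS fs = (hdfsFS)fuse_get_context()->private_data;
    hdfsFile file = (hdfsFile)(uintptr_t)fi->fh;

    (void)path;
    (void)datasync;
    if (hdfsHSync(fs, file) != 0)
        return -EIO;
    return 0;
}

/* Registered in the operations table:
 *   static struct fuse_operations dfs_oper = { ..., .fsync = dfs_fsync };
 */
```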
Consequently, I decided to use GlusterFS instead of HDFS, because it requires no modification of the user program at all and has supported erasure coding since version 3.6 (which dramatically reduces storage space consumption).

Network problem when installing docker-engine on CentOS 7

After installing docker-engine on CentOS 7, it failed to start by

After I use

it shows:

To see more details, I use

There was much more information now:

The docker0 bridge interface could not be assigned an IPv4 address. It seems many people have met this problem, according to my Google search, but none of their solutions worked for me. Eventually, I found a solution for my environment (totally remove the network bridge):
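The exact commands were lost from these notes; on CentOS 7 the bridge removal would look something like this (assuming the bridge-utils package provides brctl):

```shell
# stop the failed service before touching the bridge
systemctl stop docker

# bring down and delete the stale docker0 bridge
ip link set dev docker0 down
brctl delbr docker0

# docker-engine now re-creates docker0 and assigns it an IPv4 address
systemctl start docker
```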

and now my docker service starts up!

The key is: docker-engine will not assign an IP to a docker0 interface that already exists.