Enable audit log for AWS Redshift

When I was trying to enable the audit log for AWS Redshift, I chose to use an existing S3 bucket. But it reported an error:

"Cannot read ACLs of bucket redshift-robin. Please ensure that your IAM permissions are set up correctly."
"Service: AmazonRedshift; Status Code: 400; Error Code: InsufficientS3BucketPolicyFault ...."




According to this document, I need to change the permissions of the bucket "redshift-robin". So I entered the S3 console, clicked the bucket name "redshift-robin" in the left panel, and saw the description of its permissions:



Press "Add Bucket Policy", and in the pop-up window, press "AWS Policy Generator". Here comes the generator, which is easy to use for creating policies.
Add two policies for "redshift-robin":


"902366379725" is the Redshift account ID for the us-west-2 region (Oregon).

Click "Generate Policy", and copy the generated JSON to "Bucket Policy Editor":



Press "Save". Now we can enable the Redshift audit log for the bucket "redshift-robin":
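For anyone who prefers scripting this instead of clicking through the console, here is a rough boto3 sketch that applies an equivalent bucket policy and turns on logging. The bucket name and account ID are the ones above; the cluster identifier, the S3 key prefix, and the exact principal form ("user/logs") are my assumptions, so double-check them against the AWS documentation for your region.

```python
import json
import boto3

BUCKET = "redshift-robin"
# Assumption: the principal follows the "<redshift-account-id>:user/logs" form
# from the AWS docs; 902366379725 is the Redshift account for us-west-2.
REDSHIFT_LOG_PRINCIPAL = "arn:aws:iam::902366379725:user/logs"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # let Redshift write log files into the bucket
            "Effect": "Allow",
            "Principal": {"AWS": REDSHIFT_LOG_PRINCIPAL},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {   # let Redshift read the bucket ACL (the check that failed above)
            "Effect": "Allow",
            "Principal": {"AWS": REDSHIFT_LOG_PRINCIPAL},
            "Action": "s3:GetBucketAcl",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

s3 = boto3.client("s3", region_name="us-west-2")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))

# "my-cluster" and the key prefix are placeholders for your own values.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.enable_logging(ClusterIdentifier="my-cluster",
                        BucketName=BUCKET,
                        S3KeyPrefix="audit/")
```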


DCTC 2016 conference

Yesterday I attended DCTC (Data Center Technology Conference) 2016 in Beijing. Although it is called "Data Center Technology", most of the topics were about storage, because the conference is held by Memblaze, a famous flash-storage startup in China.
Xuebing Yin, the CEO of Memblaze, gave the first talk:




2016 is an important year for flash storage because the revenue of SSDs exceeded that of hard disks for the first time. As we can see, data centers will become all-silicon in the near future (the hard disk is the only non-silicon component in servers). As SSDs and their interfaces (from SATA to PCIe) become faster and faster, many old pieces of software become the performance bottleneck: MySQL 5.6 can't saturate a high-speed SSD, but MySQL 5.7 can.

Janene Ellefson from the NVMe organization introduced why we need a standard for high-speed data transfer.




With up to 64K queues and 64K commands per queue, NVMe is definitely the most powerful protocol for modern (and future) IO devices.

Xin Wu from GBase introduced the problems they face when using SSDs for databases.




GBase is a series of database products for OLTP/OLAP and global data storage. SSDs benefit OLTP applications, but for OLAP applications SSDs are too expensive, because a hard-disk array can provide the same bandwidth. Maybe that's why AWS released a new type of EBS a few months ago.

Coly Li (yes, my old friend ^_^) from SUSE Labs showed us the improvements to Linux software RAID in recent years. Many years ago, Linux software RAID was used only with low-speed hard disks, so the cost of a poor software implementation was not significant. But recently, the wide use of SSDs has exposed many bottlenecks in software RAID, and developers in the open source community have committed many patches to improve performance. Many of those patches came from Shaohua Li, a seasoned kernel developer (he worked first for Intel, then Fusion-io, and now Facebook).



During the tea break, I visited Memblaze's exhibition.




This is a 1U server built by Supermicro; it uses eight NVMe SSDs through SFF-8639 interfaces in the front. Running MySQL at full speed (4600% CPU usage), the air blowing out of the back is still not hot. It looks like the Supermicro server is very efficient, and cool 🙂

C++/Java developers needed

I have worked in Alibaba Group for more than 9 years. Recently I have been working in Alimama, a subsidiary of Alibaba Group and the biggest advertisement publishing company in China. At present, we need C++/Java developers to build new back-end infrastructure services for our new business.

[Job Description]

Role: C++/Java developer for storage systems or high-performance computing

Location: Beijing

Your responsibilities:

1. Building and optimizing a distributed key-value storage system
2. Building and optimizing a distributed computing engine for the linear regression algorithm
3. Building and maintaining the back-end services for the advertisement publishing system

Skills & experience required:

1. Familiar with storage systems or high-performance computing systems
2. Strong background with Redis/RocksDB/Hadoop/GlusterFS
3. Very familiar with at least one of C/C++/Java/Scala
4. More than 3 years of experience as a developer of storage systems or HPC
5. Passionate about new technologies and wanting to continuously push the boundaries

Anyone who is interested in the job above can send an email to me at haodong@alibaba-inc.com

Book notes about “Amazon Redshift Database Developer Guide”

Although I have been familiar with cloud computing for many years, I haven't looked inside many of the services provided by Amazon Web Services, because my company (Alibaba) has its own cloud platform, Aliyun, so we are only allowed to use home-grown cloud products such as ECS (like EC2 in AWS), RDS (like RDS in AWS), and ODPS (like EMR in AWS).

These days I have been reading some sections of the "Amazon Redshift Database Developer Guide" on my Kindle during my commute.

Amazon Redshift is built on PostgreSQL, which is not very popular in China but pretty famous in Japan and the USA. The book says that primary keys and foreign keys are only informational and that constraints are not enforced at all. I guess Redshift distributes the rows of every table across different servers in the cluster, so enforcing constraints is almost impossible.

Columnar storage is used in Redshift because it is a perfect fit for OLAP (OnLine Analytical Processing), where users tend to retrieve or load tremendous numbers of records. Column-oriented storage is also well suited to compression and saves a huge amount of disk space.
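To get a feel for why column orientation compresses so well, here is a tiny Python experiment (the table and values are made up purely for illustration; this is not how Redshift actually encodes data):

```python
import random
import zlib

random.seed(42)

# A fake "orders" table: (order_id, country, status).
countries = ["CN", "US", "JP"]
statuses = ["SHIPPED", "PENDING", "RETURNED"]
rows = [(i, random.choice(countries), random.choice(statuses))
        for i in range(100_000)]

# Row-oriented: values of different columns are interleaved on disk.
row_bytes = "\n".join(f"{i},{c},{s}" for i, c, s in rows).encode()

# Column-oriented: each column is stored contiguously, so similar
# values sit next to each other and compress much better.
col_bytes = b"".join(
    "\n".join(str(v) for v in col).encode()
    for col in zip(*rows)
)

print("row-oriented compressed size:   ", len(zlib.compress(row_bytes)))
print("column-oriented compressed size:", len(zlib.compress(col_bytes)))
```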

The interesting thing is that the architectures of Amazon Redshift and Greenplum look very similar: both distribute the rows, and both use PostgreSQL as the back-end engine. Greenplum was open-sourced recently, which makes it much easier for ordinary users to build a private OLAP platform. This leads to a new question for me: if users can build a private cloud on their bare-metal servers very easily (with software such as OpenStack, OpenShift, Mesos, Greenplum, etc.), is it still necessary to build their services and store their data in a public cloud? Or will the only value of the public cloud be maintaining and managing a large number of bare-metal servers?

China Linux Storage & Filesystem 2015 workshop (second day)

Zheng Liu from Alibaba led the topic about ext4. The most important change in the ext-series filesystems this year is that ext3 is gone; people can only get ext3 behavior by mounting ext4 with special arguments in the latest kernel (actually, in CentOS 7.0). The encryption feature in ext4 has been completed.

Robin Dong (yes, it's me) from Alibaba gave a presentation about cold storage (the slides are here). We developed a distributed storage system based on a small open-source project called "sheepdog", and modified it heavily to improve data recovery performance and make sure it could run on low-end but high-density storage servers.


clsf2015
Discussion during the tea break

Yanhai Zhu from Alibaba (we have done a lot of work on storage) led a topic about caching in virtual machine environments. Alibaba chose bcache as the code base for developing new cache software.

Robin: Why bcache? Why not flashcache?
Yanhai: I started my work with flashcache first, but flashcache is not suitable for our production environment. First, flashcache is unfriendly to sequential writes. Second, it uses a hash data structure to distribute IO requests, which splits the cached data in a multi-tenant environment. Bcache uses a B-tree instead of a hash table to store data, which is better for our requirements.

They use an aggressive write-back strategy on the cache. It works very well because the cache sequentializes the write IOs and makes it easy for the backend to absorb pressure peaks.

The last topic was led by Zhongjie Wu from Memblaze, a famous Chinese startup in flash storage technology. It was about NVDIMM, the hottest hardware technology in recent years. An NVDIMM is not expensive; it is only a DDR DIMM with a capacitor. Memblaze has developed a new 1U storage server with an NVDIMM and many flash cards. It contains their own OS and can use Fibre Channel/Ethernet to connect to clients. The main purpose of the NVDIMM is to reduce latency, and they use a write-back strategy (of course).

The big problem they face with the NVDIMM is that the CPU can't flush data from its L1 cache to the NVDIMM when the whole server loses power. To solve this problem, Memblaze uses write-combining across the CPU cores; it hurts performance a little but finally avoids data loss.


clsf2015
All the attendees of CLSF 2015

Articles from other attendees:

https://blogs.oracle.com/linuxkernel/entry/china_linux_storage_and_file1

China Linux Storage & Filesystem 2015 workshop (first day)

The first topic was led by Haomai Wang from XSKY. He first introduced some basic concepts about Ceph, and I caught the opportunity to ask some questions.

Robin (from Alibaba): Does Ceph cache all metadata (maybe called the "cluster map") on the monitor nodes so that a client can fetch data with just one hop over the network?
Haomai: Yes. One hop, and you reach the OSD.
Robin: If I use CephFS, is it still one hop?
Haomai: Still one hop. Although we add the MDS in CephFS, the MDS does not store data or filesystem metadata; it only stores the context of the distributed locks.

Ceph also supports Samba 2.0/3.0 now. On Linux, it is recommended to use iSCSI to access a Ceph storage cluster, because using the rbd/libceph kernel modules would require updating the kernel on the clients. Ceph uses a pipeline model for message processing, so it is good for hard disks but not for SSDs. In the future, developers will use an async framework (such as Seastar) to refactor Ceph.

Robin: If I use three replicas in Ceph, will the client write the three copies concurrently?
Haomai: No. First, the IO goes to the primary OSD, then the primary OSD issues two replicated IOs to the other two OSDs, waits until both come back, and returns "the IO succeeded" to the client.
Robin: OK, so we still have two hops…. Is it difficult to change the OSDs to write at the same time so we can lower the latency of Ceph?
Haomai: That will not be easy. Ceph uses the primary OSD to ensure the consistency of the write transaction.
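The two answers describe one design: the client computes placement by itself from the cached cluster map (so a read is one hop), while a write funnels through the primary OSD, which fans out the replicas and acknowledges the client only after they respond. Here is a toy Python sketch of that flow, using a plain hash as a stand-in for Ceph's real CRUSH algorithm (all names are made up):

```python
import hashlib

# A pretend "cluster map": just a list of OSD names the client has cached.
OSDS = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
REPLICAS = 3

def place(object_name: str) -> list:
    """Deterministically map an object to REPLICAS OSDs.

    Because the mapping is pure computation over the cached cluster map,
    the client needs no lookup hop: it can talk to the primary OSD directly.
    """
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    start = h % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

def write(object_name: str, data: bytes) -> None:
    primary, *replicas = place(object_name)
    # The client sends the write to the primary only (first hop) ...
    print(f"client -> {primary}: write {object_name} ({len(data)} bytes)")
    # ... the primary forwards it to the replicas (second hop) and acks the
    # client only after both replicas have answered.
    for osd in replicas:
        print(f"{primary} -> {osd}: replicate {object_name}")
    print(f"{primary} -> client: ack")

write("rbd_data.12ab.0000000000000001", b"hello")
```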

A future development plan for Ceph is deduplication at the pool level. Coly Li (from SUSE) said that deduplication is better done at the business level instead of the block level, because the duplicated information has already been split apart at the block level. But the developers in the Ceph community still seem to want to make Ceph omnipotent.


ceph
Discussion about ceph

Jiaju Zhang from Red Hat led the topic about use cases of Ceph in enterprises. Ceph has become the most famous open source storage software in the world and is used at Red Hat/Intel/SanDisk (low-end storage arrays)/Samsung/SUSE.

The next topic was about ScyllaDB and Seastar, led by Asias He from OSv. ScyllaDB is a distributed key/value store engine written in C++14 and completely compatible with Cassandra. It can also run CQL (Cassandra Query Language). In the graph, ScyllaDB is 40 times faster than Cassandra. The asynchronous development framework behind ScyllaDB is called Seastar.

Robin: What's the magic in ScyllaDB?
Asias: We shard requests to each CPU core and run with no locks and no threads. Data is zero-copy, and we use bidirectional queues to transfer messages between cores. The test results are based on the kernel TCP/IP network stack, but we will use our own network stack in the future.

Yanhai Zhu (from Alibaba): I think your test is not fair enough: ScyllaDB is designed to run on many cores but Cassandra is not. You guys should run 24 Cassandra instances to compare against ScyllaDB, not just one.
Asias: Maybe you are right. But ScyllaDB uses message queues to pass messages between CPU cores, so it avoids the cost of atomic operations and locks. And Cassandra is written in Java, which means performance drops when the JVM does garbage collection. ScyllaDB is written entirely in C++, so its performance is much steadier.
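As I understand Asias' description, the core idea is that every CPU core owns one shard of the data and all requests reach that shard as messages, so the data itself never needs a lock. A rough Python sketch of the pattern (Python's Queue still locks internally, which is exactly what Seastar's per-core lock-free queues avoid; all names here are mine):

```python
import threading
import queue

NUM_SHARDS = 4  # pretend "CPU cores"

# One inbox and one private dict per shard; only the owning worker
# ever touches its dict, so no lock protects the data itself.
shard_queues = [queue.Queue() for _ in range(NUM_SHARDS)]
shard_data = [dict() for _ in range(NUM_SHARDS)]

def shard_of(key: str) -> int:
    return hash(key) % NUM_SHARDS

def worker(shard_id: int) -> None:
    q, data = shard_queues[shard_id], shard_data[shard_id]
    while True:
        op, key, value, reply = q.get()   # message passing between "cores"
        if op == "put":
            data[key] = value
            reply.put("ok")
        elif op == "get":
            reply.put(data.get(key))

for i in range(NUM_SHARDS):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

def put(key, value):
    reply = queue.Queue()
    shard_queues[shard_of(key)].put(("put", key, value, reply))
    return reply.get()

def get(key):
    reply = queue.Queue()
    shard_queues[shard_of(key)].put(("get", key, None, reply))
    return reply.get()

put("user:1", "Robin")
print(get("user:1"))
```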

The last topic today was led by Xu Wang, the CTO of Hyper (a startup in China working on running containers like VMs).
Hyper means "hypervisor" plus "Docker image". Customers can run Docker images on Xen/KVM/VirtualBox now.


clsf2015
The guy on the right side is Xu Wang

Vault Conference 2015, in Boston

Vault Conference 2015


vault
Registration

vault
First talk, and the sponsors

vault
John Spray and his talk about development updates of Ceph

vault
Coly and I also gave a presentation about our cold storage project; the slides are here: Lambert: Achieve High Durability, Low Cost & Flexibility at Same Time — open source cold storage engine for ExaBytes data in Alibaba

vault
I appreciate the help from the open source community; it seems my spoken English is not bad 🙂

China Linux Storage & Filesystem 2014 workshop (second day)

The first topic on the second day of CLSF 2014 was about NFS, led by Tao Peng from PrimaryData. The NFS protocol has been updated to 4.2, and the main job of the NFS community is to implement on the server side many features (such as "server-side copy") that have long been available on local file systems.

nfs

Then the distributed software developer at Xiaomi, Liang Xie, introduced the basic infrastructure of Xiaomi Cloud Storage and reported some severe problems in the IO stack. The first problem is that heavy write pressure causes long latency on the ext4 file system, and the latency becomes short if the local file system is replaced by XFS.

Zheng Liu (from Alibaba): The journal implementation in XFS is better than in ext4. When a large number of write operations come to ext4, it has to write a checkpoint in the journal and flush the disk, which may take a long time. I think you could try the 'no journal' mode of ext4, which was developed by the Google guys.

Another problem is that Xiaomi wants to use the 'deadline' IO scheduler, but cgroup-based IO control cannot be used together with 'deadline'.

Coly Li (from Alibaba): I suggest you try tpps, which is a simple and effective IO scheduler in ali_kernel.

The next topic, about ext4, was held by Zheng Liu. This year, ext4 added no new features (maybe that's why it is so stable). In Google's Chrome OS, they want to store something like cookies for every user, so they need to add an encryption feature to ext4. We asked why Chrome OS does not use eCryptfs on top of ext4. Zheng Liu's answer: the requirement came from Google itself, so no one knows why. Ext4 also added the new options "sparse_super2" (to store the super block only at the beginning and the end of the ext4 groups) and "packed_meta_blocks" (to squeeze all the ext4 metadata into the beginning of the disk, mainly for SMR disks).

ext4

The last topic was about OSv, the most popular and cutting-edge OS at this conference. OSv is an operating system designed for virtual machines and cloud environments. It trims down the IO and network stacks, which makes it very fast and efficient. The JVM and some scripting languages (such as Python, Ruby, Node.js) can already run on OSv, so I think it could win a large part of the cloud market, since it can run Hadoop/Spark and many front-end web applications.

OSv

China Linux Storage & Filesystem 2014 workshop (first day)

CLSF (China Linux Storage & File System Workshop) is an effort to get local Linux kernel hackers together to share and exchange ideas. CLSF is an invitation-only workshop; in order to have effective communication, only a small group of people is invited. Most of the invitees are active upstream Linux kernel developers from China, focused on the I/O and storage subsystems.

CLSF 2014 was held in the office of Xiaomi, a famous consumer electronics company in China. Participants were mainly from Huawei, Fujitsu, Intel, Alibaba and other companies.

The first topic, led by Jiufei Xue from Huawei, was about ocfs2. Huawei was building their private cloud product on ocfs2, so over the last two years the kernel developers at Huawei have committed many fixes and new features to the ocfs2 community. This year, they added range locks to ocfs2, so users can lock not only a whole file but also a specific range of one file, which improves performance in a cluster when many clients read/write files at the same time.

ocfs2

F2FS is a new filesystem in the kernel. It targets devices like SD cards. It stores metadata at the beginning of the device (the random read/write performance at the beginning of an SD card is very good) and uses an indirect-block layout just like ext2. The reason for not using extents (like ext4) is the convenience of garbage-collecting NAND blocks. F2FS also merges many 'sync' operations into one, mainly for speed (the 'sync' operation on the SD card in a mobile phone is very slow).

Zeifan Li (from Huawei): If one day the SD card adds an FTL layer to its controller and works just like an enterprise SSD, will the advantage of F2FS disappear?
Ying Huang (from Intel): Let's look at this another way. If F2FS does the whole work of the SSD firmware, we could use a kernel file system on NAND flash directly, which would save a lot of money 🙂

f2fs

Bo Liu from Oracle held the topic about btrfs. New features in btrfs decreased this year, and the main work is fixing bugs. Using the standard kernel worker threads in btrfs caused a serious bug which could lose users' data. Bo Liu spent a long time reproducing the bug and, fortunately, fixed it at last.

btrfs

Coly (from Alibaba): I attended the Linux Plumbers Conference this year. In a presentation about Docker, the speaker complained that btrfs is the most unstable filesystem in the Linux kernel. The developers in the btrfs community may have to spend more time fixing bugs.

The last topic on the first day was held by an engineer from Memblaze (a Chinese counterpart of Fusion-io). They face many problems in building an All-Flash Array (AFA). In an AFA, the Linux system in the box has become the bottleneck of the whole IO path: too many interrupts cost too much CPU time; the implementation of sockets and the TCP/IP stack in Linux is too inefficient; context switches between processes also make IOPS unstable; even the filesystem itself spends too much time searching for files.

memblaze

Besides these problems, they also voiced a new viewpoint: SSDs will become a perfect device for storing cold data, as NVMe and PCM will become cheap very quickly. On this point, we began a long and heated discussion. If PCM comes into the market, the block layer and filesystems in the Linux kernel will become obsolete, and almost all the guys sitting in the meeting room will be out of work :). So maybe the time for the big change in the storage area is coming close.

clsf 2014