Books I read in 2016

Here comes the last day of 2016, and it is also time for me to review my harvest of knowledge, or rather, of books.

Frankly speaking, the book “The Hard Thing About Hard Things” literally frightened me and caused me to give up any idea of joining a startup company in China. Maybe this was the best outcome, for many startup companies failed at the end of this year and I fortunately avoided the tempest.

Diving deeper into the ocean of the “Hadoop Ecosystem”, or “Big Data”, I found that Spark is a really convenient and powerful framework (compared to MapReduce) which can implement complicated algorithms or data flows in a few lines of code. Surely, Scala is also a key element of Spark’s efficiency and concision.

Today, even an ordinary person can imagine a sci-fi story about how modern people would fight alien invaders. But what would happen if aliens had attacked the Earth in ancient times? What about the Middle Ages? Here comes the funny and bold sci-fi novel “The High Crusade”: a group of medieval soldiers defeats the alien invaders, and does even more, occupying a frontier planet of a gigantic alien empire. It is really beyond my imagination 🙂

Build a dataflow to get the monthly top price of land trading in the UK

The dataset is downloaded from the UK government data website (the total data size is more than 3GB), and I am using Apache Oozie to run Hive and Sqoop jobs periodically.

The Hive script “land_price.hql”:
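A minimal sketch of what the script might look like (the table and column names, such as land_price_raw and land_price_monthly_top, are my assumptions, not the original):

```sql
-- run this job on the "root.default" YARN queue (see the note below)
SET mapred.job.queue.name=root.default;

-- the raw CSV wraps every field in double quotes, so strip them with SUBSTR()
INSERT OVERWRITE TABLE land_price_monthly_top
SELECT SUBSTR(county, 2, LENGTH(county) - 2)                    AS county,
       SUBSTR(transfer_date, 2, 7)                              AS trans_month,  -- "YYYY-MM"
       MAX(CAST(SUBSTR(price, 2, LENGTH(price) - 2) AS BIGINT)) AS top_price
FROM land_price_raw
GROUP BY SUBSTR(county, 2, LENGTH(county) - 2),
         SUBSTR(transfer_date, 2, 7);
```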

We want the Hive job to run on the “root.default” queue in YARN (and other jobs on “root.mr”), so we set “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to strip the quote character (") when importing data from the raw CSV file.

The “coordinator.xml” for Apache Oozie:
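A minimal sketch of a monthly coordinator (the application path, dates, and app name are placeholder assumptions):

```xml
<coordinator-app name="land-price-coord" frequency="${coord:months(1)}"
                 start="2016-01-01T00:00Z" end="2017-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- directory containing the workflow.xml shown below -->
      <app-path>${nameNode}/user/me/land_price</app-path>
    </workflow>
  </action>
</coordinator-app>
```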

The “workflow.xml” for Apache Oozie:
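A sketch of the workflow described in this post: a Hive action and a TeraSort step run in parallel, followed by a Sqoop export into MySQL. The node names, the TeraSort shell script, and the JDBC settings are placeholder assumptions.

```xml
<workflow-app name="land-price-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="fork-jobs"/>

  <!-- run the Hive job and TeraSort in parallel -->
  <fork name="fork-jobs">
    <path start="hive-job"/>
    <path start="terasort-job"/>
  </fork>

  <action name="hive-job">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>land_price.hql</script>
    </hive>
    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <action name="terasort-job">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>run_terasort.sh</exec>
      <file>run_terasort.sh</file>
    </shell>
    <ok to="join-jobs"/>
    <error to="fail"/>
  </action>

  <join name="join-jobs" to="sqoop-export"/>

  <!-- export the Hive result table into MySQL -->
  <action name="sqoop-export">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>export --connect jdbc:mysql://db-host/land --username land --password-file /user/me/.mysql_pass --table land_price_monthly_top --export-dir /user/hive/warehouse/land_price_monthly_top</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```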

We run two jobs in parallel here: Hive and TeraSort (TeraSort is not useful in a real production environment, but it serves as a good stand-in for a real internal job at my company).

Sqoop once reported the error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is to modify the file “/usr/lib/hadoop/bin/hadoop”.
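The exact change is not reproduced here; a common workaround (my assumption) is to force the JDK’s built-in XML parser near the top of that script, so a stale Xerces jar on the classpath is not picked up:

```bash
# workaround for the "xinclude feature is not recognized" error:
# force the JDK's built-in DocumentBuilderFactory instead of an old Xerces on the classpath
HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"
export HADOOP_OPTS
```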

“job.properties” for Oozie:
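A minimal sketch (host names and paths are placeholders):

```properties
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
queueName=root.mr
oozie.coord.application.path=${nameNode}/user/me/land_price/coordinator.xml
# let Oozie pick up the Hive and Sqoop sharelib jars
oozie.use.system.libpath=true
```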

Remember to set “oozie.use.system.libpath=true” so that Oozie can run the Hive and Sqoop jobs correctly.

The script to create the MySQL table:
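A sketch of the DDL, with column names matching the Hive output assumed above:

```sql
CREATE TABLE land_price_monthly_top (
    county      VARCHAR(64),
    trans_month VARCHAR(16),
    top_price   BIGINT
);
```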

After launching the Oozie coordinator, it will eventually put the resulting data into the MySQL table:


(Screenshot: MySQL query result)

It looks like the land price of “WOKINGHAM” in October 2015 was extremely high.

A problem running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported errors:

After changing “spark.master” from “yarn-cluster” to “local” and adding “--hiveconf hive.root.logger=DEBUG,console” to the hive command, it printed out details like:

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained the same even after I completed the replacement.
Then I found [HIVE-13301] in JIRA:


This explains everything clearly: Hive was using the jackson-databind-2.1.1 classes bundled in the calcite package instead of lib/jackson-databind-2.4.2.jar, so updating the latter had no effect.
Thus, we should remove the shaded jackson-databind-2.1.1 classes from calcite-avatica-1.5.0.jar:
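One way to do this (a sketch, assuming the bundled Jackson classes keep their original package path inside the jar):

```bash
# list the jackson-databind classes shaded into the avatica jar
unzip -l calcite-avatica-1.5.0.jar | grep 'com/fasterxml/jackson/databind' | head

# delete them so Hive falls back to lib/jackson-databind-2.4.2.jar
zip -d calcite-avatica-1.5.0.jar 'com/fasterxml/jackson/databind/*'
```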

Hive now uses lib/jackson-databind-2.4.2.jar and runs correctly.

“database is locked” in Hue

After launching a long-running HiveQL query in the SQL Editor of Hue, a small exception tip appears under the editor: “database is locked”. The solution is to make Hue use MySQL instead of SQLite3. But I am using Hue built directly from GitHub, not the Cloudera release version, so the correct steps are:

  1. Stop Hue server
  2. Install MySQL and create the database ‘hue’
  3. Edit desktop/conf/pseudo-distributed.ini and add the settings shown after this list in the “[[database]]” section
  4. Run “make apps” (this is the most important step, as it will install the MySQL connector packages automatically and create the meta tables in the ‘hue’ database)
  5. Start Hue server
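The “[[database]]” settings for step 3 look roughly like this (host, user, and password are placeholders):

```ini
engine=mysql
host=127.0.0.1
port=3306
user=hue
password=secret
name=hue
```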

Now we can run long-running queries and there will be no errors.

(Screenshot: Hue SQL Editor)

Deploy Hive on Spark

The MapReduce framework is too slow for real-time analytic queries, so we need to change the execution engine of Hive from “mr” to “spark” (link):

1. Set the environment for Spark:
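A sketch of step 1 (the install paths are assumptions; for Spark 1.x, Hive also needs to see the spark-assembly jar):

```bash
export SPARK_HOME=/opt/spark-1.6.2
export PATH=$SPARK_HOME/bin:$PATH
# make the Spark assembly visible to Hive
ln -s $SPARK_HOME/lib/spark-assembly-*.jar $HIVE_HOME/lib/
```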

2. Copy the configuration XML files for Hive:
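A sketch of step 2, assuming we start from the stock template and share the result with Spark:

```bash
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
# let Spark see the Hive configuration as well
cp hive-site.xml $SPARK_HOME/conf/
```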

and change these configuration items:
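The changed items are not listed in the original; based on this post (engine “spark”, master “yarn-cluster”), they presumably include at least:

```xml
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
```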

Notice: remember to replace every “${system:java.io.tmpdir}/${system:user.name}” in hive-site.xml with “/tmp/my/” (link)

Partitioning and Bucketing Hive Tables

In a previous article, we used sample datasets to join two tables in Hive. To improve the performance of table joins, we can also use partitions or buckets. Let’s first create a Parquet-format table with partitions and buckets:
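A sketch of the DDL, assuming the classic “employees” sample schema (emp_no, birth_date, first_name, last_name, gender, hire_date); the table name and bucket count are illustrative:

```sql
CREATE TABLE employee_part (
    emp_no     INT,
    birth_date DATE,
    first_name STRING,
    last_name  STRING,
    hire_date  DATE
)
PARTITIONED BY (gender STRING)
CLUSTERED BY (emp_no) INTO 8 BUCKETS
STORED AS PARQUET;
```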

Then import data into it:
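A sketch of the naive import that triggers the error below. With “SELECT *”, the source columns keep their original order, and Hive maps the last column of the SELECT to the dynamic partition, so the partition value is not the gender at all:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- WRONG: the last selected column (hire_date in the assumed schema)
-- becomes the dynamic partition value, producing thousands of partitions
INSERT OVERWRITE TABLE employee_part PARTITION (gender)
SELECT * FROM employee;
```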

But it reported an error:

All the employees have only two genders: “M” and “F”. How could Hive report “too many dynamic partitions”?
To find the root cause, I put “explain” before my HQL, and finally noticed this line:

Hive used “_col4” as the partition column, and its type is DATE! In a dynamic-partition insert, Hive takes the last column of the SELECT as the partition value, so the correct import HQL should put the partition column last:
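A sketch of the corrected import, listing the partition column last:

```sql
INSERT OVERWRITE TABLE employee_part PARTITION (gender)
SELECT emp_no, birth_date, first_name, last_name, hire_date, gender
FROM employee;
```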

We have successfully imported the data with dynamic partitions.

Now we create a new Parquet-format table “salary” (using buckets) and join the two tables:
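A sketch of the bucketed table and the join (column names follow the assumed sample schema; the new table is named salary_buck here just to keep it apart from the source table):

```sql
CREATE TABLE salary_buck (
    emp_no    INT,
    salary    INT,
    from_date DATE,
    to_date   DATE
)
CLUSTERED BY (emp_no) INTO 8 BUCKETS
STORED AS PARQUET;

INSERT OVERWRITE TABLE salary_buck
SELECT emp_no, salary, from_date, to_date FROM salary;

-- join the partitioned and bucketed tables
SELECT e.first_name, e.last_name, AVG(s.salary) AS avg_salary
FROM employee_part e JOIN salary_buck s ON e.emp_no = s.emp_no
GROUP BY e.first_name, e.last_name;
```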

The join operation took only 90 seconds, much less than the previous 140 seconds without bucketing and partitioning.

Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing. After putting the two files into HDFS, we just need to create the tables:
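A sketch of the DDL, assuming comma-separated text files and the classic “employees” schema (the HDFS paths are placeholders):

```sql
CREATE EXTERNAL TABLE employee (
    emp_no     INT,
    birth_date DATE,
    first_name STRING,
    last_name  STRING,
    gender     STRING,
    hire_date  DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/datasets/employee';

CREATE EXTERNAL TABLE salary (
    emp_no    INT,
    salary    INT,
    from_date DATE,
    to_date   DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/datasets/salary';
```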

Now we can analyze the data.

Find the oldest 10 employees.
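A sketch of the query, using the columns assumed above:

```sql
SELECT emp_no, first_name, last_name, birth_date
FROM employee
ORDER BY birth_date ASC
LIMIT 10;
```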

Find all the employees who joined the corporation in January 1990.
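One way to write it under the same assumed schema:

```sql
SELECT emp_no, first_name, last_name, hire_date
FROM employee
WHERE YEAR(hire_date) = 1990 AND MONTH(hire_date) = 1;
```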

Find the top 10 employees who earned the highest average salary. Notice that we use ‘ORDER BY’ here because ‘SORT BY’ only produces a local order within each reducer.
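A possible query (same assumed schema):

```sql
SELECT e.emp_no, e.first_name, e.last_name, AVG(s.salary) AS avg_salary
FROM employee e JOIN salary s ON e.emp_no = s.emp_no
GROUP BY e.emp_no, e.first_name, e.last_name
ORDER BY avg_salary DESC
LIMIT 10;
```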

Let’s find out whether this corporation has sex discrimination:
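A sketch comparing the average salary by gender (same assumed schema):

```sql
SELECT e.gender, AVG(s.salary) AS avg_salary
FROM employee e JOIN salary s ON e.emp_no = s.emp_no
GROUP BY e.gender;
```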

The result is:

Looks good 🙂

Use Hive to join two datasets

In a previous article, I wrote Java code for the MapReduce framework to join two datasets. Furthermore, I enhanced the code to sort by score for every student. The complete join-and-sort code is here. It needs more than 170 lines of Java code to join two tables and sort them. But in a production environment, we usually use Hive to do the same work.
By using the same sample datasets:
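The sample tables are not reproduced here; for the sketches below I assume a “student” table and a “score” table along these lines (names and paths are placeholders):

```sql
CREATE EXTERNAL TABLE student (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/datasets/student';

CREATE EXTERNAL TABLE score (
    student_id INT,
    course     STRING,
    score      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/datasets/score';
```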

Now we can join them:
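A sketch of the join-and-sort, roughly the three lines mentioned below (columns follow the assumed tables above):

```sql
SELECT s.name, c.course, c.score
FROM student s JOIN score c ON s.id = c.student_id
ORDER BY s.name, c.score DESC;
```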

Just three lines of HQL (Hive Query Language), instead of 170 lines of Java code.

These two tables are very small, so we can use local mode to run the Hive task:
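A sketch of enabling local mode:

```sql
-- let Hive automatically run small jobs in local mode
SET hive.exec.mode.local.auto=true;
```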

Some tips about Hive

I found some tips about Hive during my learning process:

1. When I started “bin/hive” for the first time, it reported errors:

The solution is simple:
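The fix is not spelled out in the original; assuming the errors were about an uninitialized metastore (the note below mentions Derby), the usual fix is:

```bash
# initialize the embedded Derby metastore before the first start of Hive
bin/schematool -dbType derby -initSchema
```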

Actually, we’d better use MySQL instead of Derby for a multi-user environment.

2. Control the number of mappers for SQL jobs. If a SQL job uses too many mappers, the process context switches (including frequently launching and stopping JVMs) will cost extra CPU resources. We can tune a couple of settings to change the number of mappers for all SQL jobs, as sketched below.
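One common way to do this (my assumption, not necessarily the exact settings used originally): with CombineHiveInputFormat, raising the maximum split size reduces the number of mappers.

```sql
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- larger splits mean fewer mappers (512MB per split here)
SET mapreduce.input.fileinputformat.split.maxsize=536870912;
```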

3. After I imported 1TB of data into an ORC-format table, the size of the table was just 250GB. But after I imported the same 1TB of data into a Parquet-format table, the size was 900GB. It looks like Apache ORC has a more effective compression algorithm for our data.

4. Use partitions carefully.

Now we have a table named “users” which is partitioned by the field “ca”.
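For the experiment below, assume the table was created and populated like this (the columns are illustrative):

```sql
CREATE TABLE users (
    id   INT,
    name STRING
)
PARTITIONED BY (ca STRING);

INSERT INTO TABLE users PARTITION (ca='China') VALUES (1, 'alice');
```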

Now there is a record in the HDFS directory “/user/hive/warehouse/users/ca=China/”.
In the book <>, it is said that we could copy the data of a partition directory to AWS S3 and then point the partition at it. But what if I point the partition at a new, empty HDFS directory? Let’s try:
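A sketch of the experiment (the empty directory and the name node address are placeholders):

```sql
-- create an empty HDFS directory and point the partition at it
dfs -mkdir -p /tmp/empty_dir;

ALTER TABLE users PARTITION (ca='China')
SET LOCATION 'hdfs://namenode-host:8020/tmp/empty_dir';

-- returns no rows now
SELECT * FROM users WHERE ca = 'China';
```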

Because the partition has been pointed at an empty directory, the SELECT cannot find any records now. That is what “schema on read” means.

5. Debug.
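The command is not shown; I assume it is the same debug flag used earlier in this post:

```bash
hive --hiveconf hive.root.logger=DEBUG,console
```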

This will print a lot of debug information for finding the cause, such as: