Partitioning and Bucketing Hive table

In previous article, we use sample datasets to join two tables in Hive. To promote the performance of table join, we could also use Partition or Bucket. Let’s first create a parquet format table with partition and bucket:

Then import data into it:

But it reports error:

All the employees have only two genders: “M” and “F”. How could Hive report “too many dynamic partitions”?
To look for the fundamental cause, I use “explain” before my HQL, and finally noticed by this line:

Hive use “_col4” as partition column and it’s type is DATE! So the correct import HQL should put partition column at last:

We successfully import data by dynamic partitions.

Now we create new parquet format table “salary” (using buckets) and join two tables:

The join operation only cost 90 seconds, much smaller than previous 140 seconds without bucketing and partitioning.

Example datasets for learning Hive

I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:

Now we could analyze the data.

Find the oldest 10 employees.

Find all the employees joined the corporation in January 1990.

Find the top 10 employees earned the highest average salary. Notice we use ‘order by’ here because ‘sort by’ only produce local order in reducer.

Let’s find out whether this corporation has sex discrimination:

The result is:

Looks good 🙂

Use hive to join two datasets

In previous article, I write java code of MapReduce-Framework to join two datasets. Furthermore, I enhanced the code to sort by scores for every student. The complete join-and-sort code is here. It need more than 170 lines of java code to join two tables and sort it. But in product environment, we usually use Hive to do the same work.
By using the same sample datasets:

Now we could join them:

Just three lines of HQL (Hive Query Language), not 170 lines of java code.

These two tables are very small, thus we could use local mode to run Hive task:

Some problems about programming Mapreduce

1. After submitting job, the console report:

The reason is I forgot to setJarByClass():

2. When the job finished, I found the reducer haven’t run at all. The reason is I haven’t override the correct reduce() member function of Reducer so MapReduce Framework ignore it and didn’t report any notification or warning. To make sure we override the correct member function of parent class, we need to add annotation:

Use MapReduce to join two datasets

The two datasets are:

To join the two tables above by “student id”, we need to use MultipleInputs. The code is:

Compile and run it:

And the result in /my is:

Use MapReduce to find prime numbers

Just want to write a small example of MapReduce of Hadoop for finding prime numbers. The first question is: how could I generate numbers from 1 to 1000000 by my own application instead of reading from file of HDFS? The answer is: inherit the InputSplit, RecordReader, and InputFormat by yourself, just like teragen program
Then comes the second question: could I just use mapper without reducer stage? The answer is yes, simply use job.setNumReduceTasks(0) to disable reducer stage.

The complete code is here (I know the algorithm for checking a number for prime is naive, but it works):

Copy the code to file, compile and run it: