Monthly Archives: July 2016

Partitioning and Bucketing Hive table

In the previous article, we used sample datasets to join two tables in Hive. To improve the performance of table joins, we can also use partitioning or bucketing. Let’s first create a Parquet-format table with partitions and buckets:

Then import data into it:

But Hive reports an error:
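To make the bucketing idea concrete, here is a minimal plain-Java sketch of how a row could be assigned to a bucket: a non-negative hash of the bucketing column, modulo the number of buckets. The class name, the bucket count of 4, and the sample ids are all hypothetical, and Hive's real hash function depends on the column type and Hive version, so treat this as an illustration of the concept rather than Hive's implementation.

```java
public class BucketDemo {
    // Conceptual bucket assignment: clear the sign bit of the hash,
    // then take it modulo the number of buckets.
    static int bucketFor(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;                          // hypothetical bucket count
        int[] ids = {10001, 10002, 10005, 10006};    // hypothetical key values
        for (int id : ids) {
            System.out.println(id + " -> bucket " + bucketFor(id, numBuckets));
        }
    }
}
```

Because the same key always maps to the same bucket number, two tables bucketed on the join column with the same bucket count only need to compare rows from matching buckets, which is where the join speedup comes from.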


Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing Hive. After putting the two files into HDFS, we just need to create the tables:

Now we can analyze the data. Find the 10 oldest employees:

Find all the employees who joined the corporation in January 1990:

Find the top…

Some problems about programming MapReduce

1. After submitting the job, the console reports:

The reason is that I forgot to call setJarByClass():

2. When the job finished, I found that the reducer hadn’t run at all. The reason is that I hadn’t overridden the correct reduce() member function of Reducer, so the MapReduce framework ignored it and didn’t report…
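The second problem can be reproduced without Hadoop. The following is a minimal plain-Java sketch in which `BaseReducer` is a hypothetical stand-in for the framework's `Reducer` (whose real `reduce()` in the new API takes a key, an `Iterable` of values, and a `Context`). A subclass that declares `reduce()` with an `Iterator` parameter only adds an overload, so a caller that goes through the base type silently gets the default behavior.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverrideDemo {
    // Hypothetical stand-in for a framework base class with a default reduce().
    static class BaseReducer {
        String reduce(String key, Iterable<Integer> values) {
            return key + ":default";
        }
    }

    // Wrong parameter type (Iterator instead of Iterable): this is an overload,
    // not an override, so the base-class reduce() is still the one that gets called.
    static class WrongReducer extends BaseReducer {
        String reduce(String key, Iterator<Integer> values) {
            return key + ":custom";
        }
    }

    // Correct signature. @Override makes the compiler reject a mismatched one.
    static class RightReducer extends BaseReducer {
        @Override
        String reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return key + ":" + sum;
        }
    }

    // Mimics a framework invoking reduce() through the base type.
    static String run(BaseReducer reducer, String key, List<Integer> values) {
        Iterable<Integer> iterable = values;
        return reducer.reduce(key, iterable);
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(run(new WrongReducer(), "a", values)); // a:default
        System.out.println(run(new RightReducer(), "a", values)); // a:6
    }
}
```

Annotating every intended override with `@Override` turns this kind of silent mismatch into a compile-time error.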

Use MapReduce to join two datasets

The two datasets are:

To join the two tables above by “student id”, we need to use MultipleInputs. The code is:

Compile and run it:

And the result in /my is:
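For readers who want to see the join logic in isolation, here is a minimal plain-Java sketch of a reduce-side join. It is not the MapReduce job described above: it has no Hadoop dependency, the class name and the sample records are hypothetical, and a `TreeMap` stands in for the shuffle. In a real job, `MultipleInputs.addInputPath()` would attach a different mapper to each input path, and each mapper would tag its records in a similar way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {
    // "Map" phase: key each record by student id and tag it with its source.
    static void mapInto(Map<String, List<String>> shuffle, String[] lines, String tag) {
        for (String line : lines) {
            String[] parts = line.split(",", 2); // parts[0] = student id, parts[1] = payload
            shuffle.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(tag + ":" + parts[1]);
        }
    }

    // "Reduce" phase: for each student id, pair every name with every score.
    static List<String> join(String[] students, String[] scores) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // stands in for the shuffle/sort
        mapInto(shuffle, students, "N");
        mapInto(shuffle, scores, "S");

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            List<String> names = new ArrayList<>();
            List<String> marks = new ArrayList<>();
            for (String value : entry.getValue()) {
                if (value.startsWith("N:")) {
                    names.add(value.substring(2));
                } else {
                    marks.add(value.substring(2));
                }
            }
            for (String name : names) {
                for (String mark : marks) {
                    out.add(entry.getKey() + "\t" + name + "\t" + mark);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] students = {"1,alice", "2,bob"};   // hypothetical "id,name" records
        String[] scores = {"1,90", "2,75", "1,60"}; // hypothetical "id,score" records
        join(students, scores).forEach(System.out::println);
    }
}
```

The tag is what lets the reduce step tell which dataset a value came from once both inputs have been grouped under the same student id.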