Partitioning and Bucketing Hive table

In previous article, we use sample datasets to join two tables in Hive. To promote the performance of table join, we could also use Partition or Bucket. Let’s first create a parquet format table with partition and bucket:

Then import data into it:

But it reports error:

All the employees have only two genders: “M” and “F”. How could Hive report “too many dynamic partitions”?
To look for the fundamental cause, I use “explain” before my HQL, and finally noticed by this line:

Hive use “_col4” as partition column and it’s type is DATE! So the correct import HQL should put partition column at last:

We successfully import data by dynamic partitions.

Now we create new parquet format table “salary” (using buckets) and join two tables:

The join operation only cost 90 seconds, much smaller than previous 140 seconds without bucketing and partitioning.

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*