partition – Robin on Linux

Partitioning and Bucketing Hive table

In previous article, we use sample datasets to join two tables in Hive. To promote the performance of table join, we could also use Partition or Bucket. Let’s first create a parquet format table with partition and bucket:

CREATE TABLE employee_p (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    work_day DATE)
PARTITIONED BY (gender CHAR(1))
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;

Then import data into it:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, gender, work_day
                  FROM employee;

But it reports error:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795
","_col1":"1961-07-05","_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:257)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795","_col1":"1961-07-05","
_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:245)
        ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximu
m number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:922)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:699)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:97)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:236)
        ... 7 more

All the employees have only two genders: “M” and “F”. How could Hive report “too many dynamic partitions”?
To look for the fundamental cause, I use “explain” before my HQL, and finally noticed by this line:

expressions: UDFToInteger(VALUE._col0) (type: int), CAST( VALUE._col1 AS DATE) (type: date), VALUE._col2 (type: string), VALUE._col3 (type: string), CAST( VALUE._col4 AS DATE) (type: date), VALUE._col5 (type: string)

Hive use “_col4” as partition column and it’s type is DATE! So the correct import HQL should put partition column at last:

INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, work_day, gender
                  FROM employee;

We successfully import data by dynamic partitions.
Now we create new parquet format table “salary” (using buckets) and join two tables:

CREATE TABLE salary_p (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;
INSERT OVERWRITE TABLE salary_p SELECT * FROM salary;
/* Join two tables */
SELECT  e.gender, AVG(s.salary) AS avg_salary
  FROM  employee_p AS e
        JOIN  salary_p as s
        ON (e.employee_id == s.employee_id)
  GROUP BY e.gender;

The join operation only cost 90 seconds, much smaller than previous 140 seconds without bucketing and partitioning.

Some tips about Hive

Found some tips about Hive in my learning progress:
1. When I start “bin/hive” at first time, these errors report:

Exception in thread "main" java.lang.RuntimeException: Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql)

The solution is simple:

mv metastore_db metastore_db.tmp
schematool -initSchema -dbType derby

Actually, we’d better use mysql instead of derby for multi-users environment.
2. Control the number of mappers for SQL jobs. If a SQL job use too much mappers, the context-switch of processes (include frequent launch/stop for JVM) will cost extra CPU resource. We could use

set mapreduce.input.fileinputformat.split.maxsize=...
set mapreduce.input.fileinputformat.split.minsize=...

to change the number of mappers for all the SQL jobs.
3. After I imported 1TB data into a “Orc format” table, the size of the table is just 250GB. But after I imported 1TB data into a “Parquet format” table, the size is 900GB. Looks Apache Orc has more effective compression algorithm for custom data.
4. Using partitions carefully.

create table users (name string, age smallint) partitioned by (ca string);

Now we have a table named “users” and is partitioned by field “ca”.

hive> insert into users values("robindong", 36);
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'
hive> insert into users values("robindong", 36, "China");
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'

We can’t using normal INSERT clause to insert record into partitioned table. Trying this:

insert into users partition (ca="China") values("robindong", 36);

Now, there is a record in HDFS directory “/user/hive/warehouse/users/ca=China/”
In the book “Programming Hive”, it said we could copy the data in a partition directory to AWS s3 and then set partition to it. But, what if I set the partition to a new empty HDFS directory? Let’s try:

alter table users partition(ca = 'China') set location '/empty/';
hive> select * from users where ca='China';
OK
Time taken: 0.298 seconds

Because the partition has been set to a empty directory, the select couldn’t find any records now. That is what “Schema on read” mean.
5. Debug.

bin/hive --hiveconf hive.root.logger=DEBUG,console

This will print many debug information for finding causes such as:

aused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hive.common.util.ReflectionUtil.setJobConf(ReflectionUtil.java:112)
        ... 20 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:179)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
        ... 25 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
        ... 27 more