Some tips about Hive

Here are some tips I picked up while learning Hive:

1. When I started “bin/hive” for the first time, it reported these errors:

The solution is simple:

In practice, it is better to use MySQL instead of Derby for the metastore in a multi-user environment.

2. Control the number of mappers for SQL jobs. If a SQL job uses too many mappers, the context switching between processes (including the frequent launching and stopping of JVMs) costs extra CPU resources. We can use

to change the number of mappers for all the SQL jobs.
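The exact setting is not shown above. As a rough sketch, assuming Hive runs on the MapReduce engine with the default CombineHiveInputFormat, the mapper count follows from the input split size, so raising the split size limits should lower the number of mappers. The 256MB and 128MB values are illustrative only:

```sql
-- Assumption: MapReduce execution engine with CombineHiveInputFormat.
-- Larger splits mean fewer mappers; the sizes below are only examples.

-- Upper bound per split: 256MB
SET mapreduce.input.fileinputformat.split.maxsize=268435456;

-- Lower bound per split: 128MB
SET mapreduce.input.fileinputformat.split.minsize=134217728;
```

A SET statement only affects the current session. To apply the values to every job, the same properties would go into hive-site.xml.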

3. After I imported 1TB of data into an ORC-format table, the table took up just 250GB. But after I imported 1TB of data into a Parquet-format table, it took up 900GB. It looks like Apache ORC compresses my data more effectively.
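Part of the gap may come from codec defaults rather than from the formats themselves: in many Hive versions ORC tables are compressed with ZLIB by default, while Parquet tables are written uncompressed unless a codec is requested. The settings in effect for the test above are not recorded here, so this is only a possible explanation. A sketch of a like-for-like setup, with hypothetical table and column names, would state the codec on both tables:

```sql
-- Hypothetical table and column names, for illustration only.

-- ORC: ZLIB is the usual default codec; it is stated explicitly here.
CREATE TABLE events_orc (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

-- Parquet: request a codec explicitly instead of relying on the default.
CREATE TABLE events_parquet (id BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
```

Loading the same data into both tables and comparing the directory sizes in HDFS would then show how much of the difference remains.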

4. Use partitions carefully.

Suppose we have a table named “users” that is partitioned by the field “ca”.

Now there is a record in the HDFS directory “/user/hive/warehouse/users/ca=China/”.
The book <> says we can copy the data in a partition directory to AWS S3 and then point the partition at it. But what if I point the partition at a new, empty HDFS directory? Let’s try:

Because the partition now points to an empty directory, the select can no longer find any records. That is what “schema on read” means.
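The statements for this experiment are not shown above, so here is a minimal sketch of what it might look like in the Hive CLI. The column names, the sample row, the namenode address and the target path are all hypothetical; only the table name “users” and the partition field “ca” come from the text.

```sql
-- Hypothetical columns; only "users" and "ca" come from the text.
CREATE TABLE users (id INT, name STRING)
PARTITIONED BY (ca STRING);

-- Writes one record under /user/hive/warehouse/users/ca=China/
INSERT INTO TABLE users PARTITION (ca='China') VALUES (1, 'Robin');

-- At this point the record is visible.
SELECT * FROM users;

-- Create a new, empty directory (hypothetical path).
dfs -mkdir -p /tmp/users_empty;

-- Point the partition at the empty directory.
-- Assumption: "namenode:8020" stands in for the real namenode address.
ALTER TABLE users PARTITION (ca='China')
SET LOCATION 'hdfs://namenode:8020/tmp/users_empty';

-- The metadata still lists the partition, but its directory holds no files.
SELECT * FROM users;
```

The original file is not deleted by the ALTER TABLE statement. It stays in the old directory, and pointing the partition back at that directory should make the record visible again.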

5. Debugging.

This prints a lot of debug information that helps in finding causes, such as:
