Hive

Some hints on Dataproc

When running a job in the cluster of Dataproc, it reported:

java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.

The reason is I haven’t added the Jar file for BigQuery. After adding the new Jar file into properties to the template of creating a cluster:

properties:
          spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables.

Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar because it will hang your job when you are reading BigQuery tables. Seems even google makes a significant mistake in their cloud platform :p

2. If a PySpark job needs to use some additional packages in the Dataproc cluster, what should we do?

Still need to add more items in the template to let it install pip packages:

    clusterName: robin
    config:
      gceClusterConfig:
        metadata:
          enable-cloud-sql-proxy-on-workers: 'false'
          use-cloud-sql-private-ip: 'false'
          PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
      initializationActions:
      - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
        executionTimeout: 600s

3. To see how a Hive table be created

show create table <table>;

Books I read in year 2016

Here comes the last day of 2016 year. And it is also the time for me to review my harvest about knowledge, or books.
Frankly speaking, the book “All hard thing about hard things” literally frighten me, and cause me to give up any idea about joining a startup company in China. Maybe this is the best consequence, for many startup companies failed in this end of year and I fortunately avoid this tempest.
Diving more deeper into the ocean of “Hadoop Ecosystem”, or “Big Data”, I find out Spark is really a convenient and powerful framework (compare to MapReduce) which could implement complicated algorithm or data-flow with a few lines of code. Surely, Scala is also a key element for Spark’s efficiency and concision.
Today, even normal person could imagine a sci-fi story about how modern people will fight with Alien invaders. But, what will happen if Aliens attacked the earth in the ancient time? What about Medieval age? Then comes the funny and bold sci-fi novel “The High Crusade”. A group of Medieval army defeat the invader of Alien， and did even more: occupied a frontline planet of a gigantic Alien Empire. It is really out of my imagination 🙂

Build dataflow to get monthly top price of Land Trading in UK

The dataset is downloaded from UK government data web(The total data size is more than 3GB). And, I am using Apache Oozie to run Hive and Sqoop job periodically.
The Hive script “land_price.hql”:

-- Import data from external table to parquet table
SET mapred.job.queue.name=root.default;
SET mapreduce.input.fileinputformat.split.minsize=64000000;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;
CREATE TABLE IF NOT EXISTS realestates (
                          transaction_id STRING,
                          price          INT,
                          date_of_transfer DATE,
                          postcode         STRING,
                          property_type    CHAR(1),
                          old_new          CHAR(1),
                          duration         CHAR(1),
                          paon             STRING,
                          saon             STRING,
                          street           STRING,
                          locality         STRING,
                          town_city        STRING,
                          district         STRING,
                          country          STRING,
                          ppd_category_type CHAR(1),
                          record_status     CHAR(1))
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
                        "separatorChar" = ",",
                        "quoteChar"     = "'"
                       )
  STORED AS textfile
  LOCATION '/user/realestates/';
CREATE TABLE IF NOT EXISTS realestates_p (
                            transaction_id STRING,
                            price          INT,
                            date_of_transfer STRING,
                            postcode         STRING,
                            property_type CHAR(1),
                            old_new          CHAR(1),
                            duration         CHAR(1),
                            paon             STRING,
                            saon             STRING,
                            street           STRING,
                            locality         STRING,
                            town_city        STRING,
                            district         STRING,
                            country          STRING,
                            ppd_category_type CHAR(1),
                            record_status     CHAR(1))
  CLUSTERED BY (transaction_id) INTO 8 BUCKETS
  STORED AS ORC;
INSERT OVERWRITE TABLE realestates_p
    SELECT transaction_id, CAST(SUBSTR(TRIM(price), 2, LENGTH(price)-2) AS INT), date_of_transfer, postcode,
               SUBSTR(property_type, 2, LENGTH(property_type)-2),
               SUBSTR(old_new, 2, LENGTH(old_new)-2),
               SUBSTR(duration, 2, LENGTH(duration)-2),
               paon, saon, street, locality, town_city, district, country,
               SUBSTR(ppd_category_type, 2, LENGTH(ppd_category_type)-2),
               SUBSTR(record_status, 2, LENGTH(record_status)-2)
    FROM realestates;
-- Generate new table for max price of every month
CREATE TABLE month_top
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE AS
SELECT MAX(price) AS max_price, month, town_city, district, country FROM (
  SELECT SUBSTR(date_of_transfer, 2, 7) AS month, price, street, locality, town_city, district, country FROM realestates_p
) month_view
GROUP BY month, town_city, district, country
SORT BY max_price;

We want Hive job to run on queue “root.default” in YARN (and other jobs in “root.mr”), so we set the “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to erase quote charactor “\”” when importing data from raw CSV file.

The “coordinator.xml” for Apache Oozie:


  
    1
  
  
    
      ${appDir}

The “workflow.xml” for Apache Oozie:


  
      ${jobTracker}
      ${nameNode}
  
  
  
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      hive-site.xml
      
        
          mapred.job.queue.name
          root.default
        
      
      
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
          mapred.job.queue.name
          root.default
        
      
      export -Dmapred.job.queue.name=root.default
        --connect jdbc:mysql://192.168.0.1/robin
        --username root --password root --table month_top
        --export-dir /user/hive/warehouse/month_top
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      
        
          mapreduce.job.queuename
          root.mr
        
      
      org.apache.hadoop.examples.terasort.TeraGen
      -Dmapreduce.job.queuename=root.mr
      -Dmapred.map.tasks=96
      ${numRows}
      ${inputDir}
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      
        
          mapreduce.job.queuename
          root.mr
        
        
          mapreduce.input.fileinputformat.split.minsize
          4294967296
        
      
      org.apache.hadoop.examples.terasort.TeraSort
      -Dmapreduce.job.queuename=root.mr
      ${inputDir}
      ${outputDir}

We run two jobs parallelly here: Hive and TeraSort (TeraSort is not useful in real productive environment, but it could be a good substitute for real private job in my company).

The sqoop once report error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is change file “/usr/lib/hadoop/bin/hadoop” like:
HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,NullAppender} \
    -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"

“job.properties” for Oozie:

jobTracker=192.168.0.1:8032
nameNode=hdfs://nameservice1
inputDir=/user/hive/tera
outputDir=/user/hive/result
appDir=/user/oozie/myapp
numRows=12345678
oozie.coord.application.path=${appDir}/coordinator.xml
oozie.use.system.libpath=true

Remember to set “oozie.use.system.libpath=true” therefore Oozie could run Hive and Sqoop job correctly.

The script to create MYSQL table:

create table robin.month_top (
    price int(4),
    month char(16),
    town_city char(64),
    district char(128),
    country char(64));

After launch the Oozie coordinator, it will finally put consequent data into MYSQL table:

Looks the land price of “WOKINGHAM” in October 2015 is extremely expensive.

Problem about running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it report errors:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handled
Type()Ljava/lang/Class;

After changed “spark.master” from “yarn-cluster” to “local” and add “–hiveconf hive.root.logger=DEBUG,console” to hive command, it printed out details like:

java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala:49)
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.(ScalaNumberDeserializersModule.scala)
        at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule.(DefaultScalaModule.scala:19)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala:35)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala)
        at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:81)

This article suggest replacing fasterxml.jackson package with newer version, but the problem remained the same even after I completed the replacement.
Then I found the [HIVE-13301] in JIRA:

This is because calcite has a shaded 2.1.1 version of jackson-databind in it. You can probably remove that from the jar and leave the jackson-databind alone in the hive distro.

This explains everything clearly: Hive was using jackson-databind-2.1.1 in calcite package instead of lib/jackson-databind-2.4.2.jar, therefore updating it has no effect.
Thus, we should remove shaded jackson-databind-2.1.1 in calcite-avatica-1.5.0.jar:

cd ${HIVE_HOME}/lib/
mkdir tmp
cd tmp
# Extract classes from jar
jar -xf ../calcite-avatica-1.5.0.jar
# Remove old jackson-classes in calcite-avatica
find . -name "*jackson*"|xargs rm -rf
# Build new calcite-avatica jar without jackson-classes
jar -cf calcite-avatica-1.5.0.jar *
cp calcite-avatica-1.5.0.jar ../

The Hive uses lib/jackson-databind-2.4.2.jar and runs correctly now.

“database is locked” in Hue

After launching a long-time HiveQL in SQL Editors of Hue, a small exceptional tip appears under the editor “database is locked”. The solution is to make Hue use Mysql instead of sqlite3. But I am using Hue directly got from github, not Cloudera Release version. So the correct steps should be:

Stop Hue server
Install Mysql and create database ‘hue’

Edit desktop/conf/pseudo-distributed.ini, add these in “[[database]]” section:

    engine=mysql
    host=127.0.0.1
    port=3306
    user=
    password=
    name=hue

Run “make apps” (This is the most important step, as it will install Mysql connector/packages automatically and create meta tables in ‘hue’ database)

Start Hue server

build/env/bin/hue runserver 0.0.0.0:8000

Now we can run long-time query and there will be no error.

Deploy Hive on Spark

The Mapreduce framework is too small for realtime analytic query, so we need to change engine of Hive from “mr” to “spark” (link):
1. set environment for spark:

export SPARK_HOME=/home/my/spark/

2. copy configuration xml file for Hive:

cp /home/my/hive/conf/hive-default.xml.template /home/my/hive/conf/hive-site.xml

and change these configuration items:


  hive.execution.engine
  spark


  spark.executor.memory
  4g


  spark.serializer
  org.apache.spark.serializer.KryoSerializer

Notice: remember to replace all “${system:java.io.tmpdir}/${system:user.name}” in hive-site.xml to “/tmp/my/” (link)

Partitioning and Bucketing Hive table

In previous article, we use sample datasets to join two tables in Hive. To promote the performance of table join, we could also use Partition or Bucket. Let’s first create a parquet format table with partition and bucket:

CREATE TABLE employee_p (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    work_day DATE)
PARTITIONED BY (gender CHAR(1))
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;

Then import data into it:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, gender, work_day
                  FROM employee;

But it reports error:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795
","_col1":"1961-07-05","_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:257)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795","_col1":"1961-07-05","
_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:245)
        ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximu
m number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:922)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:699)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:97)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:236)
        ... 7 more

All the employees have only two genders: “M” and “F”. How could Hive report “too many dynamic partitions”?
To look for the fundamental cause, I use “explain” before my HQL, and finally noticed by this line:

expressions: UDFToInteger(VALUE._col0) (type: int), CAST( VALUE._col1 AS DATE) (type: date), VALUE._col2 (type: string), VALUE._col3 (type: string), CAST( VALUE._col4 AS DATE) (type: date), VALUE._col5 (type: string)

Hive use “_col4” as partition column and it’s type is DATE! So the correct import HQL should put partition column at last:

INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, work_day, gender
                  FROM employee;

We successfully import data by dynamic partitions.
Now we create new parquet format table “salary” (using buckets) and join two tables:

CREATE TABLE salary_p (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;
INSERT OVERWRITE TABLE salary_p SELECT * FROM salary;
/* Join two tables */
SELECT  e.gender, AVG(s.salary) AS avg_salary
  FROM  employee_p AS e
        JOIN  salary_p as s
        ON (e.employee_id == s.employee_id)
  GROUP BY e.gender;

The join operation only cost 90 seconds, much smaller than previous 140 seconds without bucketing and partitioning.

Example datasets for learning Hive

I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:

create external table employee (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    gender CHAR(1),
    work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/employee/';
create external table salary (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/salary/';

Now we could analyze the data.
Find the oldest 10 employees.

select * from employee order by birthday asc limit 10;

Find all the employees joined the corporation in January 1990.

select * from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';

Find the top 10 employees earned the highest average salary. Notice we use 'order by' here because 'sort by' only produce local order in reducer.

select e.first_name, e.family_name, avg(s.salary) as avg_salary from
    employee as e join salary as s on (e.employee_id == s.employee_id)
        group by e.first_name, e.family_name order by avg_salary limit 10;

Let's find out whether this corporation has sex discrimination:

  SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id == s.employee_id)
GROUP BY e.gender;

The result is:

F       63767.607741168045
M       63839.90097030445

Looks good 🙂

Use hive to join two datasets

In previous article, I write java code of MapReduce-Framework to join two datasets. Furthermore, I enhanced the code to sort by scores for every student. The complete join-and-sort code is here. It need more than 170 lines of java code to join two tables and sort it. But in product environment, we usually use Hive to do the same work.
By using the same sample datasets:

#create two external tables in Hive.
create external table student(id INT, name STRING)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/users/';
create external table course(id INT, course STRING, score TINYINT)
row format delimited
fields terminated by ','
lines terminated by'\n'
stored as textfile
location '/courses/';

Now we could join them:

select s.name, c.course, c.score from student as s
    join course as c on (s.id == c.id)
        sort by s.name, c.score desc;

Just three lines of HQL (Hive Query Language), not 170 lines of java code.
These two tables are very small, thus we could use local mode to run Hive task:

set hive.exec.mode.local.auto=true;

Some tips about Hive

Found some tips about Hive in my learning progress:
1. When I start “bin/hive” at first time, these errors report:

Exception in thread "main" java.lang.RuntimeException: Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql)

The solution is simple:

mv metastore_db metastore_db.tmp
schematool -initSchema -dbType derby

Actually, we’d better use mysql instead of derby for multi-users environment.
2. Control the number of mappers for SQL jobs. If a SQL job use too much mappers, the context-switch of processes (include frequent launch/stop for JVM) will cost extra CPU resource. We could use

set mapreduce.input.fileinputformat.split.maxsize=...
set mapreduce.input.fileinputformat.split.minsize=...

to change the number of mappers for all the SQL jobs.
3. After I imported 1TB data into a “Orc format” table, the size of the table is just 250GB. But after I imported 1TB data into a “Parquet format” table, the size is 900GB. Looks Apache Orc has more effective compression algorithm for custom data.
4. Using partitions carefully.

create table users (name string, age smallint) partitioned by (ca string);

Now we have a table named “users” and is partitioned by field “ca”.

hive> insert into users values("robindong", 36);
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'
hive> insert into users values("robindong", 36, "China");
FAILED: SemanticException 1:12 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'users'

We can’t using normal INSERT clause to insert record into partitioned table. Trying this:

insert into users partition (ca="China") values("robindong", 36);

Now, there is a record in HDFS directory “/user/hive/warehouse/users/ca=China/”
In the book “Programming Hive”, it said we could copy the data in a partition directory to AWS s3 and then set partition to it. But, what if I set the partition to a new empty HDFS directory? Let’s try:

alter table users partition(ca = 'China') set location '/empty/';
hive> select * from users where ca='China';
OK
Time taken: 0.298 seconds

Because the partition has been set to a empty directory, the select couldn’t find any records now. That is what “Schema on read” mean.
5. Debug.

bin/hive --hiveconf hive.root.logger=DEBUG,console

This will print many debug information for finding causes such as:

aused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hive.common.util.ReflectionUtil.setJobConf(ReflectionUtil.java:112)
        ... 20 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:179)
        at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
        ... 25 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
        ... 27 more