Yearly Archives: 2016

Books I read in year 2016

Here comes the last day of 2016, so it is time for me to review my harvest of knowledge, or rather of books.
Frankly speaking, the book “The Hard Thing About Hard Things” literally frightened me and made me give up any idea of joining a startup company in China. Maybe this was the best outcome, for many startup companies failed at the end of this year and I fortunately avoided that tempest.
Diving deeper into the ocean of the “Hadoop Ecosystem”, or “Big Data”, I found that Spark is a really convenient and powerful framework (compared to MapReduce), which can implement a complicated algorithm or data flow in a few lines of code. Surely, Scala is also a key element of Spark’s efficiency and concision.
Today, even an ordinary person can imagine a sci-fi story about how modern people would fight alien invaders. But what would happen if aliens attacked the Earth in ancient times? What about the Middle Ages? Then comes the funny and bold sci-fi novel “The High Crusade”. A medieval army defeats the alien invaders and does even more: it occupies a frontline planet of a gigantic alien empire. It is really beyond my imagination 🙂

The type of variables in Python

Not having written Python code for more than a year, I ran into this simple problem:

import zookeeper  # the zkpython package installs a module named "zookeeper"
....
res = zookeeper.get_children(handle, path, zk_watcher)
a = len(res)
b = res[0]
print a, b
if a >= b:
    print "OK"

Even though the code printed the values of “a” and “b” as 2 and 1, the condition check “if a >= b:” was false!
After spending more than 10 minutes, I eventually found the reason: the type of “a” is “int” but “b” is a “string” (and the Python 2 interpreter does not report any warning about this “inconsistency”; it simply treats every number as smaller than every string). I should have taken more care over the types of these variables.
It seems “print” can’t reveal adequate detail about a variable, so I highly suggest using “pprint” instead of “print”.

import pprint
...
pprint.pprint(a)
pprint.pprint(b)

The result will be

2
'1'
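
A minimal sketch of the same pitfall, written for Python 3 (where comparing an int with a str raises TypeError instead of silently returning False). The list `children` is a hypothetical stand-in for the result of zookeeper.get_children, and `describe` is a helper invented for this illustration.

```python
def describe(value):
    """Return the repr and the type name, which a plain print would hide."""
    return "%r (%s)" % (value, type(value).__name__)

# Hypothetical stand-in for the list returned by zookeeper.get_children()
children = ["1", "7"]
a = len(children)
b = children[0]

print(describe(a))  # 2 (int)
print(describe(b))  # '1' (str)

# Convert explicitly before comparing, so both sides are integers
ok = a >= int(b)
print(ok)           # True
```

Converting at the point where the value enters the program keeps the later comparison honest, whichever Python version runs it.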

My understanding of CNN (Convolutional Neural Network)

A classic neural network in machine learning usually uses fully connected layers, which cost too much computing resource to get the final result if the inputs are high-resolution images. So along comes the Convolutional Neural Network. A CNN splits the whole big image into small pieces (called Receptive Fields) and applies some “Convolutional Operations” (which are actually image transformations, also called Kernels) to each Receptive Field, followed by a pooling operation (usually max-pooling, which simply keeps the biggest feature value in each 2x2 matrix).
Receptive Fields are easy to understand, but why does it use different kinds of “Convolutional Operations” on them? In my opinion, “Convolutional Operations” means using different kinds of Kernel Functions to transform the same image (for example, sharpening the image or detecting the edges of objects in it), so they can reveal different views of the same image.
These different Kernel Functions reveal different “Features” of an image, so we call their outputs “Feature Maps”:
Convolutional Neural Network
From http://mxnet.io/tutorials/python/mnist.html
(The light-yellow matrix is computed from the light-gray matrix on its left.)
By using Receptive Fields and max-pooling, the number of neurons gradually becomes very small, which makes the computation (or regression) much easier and faster:
Convolutional Neural Network
From http://www.cnblogs.com/bzjia-blog/p/3442788.html
Therefore, I reckon the main purpose of using a CNN is to reduce the computational cost of a fully connected Neural Network.
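
To make the two steps concrete, here is a small pure-Python sketch of one kernel sliding over an image, followed by 2x2 max-pooling. The 5x5 image and the 2x2 edge kernel are made-up values for illustration. Like most deep learning frameworks, the function computes a cross-correlation (the kernel is not flipped), and it uses stride 1 without padding.

```python
def convolve2d(image, kernel):
    """Slide the kernel over every receptive field and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool_2x2(fmap):
    """Keep only the biggest value of each non-overlapping 2x2 block."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]

# Made-up image: dark on the left, bright on the right
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Made-up kernel that responds to a dark-to-bright vertical edge
kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool_2x2(feature_map)       # 2x2 after pooling
print(feature_map)
print(pooled)
```

The 25 input pixels shrink to 16 feature values and then to 4 pooled values, which is the reduction in neurons described above.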

Build dataflow to get monthly top price of Land Trading in UK

The dataset is downloaded from the UK government data website (the total data size is more than 3GB). I am using Apache Oozie to run the Hive and Sqoop jobs periodically.
The Hive script “land_price.hql”:

-- Import data from the CSV-backed text table into the ORC table
SET mapred.job.queue.name=root.default;
SET mapreduce.input.fileinputformat.split.minsize=64000000;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;
CREATE TABLE IF NOT EXISTS realestates (
                          transaction_id STRING,
                          price          INT,
                          date_of_transfer DATE,
                          postcode         STRING,
                          property_type    CHAR(1),
                          old_new          CHAR(1),
                          duration         CHAR(1),
                          paon             STRING,
                          saon             STRING,
                          street           STRING,
                          locality         STRING,
                          town_city        STRING,
                          district         STRING,
                          country          STRING,
                          ppd_category_type CHAR(1),
                          record_status     CHAR(1))
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
                        "separatorChar" = ",",
                        "quoteChar"     = "'"
                       )
  STORED AS textfile
  LOCATION '/user/realestates/';
CREATE TABLE IF NOT EXISTS realestates_p (
                            transaction_id STRING,
                            price          INT,
                            date_of_transfer STRING,
                            postcode         STRING,
                            property_type CHAR(1),
                            old_new          CHAR(1),
                            duration         CHAR(1),
                            paon             STRING,
                            saon             STRING,
                            street           STRING,
                            locality         STRING,
                            town_city        STRING,
                            district         STRING,
                            country          STRING,
                            ppd_category_type CHAR(1),
                            record_status     CHAR(1))
  CLUSTERED BY (transaction_id) INTO 8 BUCKETS
  STORED AS ORC;
INSERT OVERWRITE TABLE realestates_p
    SELECT transaction_id, CAST(SUBSTR(TRIM(price), 2, LENGTH(price)-2) AS INT), date_of_transfer, postcode,
               SUBSTR(property_type, 2, LENGTH(property_type)-2),
               SUBSTR(old_new, 2, LENGTH(old_new)-2),
               SUBSTR(duration, 2, LENGTH(duration)-2),
               paon, saon, street, locality, town_city, district, country,
               SUBSTR(ppd_category_type, 2, LENGTH(ppd_category_type)-2),
               SUBSTR(record_status, 2, LENGTH(record_status)-2)
    FROM realestates;
-- Generate new table for max price of every month
CREATE TABLE month_top
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE AS
SELECT MAX(price) AS max_price, month, town_city, district, country FROM (
  SELECT SUBSTR(date_of_transfer, 2, 7) AS month, price, street, locality, town_city, district, country FROM realestates_p
) month_view
GROUP BY month, town_city, district, country
SORT BY max_price;

We want the Hive job to run in the YARN queue “root.default” (and the other jobs in “root.mr”), so we set “mapred.job.queue.name” to “root.default”.

Remember to use SUBSTR() in Hive to strip the double-quote character (") that surrounds each field when importing data from the raw CSV file.
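
As a rough illustration of what those SUBSTR() calls do, the sketch below mimics Hive's 1-based SUBSTR in Python. The helper names and the two sample field values are assumptions made for this example; they are not taken from the dataset.

```python
def hive_substr(text, start, length):
    """Mimic Hive's SUBSTR(text, start, length) for a positive, 1-based start."""
    return text[start - 1:start - 1 + length]

def strip_quotes(field):
    """Same idea as SUBSTR(field, 2, LENGTH(field)-2) in the Hive script."""
    return hive_substr(field, 2, len(field) - 2)

# Hypothetical raw fields, still wrapped in double quotes
raw_type = '"D"'
raw_date = '"2015-10-01 00:00"'

print(strip_quotes(raw_type))       # D
print(hive_substr(raw_date, 2, 7))  # 2015-10
```

The second call matches SUBSTR(date_of_transfer, 2, 7) in the script: skipping the opening quote and keeping seven characters leaves the year and month.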

The “coordinator.xml” for Apache Oozie:


  
    1
  
  
    
      ${appDir}

The “workflow.xml” for Apache Oozie:


  
      ${jobTracker}
      ${nameNode}
  
  
  
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      hive-site.xml
      
        
          mapred.job.queue.name
          root.default
        
      
      
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
          mapred.job.queue.name
          root.default
        
      
      export -Dmapred.job.queue.name=root.default
        --connect jdbc:mysql://192.168.0.1/robin
        --username root --password root --table month_top
        --export-dir /user/hive/warehouse/month_top
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      
        
          mapreduce.job.queuename
          root.mr
        
      
      org.apache.hadoop.examples.terasort.TeraGen
      -Dmapreduce.job.queuename=root.mr
      -Dmapred.map.tasks=96
      ${numRows}
      ${inputDir}
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      
        
          mapreduce.job.queuename
          root.mr
        
        
          mapreduce.input.fileinputformat.split.minsize
          4294967296
        
      
      org.apache.hadoop.examples.terasort.TeraSort
      -Dmapreduce.job.queuename=root.mr
      ${inputDir}
      ${outputDir}

We run two jobs in parallel here: Hive and TeraSort (TeraSort is not useful in a real production environment, but it is a good substitute for the real private job in my company).

Sqoop once reported the error “javax.xml.parsers.ParserConfigurationException: Feature ‘http://apache.org/xml/features/xinclude’ is not recognized”.
The solution is to change the file “/usr/lib/hadoop/bin/hadoop” like this:
HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,NullAppender} \
    -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"

“job.properties” for Oozie:

jobTracker=192.168.0.1:8032
nameNode=hdfs://nameservice1
inputDir=/user/hive/tera
outputDir=/user/hive/result
appDir=/user/oozie/myapp
numRows=12345678
oozie.coord.application.path=${appDir}/coordinator.xml
oozie.use.system.libpath=true

Remember to set “oozie.use.system.libpath=true” so that Oozie can run the Hive and Sqoop jobs correctly.

The script to create the MySQL table:

create table robin.month_top (
    price int(4),
    month char(16),
    town_city char(64),
    district char(128),
    country char(64));

After the Oozie coordinator is launched, it finally puts the resulting data into the MySQL table:

It looks like the land price of “WOKINGHAM” in October 2015 was extremely high.

Some tips about using Apache Flume

Question1: The Flume process reports “Expected timestamp in the Flume event headers, but it was null”
Solution1: The Flume process expects to receive events with a timestamp header, but the events don’t have one. When sending plain text events to Flume, we need to tell it to generate a timestamp for every event by itself. Put the line below into the configuration:

a1.sinks.k1.hdfs.useLocalTimeStamp=true

Question2: The HDFS Sink generates a tremendous number of small files at high frequency even though we have set “a1.sinks.k1.hdfs.rollInterval=600”
Solution2: We still need to set “rollCount” and “rollSize”, because Flume rolls the file as soon as any one of the “rollInterval”, “rollCount”, or “rollSize” conditions is fulfilled. Setting “rollCount” and “rollSize” to 0 disables those two conditions.

a1.sinks.k1.hdfs.rollInterval=600
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0
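
The rule can be pictured with a simplified model (this is not Flume's source code, only a sketch of the behavior described above). The function name is invented for illustration; the default values 30 seconds, 10 events and 1024 bytes are the HDFS sink defaults as I understand them from the Flume user guide.

```python
def should_roll(elapsed_seconds, event_count, bytes_written,
                roll_interval=30, roll_count=10, roll_size=1024):
    """Roll when ANY enabled condition is met; a setting of 0 disables it."""
    by_time = roll_interval > 0 and elapsed_seconds >= roll_interval
    by_count = roll_count > 0 and event_count >= roll_count
    by_size = roll_size > 0 and bytes_written >= roll_size
    return by_time or by_count or by_size

# Only rollInterval=600 was set: the default rollCount still fires after 10 events
print(should_roll(5, 10, 100, roll_interval=600))
# With rollCount=0 and rollSize=0, only the 600-second timer can roll the file
print(should_roll(5, 10, 100, roll_interval=600, roll_count=0, roll_size=0))
print(should_roll(600, 10, 100, roll_interval=600, roll_count=0, roll_size=0))
```

This is why setting only the interval still produced many small files: the count and size limits kept their small defaults.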

Question3: The Flume process exits and reports “Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.OutOfMemoryError: GC overhead limit exceeded”
Solution3: Simply add JAVA_OPTS="-Xms12g -Xmx12g" (my server has more than 16G of physical memory) to “/usr/lib/flume-ng/bin/flume-ng”
—— My configuration file for Flume ——

a1.sources=r1 r2
a1.sinks=k1 k2
a1.channels=c1 c2
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=44444
a1.sources.r1.channels=c1
a1.sources.r2.type=netcat
a1.sources.r2.bind=0.0.0.0
a1.sources.r2.port=55555
a1.sources.r2.channels=c2
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/user/realestates/CN/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix=re-
a1.sinks.k1.hdfs.rollInterval=600
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.channel=c1
a1.sinks.k2.type=hdfs
a1.sinks.k2.hdfs.path=/user/realestates/AU/%Y-%m-%d/
a1.sinks.k2.hdfs.filePrefix=re-
a1.sinks.k2.hdfs.rollInterval=600
a1.sinks.k2.hdfs.rollCount=0
a1.sinks.k2.hdfs.rollSize=0
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = hour
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.useLocalTimeStamp=true
a1.sinks.k2.channel=c2
a1.channels.c1.type=memory
a1.channels.c1.capacity=23456789
a1.channels.c1.transactionCapacity=23456789
a1.channels.c2.type=memory
a1.channels.c2.capacity=23456789
a1.channels.c2.transactionCapacity=23456789

The startup command for Cloudera Environment:

sudo -u hdfs flume-ng agent --conf ./ --conf-file example.conf \
     -name a1 -Dflume.root.logger=INFO,console \
     -Dorg.apache.flume.log.rawdata=true

Use Oozie to run terasort

The better choice of “Action” for running the terasort test case in Oozie is the “Java Action” instead of the “MapReduce Action”, because terasort needs to run

TeraInputFormat.writePartitionFile(job, partitionFile);

first and then load ‘partitionFile’ through “TotalOrderPartitioner”. It is not a simple MapReduce job that needs merely a few properties.
The directory of this “TerasortApp”, which uses the “Java Action” of Oozie, looks like this:

TerasortApp/
├── job.properties
├── lib
│   └── hadoop-mapreduce-examples.jar
└── workflow.xml

The core of this App is “workflow.xml”:

  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      org.apache.hadoop.examples.terasort.TeraGen
      -Dmapred.map.tasks=96
      ${numRows}
      ${inputDir}
    
    
    
  
  
    
      ${jobTracker}
      ${nameNode}
      
        
      
      
        
          mapreduce.input.fileinputformat.split.minsize
          4294967296
        
      
      org.apache.hadoop.examples.terasort.TeraSort
      ${inputDir}
      ${outputDir}
      
    
    
    
  
  
    Failed to terasort!

Note 1. In a Cloudera environment, the Web UI may fail at the last step, which creates the sharelib for the Oozie service. To fix this problem:

$sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn/
$sudo -u oozie oozie  admin -shareliblist -oozie http://localhost:11000/oozie
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig

Note 2. We can’t use the ‘mapred.map.tasks’ property to change the number of mappers in Terasort, because the number of map tasks follows the number of input splits rather than that property. Therefore I use the ‘mapreduce.input.fileinputformat.split.minsize’ property to limit the number of mappers.
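
A rough sketch of why raising the minimum split size reduces the number of mappers. The formula max(minSize, min(maxSize, blockSize)) is the one Hadoop's FileInputFormat uses to pick a split size; the 10 GiB file and the 128 MiB block size are assumptions for this example, and the estimate ignores the small slop factor Hadoop applies to the last split.

```python
import math

def compute_split_size(block_size, min_size, max_size):
    """Split size as chosen by FileInputFormat: max(min, min(max, block))."""
    return max(min_size, min(max_size, block_size))

def estimate_splits(file_size, block_size, min_size=1, max_size=2 ** 63 - 1):
    """Approximate number of splits (and so of mappers) for one file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    return math.ceil(file_size / split_size)

GIB = 1024 ** 3
MIB = 1024 ** 2

# Assumed input: one 10 GiB file stored with 128 MiB blocks
default_mappers = estimate_splits(10 * GIB, 128 * MIB)
# Same file with split.minsize=4294967296 (4 GiB), as in workflow.xml
limited_mappers = estimate_splits(10 * GIB, 128 * MIB, min_size=4294967296)
print(default_mappers, limited_mappers)
```

Under these assumptions the same file goes from 80 mappers to 3.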

Using “sysbench” to test memory performance

Sysbench is a powerful testing tool for CPU / memory / MySQL etc. Three years ago, I used it to test the performance of MySQL.
Yesterday, I used Sysbench to test the memory bandwidth of my server.
Using the command:

sysbench --test=memory --memory-block-size=1M --memory-total-size=100G --num-threads=1 run

It reported that the memory bandwidth could reach 8.4GB/s, which made sense to me.
But after decreasing the block size (changing 1M to 1K):

sysbench --test=memory --memory-block-size=1K --memory-total-size=100G --num-threads=1 run

The memory bandwidth reported by Sysbench became only 2GB/s.
This regression in memory performance really confused me. Maybe the memory of modern machines has some kind of “maximum access frequency”, so we can’t access memory too frequently?
After checking the code of Sysbench, I found that its memory test logic is just like this program (I wrote it myself):

/* mytest.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
const long DATA = (100 * 1024 * 1048576LL); /* 100G data */
int main(int argc, char *argv[]) {
    volatile int tmp = 0;
    int *buffer, *end, *begin;
    long i, loop, block_size;
    struct timeval before, after;
    if (argc < 2) {
        return -1;
    }
    /* Block size in bytes, e.g. 1048576 for 1M or 1024 for 1K */
    block_size = atol(argv[1]);
    buffer = (int *)malloc(block_size);
    if (buffer == NULL) {
        return -1;
    }
    end = (int*)(((char *)buffer) + block_size);
    /* Number of passes over the block needed to write 100G in total */
    loop = (long)DATA / block_size;
    gettimeofday(&before, NULL);
    for (i = 0; i < loop; i++) {
        /* Write the whole block, one int at a time */
        for (begin = buffer; begin < end; begin++) {
            *begin = tmp;
        }
    }
    gettimeofday(&after, NULL);
    /* Elapsed time in microseconds */
    printf("time: %ld\n", (long)((after.tv_sec * 1000000 + after.tv_usec)
        - (before.tv_sec * 1000000 + before.tv_usec)));
    free(buffer);
    return 0;
}

But this test program took only 14 seconds (Sysbench took 49 seconds). To find the root cause, we need a more powerful tool, perf:

# perf stat -e cache-misses,faults,branch-misses ./mytest 1048576
Performance counter stats for './my 1048576':
            90,395 cache-misses
               400 faults
           178,554 branch-misses
      14.825497139 seconds time elapsed
# perf stat -e cache-misses,faults,branch-misses sysbench --test=memory --memory-block-size=1K --memory-total-size=100G --num-threads=1 run
Performance counter stats for 'sysbench --test=memory --memory-block-size=1K --memory-total-size=100G --num-threads=1 run':
           739,223 cache-misses
               825 faults
           531,908 branch-misses
      49.264963322 seconds time elapsed

They have totally different numbers of CPU cache misses. The root cause is that Sysbench uses a complicated framework to support different test targets (MySQL, memory ...), which needs to pass a structure named "request" and many other arguments in and out of the execution_request() function many times per request (accessing 1K of memory, in our scenario). This overhead becomes significant when the block size is too small.
The conclusion is: don't use Sysbench to test memory performance with too small a block size; 1MB or bigger is better.
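
A back-of-the-envelope estimate of that per-request overhead, using only the numbers reported above. It assumes a very simple model (total time = pure streaming time + a fixed cost per request) and treats "100G" as 100 GiB; both are simplifications of mine, not something Sysbench reports.

```python
GIB = 1024 ** 3

total_bytes = 100 * GIB              # --memory-total-size=100G
block_bytes = 1024                   # --memory-block-size=1K
requests = total_bytes // block_bytes

streaming_seconds = 100 / 8.4        # 100G at the 8.4GB/s measured with 1M blocks
sysbench_seconds = 49.26             # elapsed time of the 1K run under perf

# Whatever is left after pure streaming is attributed to per-request overhead
overhead_per_request = (sysbench_seconds - streaming_seconds) / requests
print(requests)
print("%.0f ns per request" % (overhead_per_request * 1e9))
```

Under this model each 1K request carries a few hundred nanoseconds of framework cost, which is negligible for a 1M block but dominates a 1K one.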
Ref: as Coly Li taught me, memory does have a "maximum access frequency" (link). Take DDR4-1866 for example: its data rate is 1866MT/s (MT = Mega Transfers) and every transfer carries 8 bytes, so theoretically we can access memory more than 1 billion times per second.
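
The arithmetic behind that statement, as a small sketch. It assumes a single memory channel, and it treats the 2GB/s Sysbench figure as 2 GiB/s; the function name is invented for illustration.

```python
def peak_bytes_per_second(mega_transfers, bytes_per_transfer=8):
    """Theoretical peak of one channel: transfers per second times bytes each."""
    return mega_transfers * 1000000 * bytes_per_transfer

transfers_per_second = 1866 * 1000000      # DDR4-1866
peak = peak_bytes_per_second(1866)         # bytes per second

# What the 1K Sysbench run actually achieved, expressed in 8-byte transfers
blocks_per_second = 2 * 1024 ** 3 // 1024  # 1K blocks written per second
achieved_transfers = blocks_per_second * (1024 // 8)

print(transfers_per_second)  # 1866000000
print(peak / 1e9)            # 14.928 (GB/s)
print(achieved_transfers)    # 268435456
```

The 1K run stays far below the theoretical transfer rate, which is consistent with the slowdown coming from Sysbench's overhead rather than from a limit of the memory itself.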

Install CDH(Cloudera Distribution Hadoop) by Cloudera Manager

These days I have been trying to install Cloudera-5.8.3 on my centos-7 machines, and here are some operation steps and troubleshooting tips:
0. If you are not in the USA, access to the Cloudera repository of RPMs (or Parcels) is desperately slow, so we need to mirror the CM (Cloudera Manager) repo and the CDH repo locally.
Create local CM Repo
Create local CDH Repo
1. Install Cloudera Manager (steps)
2. Start Cloudera Manager

sudo cmf-server start

But it reports:

org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:142)
... 22 more
Caused by: org.hibernate.service.classloading.spi.ClassLoadingException: HHH010003: JDBC Driver class not found: com.mysql.jdbc.Driver
at org.hibernate.service.jdbc.connections.internal.C3P0ConnectionProvider.configure(C3P0ConnectionProvider.java:142)
at org.hibernate.service.internal.StandardServiceRegistryImpl.configureService(StandardServiceRegistryImpl.java:75)

On centos-7, the solution is:

# Install Mysql Driver for Java
sudo yum install mysql-connector-java -y
# Set jar to CLASSPATH
export CMF_JDBC_DRIVER_JAR=/usr/share/java/mysql-connector-java.jar
# Start Cloudera Manager again
sudo cmf-server start

We also need to run “sudo ./cloudera-manager-installer.bin --skip_repo_package=1” to create “db.properties”.
3. Log in to Cloudera Manager (port: 7180) and follow the steps of the wizard to create a new cluster. (Choosing the local repository for installation makes it pleasantly fast 🙂)
Make sure the hostname of every node is correct. By using the “Host Inspector”, we can reveal many potential problems on these machines.
After trying many times to set up the cluster, I found this error in the logs of some nodes:

Error, CM server guid updated, expected 85587073-270d-43d9-a44a-e213d9f7e45b, received 4c1402a5-8364-4598-a382-0c760710e897

The solution is simple:

#For the error node
sudo rm -rf /var/lib/cloudera-scm-agent/cm_guid

and restart the Cloudera Manager Agent on these nodes.
I also ran into a problem where the installation progress hung on this message:

Acquiring installation lock...

There wasn’t any “yum” process running on the node, so why was it still trying to acquire the installation lock? The answer is:

sudo rm -rf /tmp/.scm_prepare_node.lock

4. After many failures and retries, I eventually set up the Hadoop Ecosystem of CDH:

When upgrading or downgrading a Cloudera cluster, you may see this problem:

The solution is (if in ‘single user mode’):

sudo chown cloudera-scm:cloudera-scm /run/cloudera-scm-agent/ -R
sudo chown cloudera-scm:cloudera-scm /var/lib/cloudera-scm-agent/ -R

and try again.
When starting the ResourceManager, it failed and reported:

2017-06-05 16:31:58,812 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Update thread interrupted. Exiting.
2017-06-05 16:31:58,813 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Continuous scheduling thread interrupted. Exiting.
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:319)
2017-06-05 16:31:58,814 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Interrupted while waiting to reload alloc configuration
2017-06-05 16:31:58,814 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2017-06-05 16:31:58,814 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2017-06-05 16:31:58,814 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted
2017-06-05 16:31:58,814 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted
2017-06-05 16:31:58,815 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to standby state
2017-06-05 16:31:58,816 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
    at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:278)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:990)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1090)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1222)
Caused by: java.io.IOException: Problem in starting http server. Server handlers failed
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:912)
    at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:273)
    ... 4 more
2017-06-05 16:31:58,818 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: SHUTDOWN_MSG:

The reason for this error is that a non-Cloudera version of zookeeper is installed on the host. Remove it and reinstall zookeeper from CDH, and the yarn-resource-manager will launch successfully.
If you meet “Deploy Client Configuration failed” when creating a new service, just give the cloudera-scm user passwordless sudo:

cloudera-scm    ALL=(ALL)       NOPASSWD: ALL

Using Pig to join two tables and sort it

Given two tables, salary and employee, we can use Pig to find the employees with the highest average salary:

salary = LOAD '/user/robin/salaries/salaries.csv' USING PigStorage(',') AS (uid:int, salary:int, begin:chararray, end:chararray);
employee = LOAD '/user/robin/employees/employees.csv' USING PigStorage(',') AS (uid:int, birth:chararray, givenname:chararray, familyname:chararray, gender:chararray, work:chararray);
jo = JOIN employee BY uid, salary BY uid;
res = ORDER (
             FOREACH (
                      GROUP jo BY (employee::uid, employee::birth, employee::givenname, employee::familyname, employee::gender, employee::work)
             )
             GENERATE group.employee::uid, group.employee::givenname, group.employee::familyname, AVG(jo.salary::salary) AS avg_salary
      ) BY avg_salary DESC;
fs -rmr /user/robin/join_result;
STORE res INTO '/user/robin/join_result' USING PigStorage(',');

The result is:

109334,'Tsutomu','Alameldin',141835.33333333334
205000,'Charmane','Griswold',141064.63636363635
43624,'Tokuyasu','Pesch',138492.94444444444
493158,'Lidong','Meriste',138312.875
37558,'Juichirou','Thambidurai',138215.85714285713
276633,'Shin','Birdsall',136711.73333333334
238117,'Mitsuyuki','Stanfel',136026.2
46439,'Ibibia','Junet',135747.73333333334
254466,'Honesty','Mukaidono',135541.0625
253939,'Sanjai','Luders',135042.25
....
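
For readers who don't know Pig, the same join, group, average and sort can be sketched in plain Python. The tiny employee and salary lists below are invented for illustration, and the sketch assumes uid is unique per employee (the Pig script groups by all employee fields, which comes to the same thing under that assumption).

```python
from collections import defaultdict

def top_average_salaries(employees, salaries):
    """Inner-join on uid, average the salaries per employee, sort descending."""
    names = {uid: (given, family) for uid, given, family in employees}
    totals = defaultdict(lambda: [0, 0])  # uid -> [sum of salaries, count]
    for uid, salary in salaries:
        if uid in names:                  # inner join: skip unknown uids
            totals[uid][0] += salary
            totals[uid][1] += 1
    rows = [
        (uid,) + names[uid] + (total / count,)
        for uid, (total, count) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Invented sample data: (uid, givenname, familyname) and (uid, salary)
employees = [(1, "A", "X"), (2, "B", "Y")]
salaries = [(1, 100), (1, 200), (2, 300), (3, 999)]

for row in top_average_salaries(employees, salaries):
    print(row)
```

The salary row for uid 3 disappears because it has no matching employee, just as it would in the Pig JOIN.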

Problem about running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported errors:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;

After changing “spark.master” from “yarn-cluster” to “local” and adding “--hiveconf hive.root.logger=DEBUG,console” to the hive command, it printed out details like:

java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
        at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
        at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained the same even after I completed the replacement.
Then I found [HIVE-13301] in JIRA:

This is because calcite has a shaded 2.1.1 version of jackson-databind in it. You can probably remove that from the jar and leave the jackson-databind alone in the hive distro.

This explains everything clearly: Hive was using the jackson-databind-2.1.1 shaded into the calcite package instead of lib/jackson-databind-2.4.2.jar, so updating the latter had no effect.
Thus, we should remove the shaded jackson-databind-2.1.1 from calcite-avatica-1.5.0.jar:

cd ${HIVE_HOME}/lib/
mkdir tmp
cd tmp
# Extract classes from jar
jar -xf ../calcite-avatica-1.5.0.jar
# Remove old jackson-classes in calcite-avatica
find . -name "*jackson*"|xargs rm -rf
# Build new calcite-avatica jar without jackson-classes
jar -cf calcite-avatica-1.5.0.jar *
cp calcite-avatica-1.5.0.jar ../

Hive now uses lib/jackson-databind-2.4.2.jar and runs correctly.
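
The same clean-up can also be sketched with Python's zipfile module instead of extracting and repacking by hand. This is only an alternative sketch, not the procedure used above: the function name is invented, and the match on the substring "jackson" is an assumption about how the shaded classes are named inside the jar.

```python
import zipfile

def strip_jar_entries(src_path, dst_path, needle="jackson"):
    """Copy a jar entry by entry, skipping every entry whose path contains needle."""
    removed = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if needle in info.filename:
                removed.append(info.filename)
                continue
            # Reuse the original entry metadata and copy the bytes unchanged
            dst.writestr(info, src.read(info.filename))
    return removed
```

A call such as strip_jar_entries("calcite-avatica-1.5.0.jar", "calcite-avatica-1.5.0.stripped.jar") would return the list of removed entries, which makes it easy to check that only jackson classes were dropped before swapping the jar in (the output file name here is hypothetical).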