Problem running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported errors:

After changing “spark.master” from “yarn-cluster” to “local” and adding “--hiveconf hive.root.logger=DEBUG,console” to the hive command, it printed out details like:

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained even after I completed the replacement.
Then I found [HIVE-13301] in JIRA:


This explains everything clearly: Hive was loading the jackson-databind-2.1.1 classes bundled in the calcite package instead of lib/jackson-databind-2.4.2.jar, so updating the jar in lib/ had no effect.
Thus, we should remove the shaded jackson-databind-2.1.1 from calcite-avatica-1.5.0.jar:

Hive now uses lib/jackson-databind-2.4.2.jar and runs correctly.
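
For readers who want to check a jar before editing it, here is a minimal Python sketch of the same idea: list the entries a jar bundles under a package prefix, and write a copy of the jar without them. The prefix `com/fasterxml/jackson/databind/` is an assumption about how the classes are laid out inside calcite-avatica-1.5.0.jar, and the helper names are hypothetical. It is an illustration, not the command used above.

```python
import zipfile

# Assumption: the bundled Jackson classes sit under this path inside the jar.
JACKSON_PREFIX = "com/fasterxml/jackson/databind/"


def bundled_entries(jar_path, prefix=JACKSON_PREFIX):
    """Return the names of all jar entries that start with the prefix."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if name.startswith(prefix)]


def copy_without(jar_path, out_path, prefix=JACKSON_PREFIX):
    """Write a copy of the jar that omits every entry under the prefix.

    Returns the number of entries that were left out.
    """
    removed = 0
    with zipfile.ZipFile(jar_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename.startswith(prefix):
                removed += 1
                continue
            # Copy the entry unchanged, keeping its original metadata.
            dst.writestr(item, src.read(item.filename))
    return removed
```

Calling `bundled_entries` on each jar in Hive's lib directory shows which ones carry their own copy of the package. `copy_without` writes to a new file, so the original jar stays available as a backup.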

Using Linear Regression to filter SMS spam messages on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple program on Spark to classify normal and spam messages.

And the “build.sbt” file contains:

After submitting the job to YARN:

We can retrieve the job's log with:

And the result is:

From now on, we can treat a message with a negative value as normal and one with a positive value as spam (or use 10 instead of 0 as the boundary).
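
To make the thresholding idea concrete, here is a small plain-Python stand-in (not the Spark job itself): a least-squares linear regression over bag-of-words counts, trained by batch gradient descent, whose score is compared against a boundary. The four training messages, the -1/+1 label encoding, and the learning rate are all assumptions chosen for illustration; none of them come from the SMS Spam Collection.

```python
def tokenize(message):
    """Lower-case the message and split it on whitespace."""
    return message.lower().split()


def build_vocabulary(messages):
    """Map every distinct word to a column index."""
    vocab = {}
    for message in messages:
        for word in tokenize(message):
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(message, vocab):
    """Turn a message into a vector of word counts; unknown words are ignored."""
    vector = [0.0] * len(vocab)
    for word in tokenize(message):
        if word in vocab:
            vector[vocab[word]] += 1.0
    return vector


def train(samples, vocab, learning_rate=0.1, epochs=200):
    """Fit linear-regression weights by batch gradient descent on squared error."""
    rows = [(vectorize(text, vocab), label) for text, label in samples]
    weights = [0.0] * len(vocab)
    for _ in range(epochs):
        gradient = [0.0] * len(vocab)
        for features, label in rows:
            error = sum(w * x for w, x in zip(weights, features)) - label
            for i, x in enumerate(features):
                gradient[i] += error * x
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


def score(message, weights, vocab):
    """Raw regression output for a message."""
    return sum(w * x for w, x in zip(weights, vectorize(message, vocab)))


def classify(message, weights, vocab, boundary=0.0):
    """Scores above the boundary count as spam, everything else as normal."""
    return "spam" if score(message, weights, vocab) > boundary else "normal"


# Hypothetical toy data: +1.0 marks spam, -1.0 marks a normal message.
TRAINING = [
    ("win a free prize now", 1.0),
    ("free cash call now", 1.0),
    ("are we meeting for lunch", -1.0),
    ("see you at home tonight", -1.0),
]
VOCAB = build_vocabulary(text for text, _ in TRAINING)
WEIGHTS = train(TRAINING, VOCAB)

print(classify("free prize call now", WEIGHTS, VOCAB))
print(classify("see you for lunch", WEIGHTS, VOCAB))
```

The `boundary` parameter plays the role of the 0-versus-10 choice mentioned above: raising it makes the filter more conservative about calling something spam.
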
This is just an example: the sample dataset is too small, so it can only filter obvious spam messages. To identify more spam messages, we need to add more features such as ‘the topics of every message’, ‘total number of words’, ‘the frequency of special words’, etc.
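
As a rough sketch of two of those extra features, the function below computes the total number of words and the frequency of special words for one message. The word list is hypothetical and would need to be chosen from real data. Topic features are left out here because they need a separate model.

```python
# Hypothetical list of "special" words; a real list would come from the data.
SPECIAL_WORDS = frozenset({"free", "win", "prize", "cash", "urgent"})


def extract_features(message, special_words=SPECIAL_WORDS):
    """Return word-count based features for a single message."""
    # Strip common punctuation so that "prize!" still matches "prize".
    words = [w.strip(".,!?").lower() for w in message.split()]
    words = [w for w in words if w]
    total = len(words)
    special = sum(1 for w in words if w in special_words)
    return {
        "total_words": total,
        "special_count": special,
        # Guard against empty messages to avoid dividing by zero.
        "special_frequency": special / total if total else 0.0,
    }
```

These values could be appended to the word-count vector of each message before training.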

Why does my Spark job hang?

After I ran my small Spark machine learning application, the job hung and its Spark UI displayed nothing for more than 5 minutes.
That was weird, so I looked at some logs in the YARN UI:

I don’t have any IP that looks like “110.75.x.x”. Why was the Spark job trying to connect to it?
After reviewing the code carefully, I found the problem:

I had forgotten to add the IP to the HDFS URI. Thus, the correct code should be:

Now the application runs correctly.
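
A plausible reading of why the missing IP sends the job to an unknown address: in generic URI syntax, whatever follows `//` up to the next `/` is the authority (host and optional port), so the first directory of the path gets parsed as a host name. The sketch below shows this with Python's `urllib.parse` as a stand-in for the JVM's URI handling. The paths, the address 192.0.2.10, and port 9000 are made up for illustration.

```python
from urllib.parse import urlparse


def hdfs_authority(uri):
    """Return (host, port, path) as a generic URI parser sees them."""
    parsed = urlparse(uri)
    return parsed.hostname, parsed.port, parsed.path


# Host omitted: the first path component is taken as the host name.
print(hdfs_authority("hdfs://user/data/sms.txt"))

# Host and port given explicitly: the path is kept whole.
print(hdfs_authority("hdfs://192.0.2.10:9000/user/data/sms.txt"))
```

In the first call, the host comes out as `user` and the path loses its leading directory. A client that then tries to resolve that name may end up connecting to an unrelated address.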