Monthly Archives: October 2016

Problem running Hive-2.0.1 on Spark-1.6.2

When I launched Hive-2.0.1 on Spark-1.6.2, it reported an error:

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;

After changing “spark.master” from “yarn-cluster” to “local” and adding “--hiveconf hive.root.logger=DEBUG,console” to the hive command, it printed out details like:

java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
        at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
        at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)

This article suggests replacing the fasterxml.jackson package with a newer version, but the problem remained even after I completed the replacement.
Then I found HIVE-13301 in JIRA:

This is because calcite has a shaded 2.1.1 version of jackson-databind in it. You can probably remove that from the jar and leave the jackson-databind alone in the hive distro.

This explains everything clearly: Hive was loading the jackson-databind-2.1.1 classes shaded into the calcite package instead of lib/jackson-databind-2.4.2.jar, so updating the latter had no effect.
Thus, we should remove the shaded jackson-databind-2.1.1 from calcite-avatica-1.5.0.jar:

cd ${HIVE_HOME}/lib/
mkdir tmp
cd tmp
# Extract classes from jar
jar -xf ../calcite-avatica-1.5.0.jar
# Remove old jackson-classes in calcite-avatica
find . -name "*jackson*"|xargs rm -rf
# Build new calcite-avatica jar without jackson-classes
jar -cf calcite-avatica-1.5.0.jar *
cp calcite-avatica-1.5.0.jar ../

Hive now uses lib/jackson-databind-2.4.2.jar and runs correctly.
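To confirm that the rebuilt jar no longer carries a shaded copy of jackson, one option is to list its entries and look for “jackson” in their paths. The following is only a minimal sketch: the object name ShadedJacksonCheck and the helper jacksonEntries are hypothetical names chosen for illustration, and the jar path is expected as a command-line argument.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object ShadedJacksonCheck {
  // Returns the names of all jar entries whose path contains "jackson"
  // (case-insensitive). An empty result means no shaded copy was found.
  def jacksonEntries(jarPath: String): Seq[String] = {
    val jar = new JarFile(jarPath)
    try {
      jar.entries().asScala
        .map(_.getName)
        .filter(_.toLowerCase.contains("jackson"))
        .toList // force evaluation before the jar is closed
    } finally {
      jar.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // args(0) is the jar to inspect, e.g. the rebuilt calcite-avatica-1.5.0.jar
    val hits = jacksonEntries(args(0))
    if (hits.isEmpty) println("no jackson entries found")
    else hits.foreach(println)
  }
}
```

Running it against the original jar and then against the rebuilt one should show the difference; the same check can be pointed at any other jar in lib/ that is suspected of shading its own jackson.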

Using Linear Regression to filter SMS spam messages on Spark

Using the sample from “SMS Spam Collection v. 1”, I wrote a simple program on Spark to classify normal and spam messages.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
object SimpleRegression {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://127.0.0.1/user/robin/SMSSpamCollection")
    // startsWith avoids a StringIndexOutOfBoundsException on lines shorter than the prefix.
    val normal = smsData.filter(line => line.startsWith("ham\t"))
      .map(line => line.substring(4))
    val spam = smsData.filter(line => line.startsWith("spam\t"))
      .map(line => line.substring(5))
    // Create a HashingTF instance to map message text to vectors of 100,000 features.
    val tf = new HashingTF(numFeatures = 100000)
    // Each message is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(message => tf.transform(message.split(" ")))
    val normalFeatures = normal.map(message => tf.transform(message.split(" ")))
    // Label spam messages as 100 and normal messages as -100.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(100, features))
    val negativeExamples = normalFeatures.map(features => LabeledPoint(-100, features))
    val trainingData = positiveExamples.union(negativeExamples)
    trainingData.cache() // Cache since Linear Regression with SGD is an iterative algorithm.
    // Run Linear Regression using the SGD algorithm.
    val model = new LinearRegressionWithSGD().run(trainingData)
    // Test on a positive example (spam) and a negative one (normal).
    val posTest = tf.transform(
      ("Someone has contacted our dating service and entered your phone because they fancy you").split(" "))
    val negTest = tf.transform(
      ("Hi Dady, I started studying Spark the other").split(" "))
    println("Prediction for positive test example: " + model.predict(posTest))
    println("Prediction for negative test example: " + model.predict(negTest))
  }
}

The “build.sbt” file contains:

lazy val root = (project in file("."))
    .settings(
        name := "test",
        version := "1.0",
        scalaVersion := "2.10.6",
        unmanagedJars in Compile += file("/home/sanbai/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"),
        libraryDependencies ++= Seq(
            "org.apache.spark" % "spark-core_2.10" % "2.0.1",
            "org.apache.spark" % "spark-hive_2.10" % "2.0.1",
            "org.apache.spark" % "spark-mllib_2.10" % "2.0.1",
            "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
            "org.apache.hadoop" % "hadoop-client" % "2.7.2",
            "org.xerial.snappy" % "snappy-java" % "1.1.2"
        )
    )

Submit the job to YARN:

./bin/spark-submit --class SimpleRegression \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2G \
  --executor-memory 14G \
  --executor-cores 1 \
  --num-executors 1 \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/test_2.10-1.0.jar

After the job finishes, we can retrieve its log with:

bin/yarn logs -applicationId application_1473140384986_0096

And the result is:

Prediction for positive test example: 24.238025869328453
Prediction for negative test example: -34.879236141966544

From now on, we can treat a message with a negative prediction as normal and one with a positive prediction as spam (or use 10 instead of 0 as the boundary).
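The decision rule described here can be sketched in plain Scala. The object name SpamThreshold and the function label are hypothetical names for illustration; the default boundary of 0 and the alternative of 10 come from the text above.

```scala
object SpamThreshold {
  // Scores above the boundary are treated as spam, everything else as normal.
  def label(score: Double, boundary: Double = 0.0): String =
    if (score > boundary) "spam" else "normal"

  def main(args: Array[String]): Unit = {
    println(label(24.238025869328453))   // prints "spam"
    println(label(-34.879236141966544))  // prints "normal"
    println(label(5.0, boundary = 10.0)) // prints "normal" with the stricter boundary
  }
}
```

In the Spark program, the score would be the value returned by model.predict for a message's feature vector.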
This is just an example: the sample dataset is too small, so the model can only filter obvious spam messages. To identify more spam messages, we need to add more features such as ‘the topics of every message’, ‘total number of words’, ‘the frequency of special words’, etc.
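As a rough illustration of two of these extra features, the sketch below computes the total number of words and the fraction of “special” words in a message. The object name ExtraFeatures, the function extract, and the word list are assumptions made up for this example; a real list of special words would have to come from the data.

```scala
object ExtraFeatures {
  // Hypothetical list of "special" words, for illustration only.
  val specialWords: Set[String] = Set("free", "win", "prize", "call")

  // Returns (total number of words, fraction of words that are special).
  def extract(message: String): (Int, Double) = {
    val words = message.toLowerCase.split("\\s+").filter(_.nonEmpty)
    val special = words.count(specialWords.contains)
    val ratio = if (words.isEmpty) 0.0 else special.toDouble / words.length
    (words.length, ratio)
  }
}
```

Values like these could then be combined with the hashed term-frequency vector before training, so that the model sees them alongside the per-word features.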

Why does my Spark job hang?

After I launched my small machine learning application on Spark, the job hung and its Spark UI displayed nothing for more than 5 minutes.
That was weird, so I looked at the logs in the YARN UI:

16/09/30 17:05:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 0 time(s); maxRetries=45
16/09/30 17:05:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 1 time(s); maxRetries=45
16/09/30 17:06:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 2 time(s); maxRetries=45
16/09/30 17:06:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 3 time(s); maxRetries=45
16/09/30 17:06:53 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 4 time(s); maxRetries=45
16/09/30 17:07:13 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 5 time(s); maxRetries=45
16/09/30 17:07:33 INFO ipc.Client: Retrying connect to server: user/110.75.167.140:8020. Already tried 6 time(s); maxRetries=45

I don’t have any IP that looks like “110.75.x.x”. Why is the Spark job trying to connect to it?
After reviewing the code carefully, I found the problem:

    val conf = new SparkConf().setAppName("Simple Regression")
    val sc = new SparkContext(conf)
    val smsData = sc.textFile("hdfs://user/sanbai/SMSSpamCollection")

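I had forgotten to add the NameNode address to the HDFS URI, so “user” was parsed as the host name; the log prints the server as host/IP:port, which is why it shows “user/110.75.167.140:8020”. Thus, the correct code should be: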

    val smsData = sc.textFile("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection")

Now the application runs correctly.
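The mistake is easy to see by parsing both URIs with java.net.URI, which splits a URI into host and path in the same way the HDFS path was interpreted here. This is only a small sketch; the object name HdfsUriCheck and the helper hostAndPath are hypothetical names for illustration.

```scala
import java.net.URI

object HdfsUriCheck {
  // Returns (host, path) as java.net.URI parses them;
  // host is None when the URI has no authority part.
  def hostAndPath(uri: String): (Option[String], String) = {
    val u = new URI(uri)
    (Option(u.getHost), u.getPath)
  }

  def main(args: Array[String]): Unit = {
    // Wrong: "user" becomes the host, and the path loses its first component.
    println(hostAndPath("hdfs://user/sanbai/SMSSpamCollection"))
    // Right: the address is the host and the full path is preserved.
    println(hostAndPath("hdfs://127.0.0.1/user/sanbai/SMSSpamCollection"))
    // No authority at all: the host is absent.
    println(hostAndPath("hdfs:///user/sanbai/SMSSpamCollection"))
  }
}
```

In the third form, with three slashes, the URI carries no host; to my understanding Hadoop then falls back to the file system configured in fs.defaultFS, which is another way to avoid hard-coding the address.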

Robin on Linux
