terasort

Use Oozie to run terasort

For running the terasort test case in Oozie, a “Java Action” is a better choice than a “MapReduce Action”, because terasort needs to run

TeraInputFormat.writePartitionFile(job, partitionFile);

first and then load the ‘partitionFile’ with “TotalOrderPartitioner”. It is not a simple MapReduce job that needs merely a few properties.
The directory of this “TerasortApp”, which uses the “Java Action” of Oozie, looks like this:

TerasortApp/
├── job.properties
├── lib
│   └── hadoop-mapreduce-examples.jar
└── workflow.xml

The core of this App is “workflow.xml”:

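<!-- The element tags and attribute values of this listing were lost. The tags
     are restored from the Oozie workflow schema; the schema version, the names
     "terasort-wf", "teragen", "terasort", "fail" and "end", and the elided
     <delete> paths are placeholders, not the original values. -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="terasort-wf">
  <start to="teragen"/>
  <action name="teragen">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="..."/>
      </prepare>
      <main-class>org.apache.hadoop.examples.terasort.TeraGen</main-class>
      <arg>-Dmapred.map.tasks=96</arg>
      <arg>${numRows}</arg>
      <arg>${inputDir}</arg>
    </java>
    <ok to="terasort"/>
    <error to="fail"/>
  </action>
  <action name="terasort">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="..."/>
      </prepare>
      <configuration>
        <property>
          <name>mapreduce.input.fileinputformat.split.minsize</name>
          <value>4294967296</value>
        </property>
      </configuration>
      <main-class>org.apache.hadoop.examples.terasort.TeraSort</main-class>
      <arg>${inputDir}</arg>
      <arg>${outputDir}</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Failed to terasort!</message>
  </kill>
  <end name="end"/>
</workflow-app>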

Note 1. In a Cloudera environment, the web UI fails at the last step, which creates the sharelib for the Oozie service. To fix this problem, create the sharelib from the command line and then list it to check the result:

$ sudo -u oozie /usr/lib/oozie/bin/oozie-setup.sh sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn/
$ sudo -u oozie oozie admin -shareliblist -oozie http://localhost:11000/oozie
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig

Note 2. The ‘mapred.map.tasks’ property cannot be used to change the number of mappers in TeraSort, because the number of map tasks is determined by the input splits that the input format computes, not by this property. I therefore set the ‘mapreduce.input.fileinputformat.split.minsize’ property to enlarge the splits and so limit the number of mappers.
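
To make the effect of the minimum split size concrete, the sketch below applies the split-size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), and estimates the resulting number of splits. The 128 MiB block size and the 1 TiB input size are assumed values for illustration only; they are not figures from this test, and the estimate ignores file boundaries.

```scala
object SplitEstimate {
  // Split size as computed by Hadoop's FileInputFormat:
  // max(minSize, min(maxSize, blockSize)).
  def splitSize(minSize: Long, maxSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(maxSize, blockSize))

  // Rough number of splits (and therefore map tasks) for a given input size.
  // Ignores per-file boundaries and the slack Hadoop allows on a file's last split.
  def estimateSplits(totalBytes: Long, size: Long): Long =
    (totalBytes + size - 1) / size

  def main(args: Array[String]): Unit = {
    val blockSize  = 128L * 1024 * 1024          // assumed HDFS block size
    val totalBytes = 1024L * 1024 * 1024 * 1024  // assumed 1 TiB of input

    val defaultSize = splitSize(1L, Long.MaxValue, blockSize)
    val raisedSize  = splitSize(4294967296L, Long.MaxValue, blockSize)

    println(s"default split size: ${estimateSplits(totalBytes, defaultSize)} splits")
    println(s"minsize = 4 GiB:    ${estimateSplits(totalBytes, raisedSize)} splits")
  }
}
```

With these assumed numbers, the estimate drops from 8192 splits to 256 once the minimum split size is raised to 4294967296 bytes.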

Terasort for Spark (part2 / 2)

In the previous article, we used Spark to sort a large dataset generated by Teragen. But it took much longer than the Hadoop MapReduce framework, so we are going to optimize it.
Profiling the job in the Spark UI shows that the “Shuffle” stage reads and writes too much data from and to the hard disk, which hurts performance severely.

Hadoop's “Terasort” uses the “TotalOrderPartitioner” class to distribute all the data over a large number of partitions by key range, so every “Reduce” task only needs to sort the data of its own partition, with almost no further exchange between partitions. This saves a lot of network bandwidth and CPU usage.
Therefore we could modify our Scala code to sort every partition locally:

    logData.partitionBy(new TeraSortPartitioner(512))
      .mapPartitions(iter => {
        iter.toVector.sortBy(kv => kv._1.getBytes).iterator
      })
      .saveAsNewAPIHadoopFile[TeraOutputFormat]("hdfs://127.0.0.1/output")
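
The reason a local sort is enough can be checked without Spark. The following is a minimal sketch in plain Scala collections, assuming small integer keys instead of TeraSort records: when the partitioner assigns contiguous key ranges to partitions in increasing order, sorting each partition independently and concatenating the partitions by index gives a globally sorted result. The names `LocalSortSketch`, `partitionOf` and `rangeThenLocalSort` are hypothetical and exist only for this illustration.

```scala
object LocalSortSketch {
  // Hypothetical stand-in for a range partitioner: keys in [0, keySpace)
  // are mapped to one of numPartitions contiguous, increasing ranges.
  def partitionOf(key: Int, keySpace: Int, numPartitions: Int): Int =
    math.min((key.toLong * numPartitions / keySpace).toInt, numPartitions - 1)

  // Range-partition the keys, sort each partition on its own, then
  // concatenate the partitions in index order.
  def rangeThenLocalSort(keys: Seq[Int], keySpace: Int, numPartitions: Int): Seq[Int] = {
    val byPartition = keys.groupBy(k => partitionOf(k, keySpace, numPartitions))
    (0 until numPartitions).flatMap { p =>
      byPartition.getOrElse(p, Seq.empty[Int]).sorted
    }
  }

  def main(args: Array[String]): Unit = {
    val rng  = new scala.util.Random(42)
    val keys = Seq.fill(1000)(rng.nextInt(1 << 16))
    val result = rangeThenLocalSort(keys, 1 << 16, 8)
    // The locally sorted partitions, read in order, match a full sort.
    println(result == keys.sorted)
  }
}
```

In the Spark job the same idea applies per partition: `partitionBy` performs the single shuffle, and `mapPartitions` sorts each partition in memory without moving data again.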

The spark-submit command should also be changed:

./bin/spark-submit --class TerasortApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2000M \
  --executor-memory 5200M \
  --executor-cores 1 \
  --num-executors 64 \
  --conf spark.yarn.executor.memoryOverhead=900 \
  --conf spark.shuffle.memoryFraction=0.6 \
  --conf spark.kryoserializer.buffer.max=2000m \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/Terasort_2.10-1.0.jar

This time, the job took only 10 minutes to sort the data!
Screenshot from the “Job Browser” of Hue:

Terasort for Spark (part1 / 2)

We can use Spark to sort the data generated by Hadoop's Teragen.
TerasortApp.scala

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.Partitioner
import org.apache.spark.rdd._
import org.apache.hadoop.examples.terasort.TeraInputFormat
import org.apache.hadoop.examples.terasort.TeraOutputFormat
import org.apache.hadoop.io.Text
import com.google.common.primitives.Longs
import com.google.common.primitives.UnsignedBytes
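// Assigns each key to a partition by splitting the range of its 7-byte prefix evenly.
case class TeraSortPartitioner(numPartitions: Int) extends Partitioner {
  import TeraSortPartitioner._
  val rangePerPart = (max - min) / numPartitions
  override def getPartition(key: Any): Int = {
    val b = key.asInstanceOf[Text].getBytes()
    // Pack the first 7 key bytes into a non-negative Long (the leading byte is 0).
    val prefix = Longs.fromBytes(0, b(0), b(1), b(2), b(3), b(4), b(5), b(6))
    // Clamp: the integer division can yield numPartitions for prefixes close to max.
    math.min((prefix / rangePerPart).toInt, numPartitions - 1)
  }
}

object TeraSortPartitioner {
  val min = Longs.fromBytes(0, 0, 0, 0, 0, 0, 0, 0)
  val max = Longs.fromBytes(0, -1, -1, -1, -1, -1, -1, -1)  // 0xff = -1
}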
object TerasortApp {
  // Compare keys as unsigned bytes in lexicographic order.
  implicit val unsignedBytesOrdering = UnsignedBytes.lexicographicalComparator

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .registerKryoClasses(Array(classOf[Text]))
      .setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.newAPIHadoopFile("hdfs://127.0.0.1/tera", classOf[TeraInputFormat], classOf[Text], classOf[Text])
    logData.partitionBy(new TeraSortPartitioner(logData.partitions.size))
      .sortBy(kv => kv._1.getBytes)
      .saveAsNewAPIHadoopFile[TeraOutputFormat]("hdfs://127.0.0.1/output")
  }
}
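
How the partitioner turns a key into a partition index can be tried out without Spark, Hadoop or Guava. The sketch below is an illustration under stated assumptions rather than the code used in the job: it re-creates the 7-byte big-endian prefix with plain Scala, assumes keys of at least 7 bytes (TeraGen keys are 10 bytes long), and clamps the result so that the largest prefixes stay inside the valid range. The names `PrefixPartitionSketch`, `prefixOf` and `partitionOf` are hypothetical.

```scala
object PrefixPartitionSketch {
  // Largest value representable in 7 unsigned bytes.
  val maxPrefix: Long = (1L << 56) - 1

  // Pack the first 7 key bytes into a non-negative Long, big-endian,
  // treating every byte as unsigned. Assumes the key has at least 7 bytes.
  def prefixOf(key: Array[Byte]): Long =
    key.take(7).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))

  // Divide the prefix range evenly; clamp because the integer division
  // can return numPartitions for prefixes close to maxPrefix.
  def partitionOf(key: Array[Byte], numPartitions: Int): Int = {
    val rangePerPart = maxPrefix / numPartitions
    math.min((prefixOf(key) / rangePerPart).toInt, numPartitions - 1)
  }

  def main(args: Array[String]): Unit = {
    val lowest  = Array.fill[Byte](10)(0.toByte)
    val middle  = 0x80.toByte +: Array.fill[Byte](9)(0.toByte)
    val highest = Array.fill[Byte](10)(0xff.toByte)
    println(partitionOf(lowest, 512))
    println(partitionOf(middle, 512))
    println(partitionOf(highest, 512))
  }
}
```

With 512 partitions, the all-zero key lands in partition 0, the key starting with 0x80 in partition 256, and the all-0xff key in partition 511. Without the clamp, the last key would map to index 512, which is outside the valid range.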

build.sbt

lazy val root = (project in file("."))
    .settings(
        name := "Terasort",
        version := "1.0",
        scalaVersion := "2.10.6",
        unmanagedJars in Compile += file("/home/sanbai/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar"),
        libraryDependencies ++= Seq(
            "org.apache.spark" % "spark-core_2.10" % "1.6.2",
            "org.apache.hadoop" % "hadoop-client" % "2.7.2"
        )
    )

After building the jar file, we can submit it to Spark (I run Spark in yarn-cluster mode):

./bin/spark-submit --class TerasortApp \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2000M \
  --executor-memory 2000M \
  --executor-cores 1 \
  --num-executors 128 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.shuffle.memoryFraction=0.9 \
  --conf spark.storage.memoryFraction=0.9 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=85" \
  --queue spark \
  /home/sanbai/myspark/target/scala-2.10/Terasort_2.10-1.0.jar

It takes 17 minutes to complete the task, whereas Hadoop's “terasort” tool takes only 8 minutes to sort the same data. The reason is that I haven't used TotalOrderPartitioner, so Spark has to sort all the data across partitions (and therefore across servers), which consumes a lot of network resources and slows the job down.

Remember to build the app with scala-2.10 for Spark-1.6.x; otherwise Spark will report an error like:
scala.runtime.VolatileObjectRef.zero()Lscala/runtime/VolatileObjectRef

Robin on Linux
