Read paper “iShuffle: Improving Hadoop Performance with Shuffle-on-Write”

Paper reference: iShuffle: Improving Hadoop Performance with Shuffle-on-Write

A job in Hadoop consists of three main stages: map, shuffle, and reduce (in Hadoop's implementation, the shuffle stage is actually folded into the reduce stage).
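The three stages can be illustrated with a plain in-memory word count, no Hadoop required (a minimal sketch; class and method names are my own):

```java
import java.util.*;
import java.util.stream.*;

// Standalone sketch of the three stages of a Hadoop job,
// illustrated with an in-memory word count.
public class WordCountStages {

    // Map stage: each input line is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle stage: group all values by key (in a real cluster, this is
    // also where the groups move across the network to the reducer nodes).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce stage: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[]{"a b a", "b c"})
            pairs.addAll(map(line));
        System.out.println(reduce(shuffle(pairs))); // {a=2, b=2, c=1}
    }
}
```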

What is the problem?
The shuffle phase needs to migrate a large amount of data from the nodes running map tasks to the nodes that will run reduce tasks. This causes shuffle latency, which is usually significant, for two reasons:

  • Partitioning skew: Hadoop uses a hash function to assign map output records to partitions; if too many keys hash to the same partition, the partitions end up with uneven sizes
  • Coupling of shuffle and reduce: since shuffling is part of the reduce stage, data shuffling can't be overlapped with map tasks
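Partitioning skew is easy to reproduce with the same formula Hadoop's default `HashPartitioner` uses (a minimal sketch; the hot-key data is made up for illustration):

```java
import java.util.*;

// Sketch of partitioning skew: Hadoop's default HashPartitioner sends
// each key to partition hash(key) % numPartitions, so a few heavy keys
// can make one partition far larger than the rest.
public class SkewDemo {

    // Same formula as Hadoop's HashPartitioner.getPartition().
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many records land in each partition.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String k : keys) sizes[partitionOf(k, numPartitions)]++;
        return sizes;
    }

    public static void main(String[] args) {
        // 90% of the records share one hot key -> its partition dominates,
        // and the reduce task that owns it becomes the straggler.
        List<String> keys = new ArrayList<>(Collections.nCopies(900, "hot-key"));
        for (int i = 0; i < 100; i++) keys.add("key-" + i);
        System.out.println(Arrays.toString(partitionSizes(keys, 4)));
    }
}
```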

Solution: iShuffle

    • A “Shuffler” on each node collects the intermediate data generated by every map task and predicts the size of each partition
    • The “Shuffler Manager” collects information from the Shufflers and decides the placement of partitions

    • Shuffle-on-Write: while a map task writes a spill to the local filesystem, it (via a modification of the Hadoop code) also writes the spill to the node where the corresponding reduce task will be launched
    • Automated map output placement: iShuffle decides the placement of every partition using “map selectivity”, the ratio of map output size to map input size. After predicting the map selectivity and knowing the total input size, iShuffle can choose the optimal node for each partition

  • Flexible reduce scheduling: when a node requests a reduce task, the Task Manager (a modification of Hadoop's FIFO scheduler) finds the list of partitions residing on that node and launches reduce tasks only for those partitions (this makes sure a reduce task only reads shuffled data from the local filesystem, which saves network bandwidth during the reduce stage)
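The prediction-and-placement idea can be sketched in plain Java (my own simplification of the paper's approach, with made-up names and a simple greedy balancer rather than the paper's exact algorithm):

```java
import java.util.*;

// Sketch of iShuffle-style placement: predict each partition's final
// size from "map selectivity" (output size / input size observed so
// far), then greedily place the largest predicted partitions on the
// least-loaded nodes.
public class PartitionPlacement {

    // Predicted final size of each partition, scaling the per-partition
    // output observed so far up to the whole job's input.
    static double[] predictSizes(double[] observedOutputPerPartition,
                                 double observedInput, double totalInput) {
        double observedOutput = Arrays.stream(observedOutputPerPartition).sum();
        double selectivity = observedOutput / observedInput;   // output bytes per input byte
        double predictedTotalOutput = totalInput * selectivity;
        double[] predicted = new double[observedOutputPerPartition.length];
        for (int i = 0; i < predicted.length; i++)
            predicted[i] = predictedTotalOutput * observedOutputPerPartition[i] / observedOutput;
        return predicted;
    }

    // Greedy balanced placement: biggest partition first, onto the node
    // with the smallest load so far. Returns the node index per partition.
    static int[] place(double[] predictedSizes, int numNodes) {
        Integer[] order = new Integer[predictedSizes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(predictedSizes[b], predictedSizes[a]));
        double[] load = new double[numNodes];
        int[] nodeOf = new int[predictedSizes.length];
        for (int p : order) {
            int best = 0;
            for (int n = 1; n < numNodes; n++) if (load[n] < load[best]) best = n;
            nodeOf[p] = best;
            load[best] += predictedSizes[p];
        }
        return nodeOf;
    }

    public static void main(String[] args) {
        // 100 bytes of input seen so far produced [40, 30, 20, 10] bytes
        // for four partitions; the whole job has 1000 bytes of input.
        double[] predicted = predictSizes(new double[]{40, 30, 20, 10}, 100, 1000);
        System.out.println(Arrays.toString(predicted));        // [400.0, 300.0, 200.0, 100.0]
        System.out.println(Arrays.toString(place(predicted, 2)));
    }
}
```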

In my opinion
Using prediction to proactively move map output to suitable nodes, which avoids partition skew, is the most intelligent part of this paper. The technique could also be applied to other scenarios that move intermediate data, such as OLAP in a data warehouse.
But I also suspect that in real production, not many organizations will use iShuffle, as they usually run multi-user applications on their Hadoop systems. When many jobs run in one Hadoop cluster simultaneously, a trough in CPU usage caused by the long reduce latency of one job will be filled by other compute-intensive jobs. So, from the perspective of all users, no hardware resources are wasted.

Install CDH(Cloudera Distribution Hadoop) by Cloudera Manager

These days I have been trying to install Cloudera-5.8.3 on my CentOS 7 machines. Here are the steps, plus some troubleshooting tips:

0. If you are not in the USA, access to the Cloudera repository of RPMs (or parcels) is desperately slow, so we need to mirror the CM (Cloudera Manager) repo and the CDH repo locally.

Create local CM Repo

Create local CDH Repo

1. Install Cloudera Manager (steps)

2. Start Cloudera Manager

But it reported:

On CentOS 7, the solution is:

We also need to run “sudo ./cloudera-manager-installer.bin --skip_repo_package=1” to create “”.

3. Log in to Cloudera Manager (port 7180) and follow the wizard to create a new cluster. (Choosing the local repository for installation gives a pleasantly fast speed 🙂)

Make sure the hostname of every node is correct. The “Host Inspector” can also reveal many potential problems on these machines.

After trying many times to set up the cluster, I found this error in the logs of some nodes:

The solution is simple:

then restart the Cloudera Manager Agent on these nodes.

I also ran into a problem where the installation progress hung on this message:

There wasn't any “yum” process running on the node, so why did it still hold the installation lock? The answer is:

4. After many failures and retries, I eventually set up the Hadoop ecosystem of CDH:


When upgrading or downgrading a Cloudera cluster, you may see this problem:

The solution (if in “single user mode”) is:

and try it again.

When starting the ResourceManager, it failed and reported:

The reason for this error: a non-Cloudera version of ZooKeeper was installed on the host. Remove it, reinstall ZooKeeper from CDH, and the YARN ResourceManager will launch successfully.

If you meet “Deploy Client Configuration failed” when creating a new service, just grant passwordless sudo to the cloudera-scm user.

Using Pig to join two tables and sort the result

Given two tables, salary and employee, we can use Pig to find the highest-salary employees:

The result is:

Terasort for Spark (part2 / 2)

In the previous article, we used Spark to sort a large dataset generated by Teragen. But it took much more time than the Hadoop MapReduce framework, so we are going to optimize it.

Looking at the Spark UI for profiling, we find that the shuffle reads/writes too much data from/to the hard disk, which severely hurts performance.

In Hadoop's Terasort, the class TotalOrderPartitioner maps all the data into a large number of ordered partitions, so every reduce task only needs to sort the data within its own partition (almost no shuffling from other partitions is needed). This saves a lot of network bandwidth and CPU.
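The idea behind TotalOrderPartitioner can be sketched in plain Java (a minimal sketch with hard-coded split points; Hadoop derives them by sampling the input):

```java
import java.util.*;

// Sketch of range partitioning: route each key to the range it falls
// in, sort each partition locally, and concatenating the partitions in
// order already yields a globally sorted result -- no cross-partition
// sorting or extra shuffling needed.
public class RangeSortSketch {

    // Partition 0 holds keys < splits[0], partition i holds keys in
    // [splits[i-1], splits[i]), the last partition holds the rest.
    static int rangeOf(int key, int[] splits) {
        int p = 0;
        while (p < splits.length && key >= splits[p]) p++;
        return p;
    }

    static List<List<Integer>> rangeSort(List<Integer> data, int[] splits) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) parts.add(new ArrayList<>());
        for (int k : data) parts.get(rangeOf(k, splits)).add(k);
        for (List<Integer> p : parts) Collections.sort(p);  // local sort only
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(7, 42, 3, 99, 55, 18);
        // Split points 20 and 60 define three ordered ranges.
        System.out.println(rangeSort(data, new int[]{20, 60})); // [[3, 7, 18], [42, 55], [99]]
    }
}
```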

Therefore we can modify our Scala code to sort every partition locally:

and the spark-submit command should also be changed:

This time, the job took only 10 minutes to sort the data!

Screenshot from “Job Browser” of Hue:

Terasort for Spark (part1 / 2)

We can use Spark to sort all the data generated by Hadoop's Teragen.



After building the jar file, we can submit it to Spark (I run Spark in yarn-cluster mode):

It takes 17 minutes to complete the task, but the “terasort” tool from Hadoop only takes 8 minutes to sort all the data. The reason is that I haven't used TotalOrderPartitioner, so Spark has to sort all the data across different partitions (and different servers), which costs a lot of network resources and delays the job.

Remember to use Scala 2.10 to build apps for Spark 1.6.x, otherwise Spark will report an error like:

Some problems about programming Mapreduce

1. After submitting the job, the console reported:

The reason is that I forgot to call setJarByClass():

2. When the job finished, I found the reducer hadn't run at all. The reason is that I hadn't overridden the correct reduce() member function of Reducer, so the MapReduce framework ignored it without any notification or warning. To make sure we override the correct member function of the parent class, we need to add the @Override annotation:
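The pitfall itself can be demonstrated in plain Java, without Hadoop (a minimal sketch; the class names are made up, but the dispatch behavior is exactly what bites a mis-declared reduce()):

```java
import java.util.*;

// A method whose signature does not exactly match the parent's does NOT
// override it -- it merely overloads, and is silently ignored when the
// method is called through the parent type. Adding @Override turns this
// silent bug into a compile-time error.
public class OverridePitfall {

    static class BaseReducer {
        // The "framework" calls this exact signature.
        String reduce(String key, List<Integer> values) {
            return "identity:" + key;  // default pass-through, no real reducing
        }
    }

    static class MyReducer extends BaseReducer {
        // Oops: Iterator instead of List -> this is an overload, not an
        // override. Writing @Override here would make javac reject it.
        String reduce(String key, Iterator<Integer> values) {
            return "summed:" + key;
        }
    }

    public static void main(String[] args) {
        BaseReducer r = new MyReducer();
        // The framework-style call dispatches to the parent's version:
        System.out.println(r.reduce("k", Arrays.asList(1, 2))); // identity:k
    }
}
```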

Use MapReduce to join two datasets

The two datasets are:

To join the two tables above by “student id”, we need to use MultipleInputs. The code is:

Compile and run it:

And the result in /my is:
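The reduce-side join that MultipleInputs enables can also be sketched in plain Java (my own in-memory simplification; the field layout and tag names are assumptions, not the post's actual code):

```java
import java.util.*;

// In-memory sketch of a reduce-side join: each "mapper" tags its records
// with the source table, the "shuffle" groups both tables' records by
// student id, and the "reducer" combines the two sides per id.
public class ReduceSideJoin {

    // A record tagged with which input it came from.
    static class Tagged {
        final String source, value;
        Tagged(String source, String value) { this.source = source; this.value = value; }
    }

    // Shuffle: group the tagged records of both tables by student id.
    static Map<String, List<Tagged>> shuffleByStudentId(List<String[]> students,
                                                        List<String[]> scores) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String[] s : students)   // rows of [id, name]
            groups.computeIfAbsent(s[0], k -> new ArrayList<>()).add(new Tagged("student", s[1]));
        for (String[] s : scores)     // rows of [id, score]
            groups.computeIfAbsent(s[0], k -> new ArrayList<>()).add(new Tagged("score", s[1]));
        return groups;
    }

    // Reducer: pair every name with every score sharing the same id.
    static List<String> join(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        groups.forEach((id, tagged) -> {
            for (Tagged a : tagged) if (a.source.equals("student"))
                for (Tagged b : tagged) if (b.source.equals("score"))
                    out.add(id + "\t" + a.value + "\t" + b.value);
        });
        return out;
    }

    public static void main(String[] args) {
        List<String[]> students = Arrays.asList(new String[]{"1", "Tom"}, new String[]{"2", "Ann"});
        List<String[]> scores = Arrays.asList(new String[]{"1", "90"}, new String[]{"2", "85"});
        System.out.println(join(shuffleByStudentId(students, scores)));
    }
}
```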

Use MapReduce to find prime numbers

Just wanted to write a small Hadoop MapReduce example for finding prime numbers. The first question: how can I generate the numbers from 1 to 1000000 in my own application, instead of reading them from a file on HDFS? The answer: subclass InputSplit, RecordReader, and InputFormat yourself, just as the teragen program does.
Then comes the second question: can I use mappers without a reducer stage? The answer is yes: simply call job.setNumReduceTasks(0) to disable the reducer stage.
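The core of the mapper-only job is just the naive trial-division test (a minimal standalone sketch of that check, outside any Hadoop classes):

```java
// Sketch of the mapper-only logic: each generated number is checked with
// a naive trial-division test; with setNumReduceTasks(0), every mapper's
// output would be written directly to HDFS without any shuffle.
public class PrimeCheck {

    // Naive primality test: try every divisor up to sqrt(n).
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        // What a single mapper would do for its slice of the numbers:
        for (long n = 1; n <= 20; n++)
            if (isPrime(n)) System.out.print(n + " "); // 2 3 5 7 11 13 17 19
        System.out.println();
    }
}
```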

The complete code is here (I know the algorithm for checking whether a number is prime is naive, but it works):

Copy the code to a file, then compile and run it:

Some tips about Hive

Here are some tips about Hive from my learning process:

1. When I started “bin/hive” for the first time, these errors were reported:

The solution is simple:

Actually, we'd better use MySQL instead of Derby in a multi-user environment.

2. Control the number of mappers for SQL jobs. If a SQL job uses too many mappers, the context switching between processes (including frequent JVM launches and stops) will cost extra CPU. We can use

to change the number of mappers for all SQL jobs.

3. After I imported 1TB of data into an ORC-format table, the size of the table was just 250GB; after importing the same 1TB into a Parquet-format table, the size was 900GB. It looks like Apache ORC has a more effective compression algorithm for our data.

4. Use partitions carefully.

Now we have a table named “users”, partitioned by the field “ca”.

Now there is a record in the HDFS directory “/user/hive/warehouse/users/ca=China/”.
In the book <>, it says we can copy the data in a partition directory to AWS S3 and then point the partition at it. But what if I point the partition at a new, empty HDFS directory? Let's try:

Because the partition has been pointed at an empty directory, the select can't find any records now. That is what “schema on read” means.

5. Debug.

This prints a lot of debug information for finding causes, such as:

C++/Java developers needed

I worked at Alibaba Group for more than 9 years. Currently I am working at Alimama, a subsidiary of Alibaba Group and the biggest advertisement-publishing company in China. At present, we need C++/Java developers to build new back-end basic services for our new business.

[Job Description]

Role: C++/Java Developer for storage system or high performance computing

Location: Beijing

Your responsibilities:

1. Building and optimizing the distributed key-value storage system
2. Building and optimizing the distributed computing engine of Linear Regression algorithm
3. Building and maintaining the backend service for Advertisement Publishing System

Skills & experience required:

1. Familiar with storage systems or high-performance computing systems
2. Strong background in Redis/RocksDB/Hadoop/GlusterFS
3. Very familiar with one of C/C++/Java/Scala
4. More than 3 years' experience as a developer of storage systems or HPC
5. Passionate about new technologies and wanting to continuously push the boundaries

Anyone who is interested in the job above can send an email to my address: