Monthly Archives: July 2016

Partitioning and Bucketing Hive table

In the previous article, we used sample datasets to join two tables in Hive. To improve join performance, we can also use partitioning and bucketing. Let’s first create a Parquet table with a partition column and buckets:

CREATE TABLE employee_p (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    work_day DATE)
PARTITIONED BY (gender CHAR(1))
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;
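
To get a feeling for what "CLUSTERED BY (employee_id) INTO 8 BUCKETS" does, here is a minimal sketch in plain Java. It assumes the classic Hive rule (hash of the column, sign bit cleared, modulo the bucket count); newer Hive versions may use a different hash function, so treat the exact bucket numbers as an illustration only. The class name BucketSketch and the sample ids are made up for this example.

```java
class BucketSketch {
    // Assumed classic rule: clear the sign bit of the hash, then take the remainder.
    static int bucketFor(int employeeId, int numBuckets) {
        int hash = Integer.hashCode(employeeId); // for an int this is the value itself
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int[] ids = {10001, 10002, 10008, 10009, 499795}; // sample ids, chosen arbitrarily
        for (int id : ids) {
            System.out.println(id + " -> bucket " + bucketFor(id, 8));
        }
    }
}
```

The point is that the same employee_id always lands in the same bucket number, which is what a later join on employee_id can take advantage of.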

Then import data into it:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, gender, work_day
                  FROM employee;

But it reports an error:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795","_col1":"1961-07-05","_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:257)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"499795","_col1":"1961-07-05","_col2":"Idoia","_col3":"Riefers","_col4":"M","_col5":"1986-01-16"}}
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:245)
        ... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:922)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:699)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:97)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:236)
        ... 7 more

The employees have only two genders, “M” and “F”, so how could Hive report “too many dynamic partitions”?
To find the root cause, I put “explain” in front of my HQL and finally noticed this line:

expressions: UDFToInteger(VALUE._col0) (type: int), CAST( VALUE._col1 AS DATE) (type: date), VALUE._col2 (type: string), VALUE._col3 (type: string), CAST( VALUE._col4 AS DATE) (type: date), VALUE._col5 (type: string)

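Hive maps the SELECT columns to the table by position, not by name. Here “_col4” (gender) is cast to DATE for work_day, and the last column “_col5” (work_day) is taken as the partition column, so every distinct work_day tries to create its own partition. The correct import HQL must put the partition column last: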

INSERT OVERWRITE TABLE employee_p PARTITION(gender)
                SELECT employee_id, birthday, first_name, family_name, work_day, gender
                  FROM employee;
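
The positional rule can be sketched in plain Java. This is only a simulation under the assumption that the dynamic partition value is always read from the last selected column. The first row comes from the error message above, the other two rows are invented for illustration, and DynamicPartitionSketch is a hypothetical name.

```java
import java.util.Set;
import java.util.TreeSet;

class DynamicPartitionSketch {
    // The partition value is read from the LAST column of each row, whatever its name is.
    static Set<String> partitionValues(String[][] rows) {
        Set<String> values = new TreeSet<>();
        for (String[] row : rows) {
            values.add(row[row.length - 1]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Column order: ..., gender, work_day (the wrong order)
        String[][] wrong = {
            {"499795", "1961-07-05", "Idoia", "Riefers", "M", "1986-01-16"},
            {"1", "1960-01-01", "A", "B", "F", "1990-02-03"},   // invented row
            {"2", "1962-02-02", "C", "D", "M", "1991-04-05"},   // invented row
        };
        // Column order: ..., work_day, gender (partition column last)
        String[][] right = {
            {"499795", "1961-07-05", "Idoia", "Riefers", "1986-01-16", "M"},
            {"1", "1960-01-01", "A", "B", "1990-02-03", "F"},
            {"2", "1962-02-02", "C", "D", "1991-04-05", "M"},
        };
        System.out.println("wrong order: " + partitionValues(wrong));
        System.out.println("right order: " + partitionValues(right));
    }
}
```

With the wrong order every distinct date becomes a partition value, so a real table quickly exceeds the limit of 100. With the right order there are only two values.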

The data is now imported successfully with dynamic partitions.
Now we create a new Parquet table “salary_p” (using buckets) and join the two tables:

CREATE TABLE salary_p (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
CLUSTERED BY (employee_id) INTO 8 BUCKETS
STORED AS PARQUET;
INSERT OVERWRITE TABLE salary_p SELECT * FROM salary;
/* Join two tables */
SELECT  e.gender, AVG(s.salary) AS avg_salary
  FROM  employee_p AS e
        JOIN  salary_p as s
        ON (e.employee_id == s.employee_id)
  GROUP BY e.gender;

The join took only 90 seconds, much less than the 140 seconds it took without bucketing and partitioning.
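
One reason bucketing can help a join is that both tables are bucketed on the join key with the same bucket count, so bucket N of one table only needs to meet bucket N of the other. Whether Hive really chose such a plan here depends on its settings, which I did not check, so the sketch below only illustrates the idea in plain Java. BucketJoinSketch and the tiny sample rows are made up.

```java
import java.util.ArrayList;
import java.util.List;

class BucketJoinSketch {
    static final int BUCKETS = 8;

    // Split rows into buckets by their first field (the join key).
    static List<List<int[]>> bucketize(int[][] rows) {
        List<List<int[]>> buckets = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            buckets.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            buckets.get((row[0] & Integer.MAX_VALUE) % BUCKETS).add(row);
        }
        return buckets;
    }

    // Join bucket b of one table only with bucket b of the other table.
    static int bucketJoinCount(int[][] employees, int[][] salaries) {
        List<List<int[]>> eb = bucketize(employees);
        List<List<int[]>> sb = bucketize(salaries);
        int matches = 0;
        for (int b = 0; b < BUCKETS; b++) {
            for (int[] e : eb.get(b)) {
                for (int[] s : sb.get(b)) {
                    if (e[0] == s[0]) matches++;
                }
            }
        }
        return matches;
    }

    // Reference: compare every employee row with every salary row.
    static int fullJoinCount(int[][] employees, int[][] salaries) {
        int matches = 0;
        for (int[] e : employees) {
            for (int[] s : salaries) {
                if (e[0] == s[0]) matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[][] employees = {{1}, {2}, {9}, {10}};
        int[][] salaries = {{1, 100}, {1, 200}, {9, 300}, {3, 400}, {10, 500}};
        System.out.println(bucketJoinCount(employees, salaries)
            + " == " + fullJoinCount(employees, salaries));
    }
}
```

Both methods find the same matches, but the bucket-wise version compares far fewer row pairs when the tables are large.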

Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing. After putting the two files into HDFS, we just need to create the tables:

create external table employee (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    gender CHAR(1),
    work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/employee/';
create external table salary (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/salary/';

Now we can analyze the data.
Find the 10 oldest employees.

select * from employee order by birthday asc limit 10;

Find all the employees who joined the corporation in January 1990.

select * from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';

Find the 10 employees who earned the highest average salary. Notice that we use 'order by' here because 'sort by' only produces a local order within each reducer.

select e.first_name, e.family_name, avg(s.salary) as avg_salary from
    employee as e join salary as s on (e.employee_id == s.employee_id)
        group by e.first_name, e.family_name order by avg_salary desc limit 10;
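
The difference between 'sort by' and 'order by' can be simulated in plain Java. This is only a sketch: the two inner lists stand for the rows that two reducers happen to receive, and the numbers are invented. SortBySketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortBySketch {
    // 'sort by': every reducer sorts only its own rows, then the outputs are concatenated.
    static List<Integer> sortBy(List<List<Integer>> reducerInputs) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> input : reducerInputs) {
            List<Integer> local = new ArrayList<>(input);
            local.sort(Collections.reverseOrder());
            out.addAll(local);
        }
        return out;
    }

    // 'order by': all rows go through one total sort.
    static List<Integer> orderBy(List<List<Integer>> reducerInputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> input : reducerInputs) {
            all.addAll(input);
        }
        all.sort(Collections.reverseOrder());
        return all;
    }

    public static void main(String[] args) {
        List<List<Integer>> inputs = Arrays.asList(
            Arrays.asList(50, 90, 70),    // rows seen by reducer 0
            Arrays.asList(80, 60, 100));  // rows seen by reducer 1
        System.out.println("sort by:  " + sortBy(inputs));
        System.out.println("order by: " + orderBy(inputs));
    }
}
```

Taking the first rows of the 'sort by' output would miss 100, the highest value, because it sits in the second reducer's part.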

Let's find out whether this corporation discriminates by gender in salary:

  SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id == s.employee_id)
GROUP BY e.gender;

The result is:

F       63767.607741168045
M       63839.90097030445

Looks good 🙂
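
To put a number on "looks good", the relative gap between the two averages can be computed with a few lines of Java. The two values are copied from the result above. SalaryGapSketch and relativeGapPercent are names I made up for this sketch.

```java
class SalaryGapSketch {
    // Relative difference between two averages, in percent of the smaller one.
    static double relativeGapPercent(double a, double b) {
        return Math.abs(a - b) / Math.min(a, b) * 100.0;
    }

    public static void main(String[] args) {
        double female = 63767.607741168045; // average salary for F, from the query result
        double male = 63839.90097030445;    // average salary for M, from the query result
        System.out.println(String.format("gap: %.3f%%", relativeGapPercent(female, male)));
    }
}
```

The gap is a bit more than 0.11 percent, which is very small.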

Use hive to join two datasets

In the previous article, I wrote Java MapReduce code to join two datasets. I then enhanced the code to sort the scores for every student. The complete join-and-sort code is here. It takes more than 170 lines of Java code to join two tables and sort the result. But in a production environment, we usually use Hive to do the same work.
Using the same sample datasets:

-- create two external tables in Hive.
create external table student(id INT, name STRING)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/users/';
create external table course(id INT, course STRING, score TINYINT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/courses/';

Now we can join them:

select s.name, c.course, c.score from student as s
    join course as c on (s.id == c.id)
        order by s.name, c.score desc;

Just three lines of HQL (Hive Query Language), not 170 lines of Java code.
These two tables are very small, so we can let Hive run the task in local mode:

set hive.exec.mode.local.auto=true;

Some problems when programming MapReduce

1. After submitting the job, the console reports:

Error: java.lang.RuntimeException: readObject can't find class
        at org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit.readClass(TaggedInputSplit.java:136)
        at org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit.readFields(TaggedInputSplit.java:122)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:372)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:754)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class School$ScoreMapper not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit.readClass(TaggedInputSplit.java:134)
        ... 11 more

The reason is that I forgot to call setJarByClass():

job.setJarByClass(School.class);

2. When the job finished, I found that my reducer hadn’t run at all. The reason is that I hadn’t overridden the correct reduce() member function of Reducer, so the MapReduce framework ignored my function and ran the default reduce() instead, without any notification or warning. To make sure we override the correct member function of the parent class, we need to add the @Override annotation:

......
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
......
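
The effect is easy to reproduce without Hadoop. In the sketch below, BaseReducer is a hypothetical stand-in for a framework base class with a default reduce(). The wrong signature uses Iterator instead of Iterable, which is one common form of this mistake (an assumption, my original broken signature is not shown here).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class OverrideSketch {
    // Hypothetical stand-in for a framework base class with a default implementation.
    static class BaseReducer {
        String reduce(String key, Iterable<String> values) {
            return "default";
        }
    }

    // Wrong parameter type: this is an overload, not an override, and it compiles silently.
    static class WrongReducer extends BaseReducer {
        String reduce(String key, Iterator<String> values) {
            return "custom";
        }
    }

    // With @Override the compiler verifies that the signature matches the parent.
    static class RightReducer extends BaseReducer {
        @Override
        String reduce(String key, Iterable<String> values) {
            return "custom";
        }
    }

    // The "framework" only knows the base-class signature.
    static String run(BaseReducer reducer) {
        List<String> values = Arrays.asList("a", "b");
        return reducer.reduce("k", values);
    }

    public static void main(String[] args) {
        System.out.println(run(new WrongReducer())); // the default implementation runs
        System.out.println(run(new RightReducer())); // the custom implementation runs
    }
}
```

If @Override were added to WrongReducer.reduce(), the compiler would reject the class, which is exactly the warning I was missing.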

Use MapReduce to join two datasets

The two datasets are:

#users.txt (student id, name)
1,Robin Dong
2,Timi Yang
3,Olive Xu
4,Jenny Xu
5,Elsa Dong
6,Coly Wang
7,Hulk Li
8,Judy Lao
9,Kevin Liu
10,House Zhang

#scores.txt (student id, course, score)
1,Math,90
1,Physics,80
3,Music,70
5,Math,80
7,Geography,70
1,Geography,60
2,Physics,70
6,Math,70
4,Music,90
6,Geography,75
9,Geography,85
10,Music,95
2,Physics,78
2,Music,73
2,Math,84
4,Math,61
4,Physics,65
5,Music,66
5,Math,90

To join the two tables above by “student id”, we need to use MultipleInputs. The code is:

import java.io.IOException;
import java.util.Iterator;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class School {
  public static class UserMapper
    extends Mapper<LongWritable, Text, Text, Text> {
    private String uid, name;
    @Override
    public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
      String line = value.toString();
      String arr[] = line.split(",");
      uid = arr[0].trim();
      name = arr[1].trim();
      context.write(new Text(uid), new Text(name));
    }
  }
  public static class ScoreMapper
      extends Mapper<LongWritable, Text, Text, Text> {
      private String uid, course, score;
      @Override
      public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
        String line = value.toString();
        String arr[] = line.split(",");
        uid = arr[0].trim();
        course = arr[1].trim();
        score = arr[2].trim();
        context.write(new Text(uid), new Text(course + "," + score));
      }
  }
  public static class InnerJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
      String name = "";
      List<String> courses = new ArrayList<String>();
      List<String> scores = new ArrayList<String>();
      for (Text value : values) {
        String cur = value.toString();
        if (cur.contains(",")) {
          String arr[] = cur.split(",");
          courses.add(arr[0]);
          scores.add(arr[1]);
        } else {
          name = cur;
        }
      }
      if (!name.isEmpty() && !courses.isEmpty() && !scores.isEmpty()) {
        for (int i = 0; i < courses.size(); i++) {
          context.write(new Text(name), new Text(courses.get(i) + "," + scores.get(i)));
        }
      }
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "School");
    job.setJarByClass(School.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ScoreMapper.class);
    job.setReducerClass(InnerJoinReducer.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compile and run it:

~/hadoop-2.7.2/bin/hadoop com.sun.tools.javac.Main School.java -Xlint:unchecked
jar cf school.jar School*.class
bin/hadoop jar ~/school.jar School /users.txt /scores.txt /my

And the result in /my is:

Robin Dong      Geography,60
Robin Dong      Physics,80
Robin Dong      Math,90
House Zhang     Music,95
Timi Yang       Physics,70
Timi Yang       Math,84
Timi Yang       Music,73
Timi Yang       Physics,78
Olive Xu        Music,70
Jenny Xu        Physics,65
Jenny Xu        Math,61
Jenny Xu        Music,90
Elsa Dong       Math,90
Elsa Dong       Music,66
Elsa Dong       Math,80
Coly Wang       Geography,75
Coly Wang       Math,70
Hulk Li         Geography,70
Kevin Liu       Geography,85
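
The same reduce-side join can be tried without a cluster. The sketch below is a plain Java simulation, not the Hadoop job itself: a TreeMap plays the role of the shuffle that groups values by student id, and the loop at the end applies the same rule as the reducer (a value containing a comma is a course record, otherwise it is the name). ReduceSideJoinSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {
    static List<String> join(String[] users, String[] scores) {
        // "Map + shuffle": both inputs emit (student id, value), grouped by id.
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (String line : users) {
            String[] arr = line.split(",");
            grouped.computeIfAbsent(Integer.parseInt(arr[0].trim()), k -> new ArrayList<>())
                   .add(arr[1].trim());
        }
        for (String line : scores) {
            String[] arr = line.split(",");
            grouped.computeIfAbsent(Integer.parseInt(arr[0].trim()), k -> new ArrayList<>())
                   .add(arr[1].trim() + "," + arr[2].trim());
        }
        // "Reduce": separate the name from the course records, then emit the pairs.
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            String name = "";
            List<String> courseScores = new ArrayList<>();
            for (String v : values) {
                if (v.contains(",")) {
                    courseScores.add(v);
                } else {
                    name = v;
                }
            }
            if (!name.isEmpty()) {
                for (String cs : courseScores) {
                    out.add(name + "\t" + cs);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] users = {"1,Robin Dong", "8,Judy Lao", "3,Olive Xu"};
        String[] scores = {"1,Math,90", "3,Music,70", "1,Physics,80"};
        for (String row : join(users, scores)) {
            System.out.println(row);
        }
    }
}
```

Judy Lao has no score records, so she does not appear in the output. This is the inner-join behavior that also explains why she is missing from the result above.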

Use MapReduce to find prime numbers

I just wanted to write a small Hadoop MapReduce example that finds prime numbers. The first question is: how can I generate the numbers from 1 to 1000000 in my own application instead of reading them from a file in HDFS? The answer is: implement your own InputSplit, RecordReader, and InputFormat, just like the teragen program does.
Then comes the second question: can I use only a mapper, without a reducer stage? The answer is yes: simply call job.setNumReduceTasks(0) to disable the reducer stage.
The complete code is here (I know the algorithm for checking whether a number is prime is naive, but it works):

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CalcPrime {
  public static final String SPLITS_NUM = "calcprime.splits.num";
  public static final String MAX_RANGE = "calcprime.range.max";
  public static final long DEFAULT_SLITS = 200;
  public static final long DEFAULT_MAX = 10000;
  public static class NumberInputFormat
      extends InputFormat<LongWritable, NullWritable> {
      static class NumberInputSplit extends InputSplit implements Writable {
        long first;
        long count;
        public NumberInputSplit() {}
        public NumberInputSplit(long offset, long length) {
          first = offset;
          count = length;
        }
        public long getLength() throws IOException {
          return 0;
        }
        public String[] getLocations() throws IOException {
          return new String[]{};
        }
        public void readFields(DataInput in) throws IOException {
          first = WritableUtils.readVLong(in);
          count = WritableUtils.readVLong(in);
        }
        public void write(DataOutput out) throws IOException {
          WritableUtils.writeVLong(out, first);
          WritableUtils.writeVLong(out, count);
        }
      }
      static class NumberRecordReader
          extends RecordReader<LongWritable, NullWritable> {
          long first;
          long count;
          long current;
          public NumberRecordReader() {}
          public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
            first = ((NumberInputSplit)split).first;
            count = ((NumberInputSplit)split).count;
            current = first;
          }
          public void close() throws IOException {}
          public LongWritable getCurrentKey() {
            return new LongWritable(current);
          }
          public NullWritable getCurrentValue() {
            return NullWritable.get();
          }
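          public float getProgress() throws IOException {
            // Progress is the fraction of this split already read, so subtract the offset.
            return count == 0 ? 1.0f : (current - first) / (float) count;
          }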
          public boolean nextKeyValue() {
            if (current >= (count + first)) {
              return false;
            }
            current++;
            return true;
          }
      }
      public RecordReader<LongWritable, NullWritable>
        createRecordReader(InputSplit split, TaskAttemptContext context)
        throws IOException {
          return new NumberRecordReader();
        }
      public List<InputSplit> getSplits(JobContext job) {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        long splitsNum = getSplitsNum(job);
        long maxRange = getMaxRange(job);
        for (int start = 0; start < splitsNum; ++start) {
          splits.add(new NumberInputSplit(start * maxRange, maxRange));
        }
        return splits;
      }
      public long getSplitsNum(JobContext job) {
        return job.getConfiguration().getLong(SPLITS_NUM, DEFAULT_SLITS);
      }
      public long getMaxRange(JobContext job) {
        return job.getConfiguration().getLong(MAX_RANGE, DEFAULT_MAX);
      }
  }
  public static class NumberMapper
    extends Mapper<LongWritable, NullWritable, LongWritable, NullWritable> {
    @Override
    public void map(LongWritable key, NullWritable value, Context context)
            throws IOException, InterruptedException {
            long lkey = key.get();
            if (lkey == 1) {
              return;
            }
            if (lkey == 2 || lkey == 3) {
              context.write(key, value);
              return;
            }
            long end = lkey / 2;
            for (int i = 2; i <= end; i++) {
              if (lkey % i == 0) {
                return;
              }
            }
            context.write(key, value);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Calc Prime");
    long splitsNum = DEFAULT_SLITS;
    long maxRange = DEFAULT_MAX;
    if (args.length > 1) {
      splitsNum = Long.parseLong(args[1]);
    }
    if (args.length > 2) {
      maxRange = Long.parseLong(args[2]);
    }
    job.getConfiguration().setLong(SPLITS_NUM, splitsNum);
    job.getConfiguration().setLong(MAX_RANGE, maxRange);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.setJarByClass(CalcPrime.class);
    job.setMapperClass(NumberMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(NullWritable.class);
    job.setInputFormatClass(NumberInputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
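
The prime check in the mapper tries every divisor up to half of the number. A less naive variant, sketched below in plain Java, only tries odd divisors up to the square root. This is a swapped-in technique for comparison, not what the job above runs, and PrimeCheckSketch is a hypothetical name.

```java
class PrimeCheckSketch {
    // Trial division by odd numbers up to sqrt(n).
    static boolean isPrime(long n) {
        if (n < 2) {
            return false;
        }
        if (n < 4) {
            return true; // 2 and 3
        }
        if (n % 2 == 0) {
            return false;
        }
        for (long i = 3; i * i <= n; i += 2) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    static int countPrimesUpTo(long max) {
        int count = 0;
        for (long n = 1; n <= max; n++) {
            if (isPrime(n)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("primes up to 100: " + countPrimesUpTo(100));
    }
}
```

For a number around 1000000 this needs about 500 divisions instead of about 500000, so the same idea would make each mapper much cheaper.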

Copy the code to file CalcPrime.java, compile and run it:

/usr/local/hadoop-2.7.2/bin/hadoop com.sun.tools.javac.Main CalcPrime.java
jar cf prime.jar CalcPrime*.class
# The number of mapper tasks is 400, and every mapper processes 1000000 numbers
/usr/local/hadoop-2.7.2/bin/hadoop jar ~/prime.jar CalcPrime /result 400 1000000
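
To see which numbers each mapper receives, the split and reader logic can be replayed in plain Java. The sketch mirrors the code above: split i starts at i * maxRange, and the reader increments its counter before handing out a key. SplitRangeSketch is a hypothetical name, and the small range of 5 is only for readability.

```java
class SplitRangeSketch {
    // Mirrors the reader: current starts at first and is incremented before each key is read.
    static long[] keysOfSplit(long first, long count) {
        long[] keys = new long[(int) count];
        long current = first;
        int n = 0;
        while (current < first + count) {
            current++;
            keys[n++] = current;
        }
        return keys;
    }

    public static void main(String[] args) {
        long maxRange = 5;
        for (int split = 0; split < 3; split++) {
            long[] keys = keysOfSplit(split * maxRange, maxRange);
            System.out.println("split " + split + ": " + keys[0] + " .. " + keys[keys.length - 1]);
        }
    }
}
```

So every split covers the numbers from first + 1 to first + count, and the splits do not overlap. With the arguments 400 and 1000000 used above, the job checks the numbers from 1 to 400000000.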