Use Hive to join two datasets

In the previous article, I wrote Java MapReduce code to join two datasets, and then enhanced it to sort the scores of every student. The complete join-and-sort code is here. It takes more than 170 lines of Java code to join two tables and sort the result. In a production environment, however, we usually use Hive to do the same work.
Using the same sample datasets:

-- Create two external tables in Hive.
create external table student(id INT, name STRING)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/users/';
create external table course(id INT, course STRING, score TINYINT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/courses/';
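
Before joining, it can help to confirm that Hive parses the files as expected. The following is a minimal sketch, assuming the comma-separated files are already in the `/users/` and `/courses/` directories named above:

```sql
-- Show the column names and types Hive recorded for each table.
describe student;
describe course;

-- Peek at a few rows; a field that cannot be parsed as its declared type shows up as NULL.
select * from student limit 5;
select * from course limit 5;
```

Because both tables are external, dropping them later removes only the table metadata; the files under their locations stay in place.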

Now we can join them:

select s.name, c.course, c.score from student as s
    join course as c on (s.id == c.id)
        sort by s.name, c.score desc;

That is just three lines of HQL (Hive Query Language) instead of 170 lines of Java code.
These two tables are very small, so we can let Hive run the job in local mode:

set hive.exec.mode.local.auto=true;
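
For reference, the whole session can be sketched as one short script. This is a minimal sketch, assuming the `student` and `course` tables defined earlier already exist. It differs from the query above in two ways that are my own substitutions, not the original query: it uses `order by` in place of `sort by`, and it adds column aliases (`name`, `course`, `score`) so the ordering clause can refer to them. In Hive, `sort by` orders rows only within each reducer, so the result is totally ordered only when a single reducer runs; `order by` guarantees one globally ordered result, at the cost of sending the final sort through a single reducer.

```sql
-- Let Hive decide automatically to run small jobs locally.
set hive.exec.mode.local.auto=true;

-- Variant of the join above: ORDER BY returns one globally sorted result,
-- students by name and, for each student, scores from highest to lowest.
select s.name as name, c.course as course, c.score as score
from student s
join course c on (s.id = c.id)
order by name, score desc;
```

For tables this small the difference is unlikely to be visible, since such a job normally runs with one reducer anyway; it starts to matter once the data grows and Hive uses several reducers.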

Robin on Linux
