Example datasets for learning Hive

I found two datasets, employee and salary, for learning and practicing Hive. After putting the two files into HDFS, we only need to create the tables.
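The table definitions might look roughly like the sketch below. The column names, types, field delimiter, and HDFS directories are assumptions made for illustration, so adjust them to match the actual files.

```sql
-- Assumed layout: comma-separated text files, one directory per table.
-- All column names and paths here are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS employee (
  employee_id INT,
  birthday    DATE,
  first_name  STRING,
  family_name STRING,
  gender      STRING,
  work_day    DATE        -- the day the employee joined
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/employee';

CREATE EXTERNAL TABLE IF NOT EXISTS salary (
  employee_id INT,
  salary      INT,
  start_date  DATE,
  end_date    DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/salary';
```

An external table leaves the files where they are in HDFS, so dropping the table does not delete the data.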

Now we can analyze the data.

Find the 10 oldest employees.
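A minimal sketch, assuming the employee table has a `birthday` column of type DATE (the column names are hypothetical):

```sql
-- The earliest birthdays belong to the oldest employees.
SELECT employee_id, first_name, family_name, birthday
FROM employee
ORDER BY birthday ASC
LIMIT 10;
```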

Find all the employees who joined the corporation in January 1990.
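One way to write this, assuming the join date is stored in a hypothetical `work_day` column, is to filter with Hive's built-in `year()` and `month()` functions:

```sql
-- work_day is an assumed column name for the day the employee joined.
SELECT employee_id, first_name, family_name, work_day
FROM employee
WHERE year(work_day) = 1990
  AND month(work_day) = 1;
```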

Find the 10 employees with the highest average salary. Notice that we use ‘order by’ here because ‘sort by’ only orders the rows within each reducer, so it does not give a total order over the whole result.
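A sketch of such a query, assuming both tables share an `employee_id` key and the salary table has a numeric `salary` column (these names are assumptions):

```sql
-- Average each employee's salary records, then rank globally.
SELECT e.employee_id,
       e.first_name,
       e.family_name,
       AVG(s.salary) AS avg_salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id
GROUP BY e.employee_id, e.first_name, e.family_name
ORDER BY avg_salary DESC   -- total order across all reducers
LIMIT 10;
```

Replacing ORDER BY with SORT BY would sort each reducer's output separately, so with more than one reducer the first 10 rows would not necessarily be the global top 10.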

Let’s find out whether this corporation pays men and women differently.
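A simple check is to compare the average salary per gender. The sketch below assumes a hypothetical `gender` column in the employee table and the same `employee_id` join key as before:

```sql
-- One row per gender value, with the average over all salary records.
SELECT e.gender,
       AVG(s.salary) AS avg_salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id
GROUP BY e.gender;
```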

The result is:

Looks good 🙂

