Example datasets for learning Hive

I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:

create external table employee (
    employee_id INT,
    birthday DATE,
    first_name STRING,
    family_name STRING,
    gender CHAR(1),
    work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/employee/';
create external table salary (
    employee_id INT,
    salary INT,
    start_date DATE,
    end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar"     = "'"
)
stored as textfile
location '/salary/';

Now we could analyze the data.
Find the oldest 10 employees.

select * from employee order by birthday asc limit 10;

Find all the employees joined the corporation in January 1990.

select * from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';

Find the top 10 employees earned the highest average salary. Notice we use 'order by' here because 'sort by' only produce local order in reducer.

select e.first_name, e.family_name, avg(s.salary) as avg_salary from
    employee as e join salary as s on (e.employee_id == s.employee_id)
        group by e.first_name, e.family_name order by avg_salary limit 10;

Let's find out whether this corporation has sex discrimination:

  SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id == s.employee_id)
GROUP BY e.gender;

The result is:

F       63767.607741168045
M       63839.90097030445

Looks good 🙂

Robin on Linux

Example datasets for learning Hive

Leave a Reply Cancel reply

Robin on Linux

Related Posts

Leave a Reply Cancel reply