I find two datasets: employee and salary for learning and practicing. After putting two files into HDFS, we just need to create tables:
create external table employee (
employee_id INT,
birthday DATE,
first_name STRING,
family_name STRING,
gender CHAR(1),
work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "'"
)
stored as textfile
location '/employee/';
create external table salary (
employee_id INT,
salary INT,
start_date DATE,
end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "'"
)
stored as textfile
location '/salary/';
Now we could analyze the data.
Find the oldest 10 employees.
select * from employee order by birthday asc limit 10;
Find all the employees joined the corporation in January 1990.
select * from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';
Find the top 10 employees earned the highest average salary. Notice we use 'order by' here because 'sort by' only produce local order in reducer.
select e.first_name, e.family_name, avg(s.salary) as avg_salary from
employee as e join salary as s on (e.employee_id == s.employee_id)
group by e.first_name, e.family_name order by avg_salary limit 10;
Let's find out whether this corporation has sex discrimination:
SELECT e.gender, AVG(s.salary) AS avg_salary
FROM employee AS e
JOIN salary AS s
ON (e.employee_id == s.employee_id)
GROUP BY e.gender;
The result is:
F 63767.607741168045 M 63839.90097030445
Looks good 🙂