Last year, I imported two datasets to Hive. Currently, I will load two these two datasets into Amazon RedShift instead.
After created a RedShift Cluster in my VPC, I couldn’t connect to it even with Elastic IP. Then I check the parameters of my VPC between AWS’s default VPC, and eventually saw the vital differences. First, set “Network ACL” in “VPC” of AWS:
Then, add rule in “Route table”, which let node to access Anywhere(0.0.0.0/0) through “Internet Gateway” (also created in “VPC” service):
Now I could connect to my RedShift cluster.
Create s3 bucket by AWS Cli:
aws s3 mb s3://robin-data-023 --region us-west-2
Upload two csv files into bucekt:
aws s3 cp salaries.csv s3://robin-data-023/ aws s3 cp employees.csv s3://robin-data-023/
Create tables in Redshift by using SQL-Bench:
create table employee (
employee_id INTEGER primary key distkey,
birthday DATE sortkey,
first_name VARCHAR(64),
family_name VARCHAR(64),
gender CHAR(1),
work_day DATE
);
create table salary (
employee_id INTEGER primary key distkey,
salary INTEGER,
start_date DATE sortkey,
end_date DATE
);
Don’t put blank space or tab(‘\t’) before column name when creating table. or else Redshift will consider column name as
” employee_id”
” salary”
…
Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.
copy employee from 's3://robin-data-023/employees.csv' iam_role 'arn:aws:iam::589631040421:role/fullRedshift' csv quote as '\''; copy salary from 's3://robin-data-023/salaries.csv' iam_role 'arn:aws:iam::589631040421:role/fullRedshift' csv quote as '\'';
We could see the success report like this:
Warnings: Load into table 'employee' completed, 300024 record(s) loaded successfully. 0 rows affected COPY executed successfully Execution time: 21.84s Warnings: Load into table 'salary' completed, 2819810 record(s) loaded successfully. 0 rows affected COPY executed successfully Execution time: 19.66s
There are “Warnings” but “successfully”, a little weird. But don’t worry, it’s ok for SQL-Bench.
Currently we could run this script which was wrote last year (But need to change ‘==’ to ‘=’ for compatible problem):
SELECT e.gender, AVG(s.salary) AS avg_salary
FROM employee AS e
JOIN salary AS s
ON (e.employee_id = s.employee_id)
GROUP BY e.gender;
The result is