Last year, I imported two datasets to Hive. Currently, I will load two these two datasets into Amazon RedShift instead.
After created a RedShift Cluster in my VPC, I couldn’t connect to it even with Elastic IP. Then I check the parameters of my VPC between AWS’s default VPC, and eventually saw the vital differences. First, set “Network ACL” in “VPC” of AWS:
Then, add rule in “Route table”, which let node to access Anywhere(0.0.0.0/0) through “Internet Gateway” (also created in “VPC” service):
Now I could connect to my RedShift cluster.
Create s3 bucket by AWS Cli:
aws s3 mb s3://robin-data-023 --region us-west-2
Upload two csv files into bucekt:
aws s3 cp salaries.csv s3://robin-data-023/ aws s3 cp employees.csv s3://robin-data-023/
Create tables in Redshift by using SQL-Bench:
create table employee ( employee_id INTEGER primary key distkey, birthday DATE sortkey, first_name VARCHAR(64), family_name VARCHAR(64), gender CHAR(1), work_day DATE ); create table salary ( employee_id INTEGER primary key distkey, salary INTEGER, start_date DATE sortkey, end_date DATE );
Don’t put blank space or tab(‘\t’) before column name when creating table. or else Redshift will consider column name as
Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.
copy employee from 's3://robin-data-023/employees.csv' iam_role 'arn:aws:iam::589631040421:role/fullRedshift' csv quote as '\''; copy salary from 's3://robin-data-023/salaries.csv' iam_role 'arn:aws:iam::589631040421:role/fullRedshift' csv quote as '\'';
We could see the success report like this:
Warnings: Load into table 'employee' completed, 300024 record(s) loaded successfully. 0 rows affected COPY executed successfully Execution time: 21.84s Warnings: Load into table 'salary' completed, 2819810 record(s) loaded successfully. 0 rows affected COPY executed successfully Execution time: 19.66s
There are “Warnings” but “successfully”, a little weird. But don’t worry, it’s ok for SQL-Bench.
Currently we could run this script which was wrote last year (But need to change ‘==’ to ‘=’ for compatible problem):
SELECT e.gender, AVG(s.salary) AS avg_salary FROM employee AS e JOIN salary AS s ON (e.employee_id = s.employee_id) GROUP BY e.gender;
The result is