Example datasets for Amazon Redshift

Last year, I imported two datasets into Hive. This time, I will load these two datasets into Amazon Redshift instead.
After creating a Redshift cluster in my VPC, I couldn’t connect to it even with an Elastic IP. I then compared the parameters of my VPC against AWS’s default VPC and eventually spotted the vital differences. First, set the “Network ACL” in the “VPC” service:

Then, add a rule in the “Route table” that lets the node access Anywhere through the “Internet Gateway” (also created in the “VPC” service):
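The same route can also be added from the AWS CLI; a sketch, where the route table and internet gateway IDs are placeholders for the ones in your VPC:

```shell
# Route all traffic ("Anywhere", 0.0.0.0/0) through the internet gateway.
# The rtb-/igw- IDs below are placeholders.
aws ec2 create-route \
    --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 0.0.0.0/0 \
    --gateway-id igw-0123456789abcdef0
```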

Now I could connect to my RedShift cluster.
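For reference, a Redshift cluster accepts any PostgreSQL-compatible client; a sketch with psql, where the endpoint, user, and database names are placeholders:

```shell
# Redshift listens on port 5439 by default; the endpoint below is a placeholder.
psql -h mycluster.abc123xyz.us-east-1.redshift.amazonaws.com \
     -p 5439 -U awsuser -d dev
```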

Create an S3 bucket with the AWS CLI:
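A minimal sketch; the bucket name and region are placeholders (bucket names must be globally unique):

```shell
# Create a new S3 bucket in the chosen region.
aws s3 mb s3://my-redshift-example-bucket --region us-east-1
```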

Upload the two CSV files into the bucket:
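Something like the following, assuming hypothetical file names (the article only mentions employee_id and salary columns, so the actual file names may differ):

```shell
# Copy the local CSV files into the bucket created above.
aws s3 cp employee.csv s3://my-redshift-example-bucket/
aws s3 cp salary.csv s3://my-redshift-example-bucket/
```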

Create tables in Redshift by using SQL-Bench:
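The DDL looked roughly like this; only employee_id and salary appear in the article, so the other column names and types are illustrative:

```sql
-- Illustrative schemas; adjust column types to match the CSV files.
CREATE TABLE employee (
    employee_id INTEGER NOT NULL,
    name        VARCHAR(64)
);

CREATE TABLE salary (
    employee_id INTEGER NOT NULL,
    salary      DECIMAL(10,2)
);
```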

Don’t put a blank space or tab (‘\t’) before a column name when creating a table, or else Redshift will treat the whitespace as part of the column name:
”     employee_id”
”     salary”

Load data from S3 into Redshift with COPY, the powerful ETL tool in AWS.
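A typical COPY invocation looks like this; the bucket path and IAM role ARN are placeholders:

```sql
-- Load a CSV file from S3 into the employee table.
-- The S3 path and IAM role ARN below are placeholders.
COPY employee
FROM 's3://my-redshift-example-bucket/employee.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS CSV;
```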

We can see a success report like this:

There are “Warnings” even though it says “successfully”, which is a little weird. But don’t worry, it’s OK for SQL-Bench.

Now we can run the script that was written last year (but ‘==’ needs to be changed to ‘=’ because of a compatibility problem):
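The incompatibility is just the equality operator: Hive accepts the non-standard ‘==’ in comparisons, while Redshift only takes standard SQL ‘=’. The change is along these lines (the table and column names here are illustrative, not the actual script):

```sql
-- Hive tolerated:  ... ON e.employee_id == s.employee_id
-- Redshift needs standard SQL equality:
SELECT e.employee_id, s.salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id;
```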

The result is:
