Monthly Archives: April 2017

Example datasets for Amazon RedShift

Last year, I imported two datasets to Hive. Currently, I will load two these two datasets into Amazon RedShift instead.
After created a RedShift Cluster in my VPC, I couldn’t connect to it even with Elastic IP. Then I check the parameters of my VPC between AWS’s default VPC, and eventually saw the vital differences. First, set “Network ACL” in “VPC” of AWS:

Then, add rule in “Route table”, which let node to access Anywhere(0.0.0.0/0) through “Internet Gateway” (also created in “VPC” service):

Now I could connect to my RedShift cluster.
Create s3 bucket by AWS Cli:

aws s3 mb s3://robin-data-023 --region us-west-2

Upload two csv files into bucekt:

aws s3 cp salaries.csv s3://robin-data-023/
aws s3 cp employees.csv s3://robin-data-023/

Create tables in Redshift by using SQL-Bench:

create table employee (
employee_id INTEGER primary key distkey,
birthday DATE sortkey,
first_name VARCHAR(64),
family_name VARCHAR(64),
gender CHAR(1),
work_day DATE
);
create table salary (
employee_id INTEGER primary key distkey,
salary INTEGER,
start_date DATE sortkey,
end_date DATE
);

Don’t put blank space or tab(‘\t’) before column name when creating table. or else Redshift will consider column name as
” employee_id”
” salary”
…

Load data from s3 to RedShift by COPY, the powerful tool for ETL in AWS.

copy employee
from 's3://robin-data-023/employees.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';
copy salary
from 's3://robin-data-023/salaries.csv'
iam_role 'arn:aws:iam::589631040421:role/fullRedshift'
csv quote as '\'';

We could see the success report like this:

Warnings:
Load into table 'employee' completed, 300024 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 21.84s
Warnings:
Load into table 'salary' completed, 2819810 record(s) loaded successfully.
0 rows affected
COPY executed successfully
Execution time: 19.66s

There are “Warnings” but “successfully”, a little weird. But don’t worry, it’s ok for SQL-Bench.
Currently we could run this script which was wrote last year (But need to change ‘==’ to ‘=’ for compatible problem):

SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employee AS e
          JOIN salary AS s
            ON (e.employee_id = s.employee_id)
GROUP BY e.gender;

The result is

Enable audit log for AWS Redshift

When I was trying to enable the Audit Log for AWS Redshift, I chose to use a exists bucket in S3. But it reports error:

"Cannot read ACLs of bucket redshift-robin. Please ensure that your IAM permissions are set up correctly."
"Service: AmazonRedshift; Status Code: 400; Error Code: InsufficientS3BucketPolicyFault ...."

According to this document, I need to change permission of bucket “redshift-robin”. So I entered the AWS Console of S3, click bucket name of “redshift-robin” in left panel, and saw description of permissions:

Press “Add Bucket Policy”, and in the pop-out-window, press “AWS Policy Generator”. Here came the generator, which is easy to use for creating policy.
Add two policy for “redshift-robin”:

The “902366379725” is the account-id of us-west-2 region (Oregon)

Click “Generate Policy”, and copy the generated JSON to “Bucket Policy Editor”:

Press “Save”. Now, we could enable Audit Log of Redshift for bucket “redshift-robin”:

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: In-Datacenter Performance Analysis of a Tensor Processing Unit”
Application
Using floating point (16bit or 32bit) for NN (Neural Network) training, then a step called quantization transforms floating-point numbers into narrow integers–often just 8 bits–which are usually good enough for inference.
MLP(Multi-layer Perceptions), CNN(Convolutional Neural Netowrks), and RNN(Recurrent Neural Networks), these three types of NN represent 95% of NN inference workload in Google datacenter. Therefore, the TPU mainly focus on them.

As we can see, CNNs are usually dense-computing NN, which are better for TPU.

TPU has 25 times as many MACs (Multiply and Accumulate) and 3.5 times as much on-chip memory as the K80 GPU.
Architecture
The TPU was designed to be a coprocessor on the PCIe I/O bus, more like FPU(floating-poin unit) than it is to a GPU.

The parameters of NN model (weights) comes from off-chip memory (8G DDR3 DRAM) to Weight FIFO, and then flow into MMU(Matrix Multiply Unit). The request (sample need to be inference) comes from PCIe to Unified Buffer, and also flow into MMU finally.
Even the “Activation” and “Pooling” algorithm in CNN have been fixed into hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.

According to this Floor Plan, we can imaging that UB and MMU might cost most energy of TPU.

TPU instructions follow the CISC tradition and only has about a dozen instructions, include “Read_Host_Memory”, “Read_Weights”, “MatrixMultiply”, “Activate” etc. Recalling how many codes we need to write to implement a effective Activation function, then we could conceive the speed of using only one “Activate” instruction in TPU.
This paper said TPU is a type of Systolic Array. But what is Systolic Array? Here is the explain: A systolic array is a network of processors that rhythmically compute and pass data through the system.
Performance
There are lot of tables and diagrams which show the top-rate performance of TPU. Although the TPU is fast, it also depend on the computing-density of applications. The CNNs are most computing-dense NN, so it gains most speed(or TeraOps per second) from TPU:

In this paper, it didn’t explain why the GPU is slower than TPU in inference operation. The only sentence about this topic is in “8 Discussion”: “GPUs have traditionally been seen as high-throughput architectures that reply on high-bandwidth DRAM and thousands of threads to achieve their goals”. Actually, I think this is not a serious explain.
The interesting thing is, after Google publish this paper, the CEO of Nvidia – Jensen Huang, wrote a blog to gently appeal a fact: the state-of-the-art GPU (Tesla P40) can inference faster than TPU. The war between different giants of Deep learning is just beginning.

Google TPU from Hao(Robin) Dong

Using antlr3 to generate C++ code

Need to parse SQL query to C++ code in project, so I had to learn antlr these days.
Let’s write a small sample file “Calc.g” for antlr3:

grammar Calc;
options {
    language = Cpp;
}
@lexer::header {
    #include 
    #include 
    #include 
    using namespace std;
}
@lexer::traits {
  class CalcLexer;
  class CalcParser;
  typedef antlr3::Traits CalcLexerTraits;
}
@parser::header {
  #include "./CalcLexer.hpp"
}
@parser::traits {
  typedef CalcLexerTraits CalcParserTraits;
}
@parser::context {
  typedef std::map StringMap;
  StringMap var_map;
}
prog:   (stat)+ ;
stat:   expr NEWLINE        { cout << $expr.value << endl; }
    |   ID '=' expr NEWLINE { string id = $ID.text;
                              var_map[id] = $expr.value;
                            }
    |   NEWLINE
    ;
expr returns [int value]
    :   a=multiExpr {$value = $a.value;}
        ( '+' b=multiExpr {$value += $b.value;}
        | '-' c=multiExpr {$value -= $c.value;}
        )*
    ;
multiExpr returns [int value]
    :   a=atom {$value = $a.value;}
        ('*' b=atom {$value *= $b.value;}
        )*
    ;
atom returns [int value]
    :   INT { $value = atoi($INT.text.c_str()); }
    |   ID  { int v = var_map[$ID.text];
              $value = v;
            }
    |   '(' expr ')' {$value = $expr.value;}
    ;
ID  :   ('a'..'z')+ ;
INT :   ('0'..'9')+ ;
NEWLINE: '\r'? '\n' ;
WS  :   (' '|'\t'|'\n'|'\r')+ {skip();} ;

Then add "antlr-3.5.2-complete.jar" (run "mvn package" on source code path of antlr3 will generate this jar) to CLASSPATH and run:

java org.antlr.Tool Calc.g

It will generate many code files: CalcLexer.[hpp/cpp], CalcParser.[hpp/cpp], Calc.tokens. Now we could compile all generated C++ codes:

g++ CalcLexer.cpp CalcParser.cpp Test.cpp -I$(ANTLR3_SRC_PATH)/runtime/Cpp/include/ -I./ -o Test

("ANTLR3_SRC_PATH" is where the antlr3 source code are
But this step lead compiler errors for g++:

CalcParser.cpp: In member function ‘void CalcParser::stat()’:
CalcParser.cpp:436:67: error: invalid conversion from ‘const CommonTokenType* {aka const antlr3::CommonToken >*}’ to ‘antlr3::Traits::CommonTokenType* {aka antlr3::CommonToken >*}’ [-fpermissive]
                  ID2 =  this->matchToken(ID, &FOLLOW_ID_in_stat106);
                                                                   ^
CalcParser.cpp: In member function ‘int CalcParser::atom()’:
CalcParser.cpp:836:70: error: invalid conversion from ‘const CommonTokenType* {aka const antlr3::CommonToken >*}’ to ‘antlr3::Traits::CommonTokenType* {aka antlr3::CommonToken >*}’ [-fpermissive]
                  INT4 =  this->matchToken(INT, &FOLLOW_INT_in_atom276);
                                                                      ^
CalcParser.cpp:854:67: error: invalid conversion from ‘const CommonTokenType* {aka const antlr3::CommonToken >*}’ to ‘antlr3::Traits::CommonTokenType* {aka antlr3::CommonToken >*}’ [-fpermissive]
                  ID5 =  this->matchToken(ID, &FOLLOW_ID_in_atom288);
                                                                   ^

The reason is 'ID' and 'INT' should be declare as "const CommonTokenType*", rather than "CommonTokenType*". The fix has already be commited to antlr3 and be contained in master branch of git tree. Therefore I checkout the master branch of antlr3 instead of "3.5.2" tag, re-package the jar of antlr3, re-generate the code for "Calc.g", and the compiler errors disappeared.
The target executable file is "Test", now we can use it to parse our "code":

echo "a=3
b=4
3+a*b" | ./Test
15

The result '15' is correct.

Robin on Linux

Monthly Archives: April 2017