A basic example of using Tensorflow to regress

In theory of Deep Learning, even a network with single hidden layer could represent any function of mathematics. To verify it, I write a Tensorflow example as below:

In this code, it was trying to regress to a number from its own sine-value and cosine-value.
At first running, the loss didn’t change at all. After I changed learning rate from 1e-3 to 1e-5, the loss slowly went down as normal. I think this is why someone call Deep Learning a “Black Magic” in Machine Learning area.

Fix Resnet-101 model in example of MXNET

SSD(Single Shot MultiBox Detector) is the fastest method in object-detection task (Another detector YOLO, is a little bit slower than SSD). In the source code of MXNET,there is an example for SSD implementation. I test it by using different models: inceptionv3, resnet-50, resnet-101 etc. and find a weird phenomenon: the size .params file generated by resnet-101 is smaller than resnet-50.

Model Size of .params file
resnet-50 119MB
resnet-101 69MB

Since deeper network have larger number of parameters, resnet-101 has smaller file size for parameters seems suspicious.

Reviewing the code of example/ssd/symbol/symbol_factory.py:

Why resnet-50 and resnet-101 has the same ‘from_layers’ ? Let’s check these two models:

In resnet-50, the SSD use two layers (as show in red line) to extract features. One from output of stage-3, another from output of stage-4. In resnet-101, it should be the same (as show in blue line), but it incorrectly copy the config code of resnet-50. The correct ‘from_layers’ for resnet-100 is:

This seems like a bug, so I create a pull request to try fixing it.

“Eager Mode” in Tensorflow

Although Tensorflow is the most popular Deep Learning Framework in 2016, Pytorch, a smaller new framework developed by FAIR(Facebook AI Research), become a dark horse this year. Pytorch supports Dynamic Graph Computing, which means you can freely add or remove layers in your model at runtime. It makes developer or scientist build new models more rapidly.
To fight back Pytorch, Tensorflow team add a new mechanism named “Eager Mode”, in which we could also use Dynamic Graph Computing. The example of “Eager Mode” looks like:

As above, unlike traditional Tensorflow application that use “Session.run()” to execute whole graph, developers could see values and gradients of variables in any layer at any step.

How did Tensorflow do it? Actually, the tricks behind the API is not difficult. Take the most common Operation ‘matmul’ as example:

Le’t look into “gen_math_ops._mat_mul()”:

As we can see, in Graph Mode, it will go to “_apply_op_helper()” to build graph (but not running it). In Eager Mode, it will execute the Operation directly.

Training DNN with less memory cost

The paper “Training Deep Nets with Sublinear Memory Cost” tells us a practical method to train DNN with far less memory cost. The mechanism behind is not difficult to understand: when training a deep network (a computing graph), we have to store temporary data in every node, which will occupy extra memory. Actually, we could remove these temporary data after computing each node, and compute them again in back-propagation period. It’s a tradeoff between computing time and computing space.

The author give us an example in MXNET. The improvement of memory-reducing seems tremendous.

Above the version 1.3, tensorflow also brought a similar module: memory optimizer. We can use it like this:

Still need to add op in Resnet:

By using this method, we could increase batch-size even in deep network (Resnet-101 etc.) now.

The performance of R-CNN in mxnet

We are trying to use faster R-CNN network (also is an example in mxnet) to automatically extract bird from pictures. But it will cost 10 seconds to recognize a bird from a picture by using CPU, which is too slow to be used in product environment. To improve the performance, I download the MKL with version-2017u4 from Intel site and install it in the server. After recompile mxnet:

it only cost 3~4 seconds to recognize bird from picture. MKL really works!

Using GPU to do inference is a another option. But a EC2 instance with a GPU device is much more expensive than a normal EC2 instance. So we will still using CPU in the near future.

Price of EC2 instance in US-West(Oregon)

vCPU ECU Memory (GiB) Instance Storage (GB) Linux/UNIX Usage
t2.large 2 Variable 8 EBS Only $0.0928 per Hour
g2.2xlarge 8 26 15 60 SSD $0.65 per Hour

Use Mxnet To Classify Images Of Birds (Fourth Episode)

More than half a year past since previous article. In this period, Alan Mei (my old ex-colleague) collected more than 1 million pictures of Chinese Avians. And after Alexnet, VGG19, I finally chose Resnet-18 as my DNN model to classify different kinds of Chinese birds. Resnet-18 model has far less parameters of network than VGG19, but still get enough capability of representation.

Collecting more than 1 million sample pictures of birds and label them (some by program, and some by hand) is really a tedious work. I really appreciate Alan Mei for accepting so hard a job, although he said he is a Avian fans :). And I also need to thank him for giving me a Personal Computer with GTX970 GPU. Without this GPU, I would not train my model so fast.

To make the accuracy of classifying better, I have read the book “Deep Learning” and many other papers (Not only Resnet-paper, of course). The reward of knowledge about machine learning and deep learning is abundant for me. But the most important of all is: I enjoyed the learning of new technology again.

Today, we launch this simple web: http://en.dongniao.net/ . In Chinese language, “dongniao” means “Understanding Avians”. Hope the Avian-Fans and Depp Learning Fans will love it.

Using Spark-SQL to transfer CSV file to Parquet

After downloading data from “Food and Agriculture Organization of United Nations”, I get many CSV files. One of the file is named “Trade_Crops_Livestock_E_All_Data_(Normalized).csv” and it looks like:

To load this CSV file into Spark and dump it to Parquet format, I wrote these codes:

The build.sbt is

Always remember to add dependency for “spark-sql” or else it will report “createDataFrame() if not a member of spark”.
And finally, the submit script is:

Importance of function’s return Type in Scala

The Scala code is below:

But the output of this section of code is:

Nothing, just a pair of parenthesis. What happen? Why doesn’t Scala use the ‘call’ function in subclass ‘Student’?
The answer is: we forget to define the return Type of function ‘call’, so its default return Type is ‘Unit’. The value of ‘Unit’ is only ‘()’. Hence no matter what value we give to ‘call’, it only return ‘()’.
We just to add return Type:

It print “robin” now.

Data Join in AWS Redshift

In “Amazon Redshift Database Developer Guide“, there is an explanation for data join:
“HASH JOIN and HASH are used when joining tables where the join columns are not both distribution keys and sort keys.
MERGE JOIN is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted.”

Let’s take ‘salary’ and ’employee’ for example.

Firstly, we EXPLAIN the join of ‘salary’ and ’employee’, and it shows “Hash Join”:

Then we create two new tables:

Currently, the join column is both distkey and sortkey. Hence EXPLAIN shows “Merge Join”:

Some tips about “Amazon Redshift Database Developer Guide”

Show diststyle of tables

Details about distribution styles: http://docs.aws.amazon.com/redshift/latest/dg/viewing-distribution-styles.html

How to COPY multiple files into Redshift from S3

Could “Group” (or “Order”) by number, not column name

Change current environment in SQL Editor

Primary key and foreign key
Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer uses them when it generates query plans. If you set primary keys and foreign keys, your application must maintain the validity of the keys. 

Distribution info in EXPLAIN
No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
The inner table is redistributed.
The outer table is redistributed.
A copy of the entire inner table is broadcast to all the compute nodes.
The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
Both tables are redistributed.

Create Like

Interleaved skew

The value for interleaved_skew is a ratio that indicates the amount of skew. A value of 1 means there is no skew. If the skew is greater than 1.4, a VACUUM REINDEX will usually improve performance unless the skew is inherent in the underlying set.

About interleaved sort key: http://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html#t_Sorting_data-interleaved

Concurrent write
Concurrent write operations are supported in Amazon Redshift in a protective way, using write locks
on tables and the principle of serializable isolation. 


Redshift UDF
In addition to the Python Standard Library, the following modules are part of the Amazon Redshift implementation:
* numpy 1.8.2 

* pandas 0.14.1 

* python-dateutil 2.2 

* pytz 2015.7 

* scipy 0.12.1 

* six 1.3.0 

* wsgiref 0.1.2

Data Join
* Nested Loop
: The least optimal join, a nested loop is used mainly for cross-joins (Cartesian products) and some 
inequality joins. 

* Hash Join and Hash 
Typically faster than a nested loop join, a hash join and hash are used for inner joins and left and
right outer joins. These operators are used when joining tables where the join columns are not both distribution keys and sort keys. The hash operator creates the hash table for the inner table in the join; the hash join operator reads the outer table, hashes the joining column, and finds matches in the inner hash table. 

* Merge Join 
Typically the fastest join, a merge join is used for inner joins and outer joins. The merge join is not used for full joins. This operator is used when joining tables where the join columns are both distribution keys and sort keys, and when less than 20 percent of the joining tables are unsorted. It reads two sorted tables in order and finds the matching rows. To view the percent of unsorted rows, query the SVV_TABLE_INFO (p. 786) system table. 

You can temporarily override the amount of memory assigned to a query by setting the wlm_query_slot_count parameter to specify the number of slots allocated to the query. 
By default, WLM queues have a concurrency level of 5 

A VARCHAR(12) column can contain 12 single-byte characters, 6 two-byte characters, 4 three- byte characters, or 3 four-byte characters. 

Tuple in Redshift SQL


To reduce processing time and improve overall system performance, Amazon Redshift skips analyzing a table if the percentage of rows that have changed since the last ANALYZE command run is lower
than the analyze threshold specified by the analyze_threshold_percent parameter. By default, analyze_threshold_percent is 10

COPY from DynamoDB
Setting READRATIO to 100 or higher will enable Amazon Redshift to consume the entirety of the DynamoDB table’s provisioned throughput, which will seriously degrade the performance of concurrent read operations against the same table during the COPY session. Write traffic will be unaffected.

Different databases in Redshift
After you have created the TICKIT database, you can connect to the new database from your SQL client. Use the same connection parameters as you used for your current connection, but change the database name to tickit.

Interleaved Sort Key
A maximum of eight columns can be specified for an interleaved sort key.

Concatenate in SQL


Prepare and execute PLAN

Powerful ‘WITH’ for sub-query in SQL

UNLOAD with compression



Show and set current settings

Show blocks(1MB) allocated to each column in the ‘salary’ table

Slice and Col
slice: Node slice
col: Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, ROW_ID

In STV_SLICES , we can see relations between slice and node. Single node have two slices: 0 and 1

Common used tables for meta information