Some details about Arduino Uno

In a previous article, I reviewed some open-source hardware for teaching children programming, and concluded that the Micro:bit was the best choice. But recently, after searching through many different types of microcontrollers, I noticed that a version of the Arduino Uno R3 (less than $3) is dramatically cheaper than the Micro:bit (about $16) on a famous e-commerce website in China. Despite the low price, the Arduino also has a simple and intuitive IDE (Integrated Development Environment):

The Arduino IDE is derived from Processing, and its programming language looks just like C. A Python developer could also learn it quickly, although unlike Python it is statically typed.
I also found some example code that blinks an LED with an Arduino:

Easy enough, right? But there are even more convenient graphical programming tools, such as ArduBlock and Mixly:

The demo of ‘ArduBlock’

The demo of ‘Mixly’

With an easy-to-learn language and graphical programming interfaces, I think the Arduino is also a good choice for teaching children programming.

My choice between Raspberry Pi, Arduino, Pyboard and Micro:bit

I want to teach my child programming. But teaching a child to sit still and keep watching a computer screen is not easy, I think, for children usually can't focus on a boring development IDE for more than ten minutes. Therefore I tried to find a microcontroller that could be used for interesting projects for kids, such as reading the temperature and humidity of the environment, or controlling the micro motors of a toy car.

There are many microcontroller and microcomputer boards on the market, so I had to compare them and choose the most suitable one.

Raspberry Pi

The Raspberry Pi is very powerful. It can do almost anything a personal computer or laptop can do. But the problem with the Raspberry Pi is that it is too difficult for a child to learn. The other reason I gave it up is the price: $35 for the bare board without any peripherals.


Arduino

The Arduino is cheap enough, but you can only program it in C. Using C requires solid knowledge of computer science, such as memory models and data structures. Imagine asking a pupil to implement a well-working dictionary in C: it is like asking them to build a spaceship in the backyard.

Now I could narrow my choices down to boards that support Python, or MicroPython. Python is easy to understand, looks much more straightforward, and can also be used interactively. In a word, it's much easier to learn than C.

So let's take a look at the boards that support MicroPython.


Pyboard

The Pyboard is simple and cheap enough, and also supports MicroPython. Its only imperfection is that its hardware interface is hard to use for someone who doesn't know hardware very well.


Micro:bit

Launched in 2015 by the BBC, the Micro:bit is, in my opinion, the most suitable board for a child to learn programming and even IoT (Internet of Things). It is cheap: only $18. It supports programming in MicroPython and JavaScript (with a graphical IDE), so a child can control it with a few lines of Python code. It also supports a bunch of peripherals such as micro motors, thermometers, hygrometers, Bluetooth, and a 2.4 GHz radio. Children can use it as the core controller of an intelligent toy car. The Micro:bit even has Android/iOS apps, which is perfect for little children under 7.

So this is my choice: Micro:bit.

Read paper “Large-Scale Machine Learning with Stochastic Gradient Descent”

Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent


GD (Gradient Descent) is used for computing the weights of a NN (and of many other machine learning algorithms). z_i represents example i, i.e. the pair (x_i, y_i). After going through all the examples, we compute the average of the per-example gradients with respect to the weights and make one update. Walking through every example for a single update is a slow process, so we can imagine that GD is not efficient enough.
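To make the update concrete, here is a tiny batch-GD sketch in plain Python; the linear model, toy data, and learning rate are all made up for illustration, not taken from the paper:

```python
import random

# Toy data: y = 2*x + 1 plus a little noise; each z_i is the pair (x_i, y_i)
random.seed(0)
examples = [(x, 2 * x + 1 + random.uniform(-0.01, 0.01))
            for x in [i / 10 for i in range(20)]]

w, b = 0.0, 0.0   # weights to learn
lr = 0.1          # learning rate

for step in range(2000):
    # Batch GD: average the gradient over ALL examples before one update
    gw = sum(2 * (w * x + b - y) * x for x, y in examples) / len(examples)
    gb = sum(2 * (w * x + b - y) for x, y in examples) / len(examples)
    w -= lr * gw
    b -= lr * gb
# w is now close to 2 and b close to 1
```

Note that every one of the 2000 steps touches all 20 examples, which is exactly the cost the paper points out.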


Here comes SGD, which uses only one example to compute the gradient for each update. It is simpler, and more efficient.
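A minimal SGD sketch of the same kind of problem (toy linear-regression data and step size of my own choosing): each update uses a single example, so the weights start moving after the very first sample.

```python
import random

random.seed(0)
examples = [(x, 2 * x + 1 + random.uniform(-0.01, 0.01))
            for x in [i / 10 for i in range(20)]]

w, b = 0.0, 0.0
lr = 0.05

for epoch in range(200):
    random.shuffle(examples)       # visit examples in random order
    for x, y in examples:          # one example per weight update
        err = w * x + b - y
        w -= lr * 2 * err * x
        b -= lr * 2 * err
# w and b hover near 2 and 1, without ever averaging over the whole dataset
```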


Using SGD in the K-means clustering algorithm seemed counterintuitive to me at first glance. But after thinking of it as "sample z_i belongs to the cluster of w_k, so don't wait for all the samples, just update w_k with z_i", it becomes conceivable.
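Here is a minimal sketch of that online view of K-means (the 1-D data and initial centers are invented for illustration): each sample immediately moves its nearest center w_k, using the paper's 1/n_k step size.

```python
import random

random.seed(1)
# Two obvious 1-D clusters, around 0 and around 10
data = ([random.gauss(0, 0.5) for _ in range(200)] +
        [random.gauss(10, 0.5) for _ in range(200)])
random.shuffle(data)

centers = [1.0, 9.0]   # initial guesses for the w_k
counts = [0, 0]        # n_k: how many samples each center has absorbed

for z in data:
    # find the nearest center w_k for sample z_i ...
    k = min(range(len(centers)), key=lambda j: (z - centers[j]) ** 2)
    counts[k] += 1
    # ... then update w_k immediately, without waiting for the other samples
    centers[k] += (z - centers[k]) / counts[k]
# centers end up near 0 and 10
```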


ASGD (Averaged SGD) is suitable for a distributed machine-learning environment, since it can take a gradient from any example of the data at any time (no ordering constraint).

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: In-Datacenter Performance Analysis of a Tensor Processing Unit

NN (Neural Network) training uses floating-point numbers (16-bit or 32-bit); afterwards, a step called quantization transforms the floating-point numbers into narrow integers (often just 8 bits), which are usually good enough for inference.
MLPs (Multi-Layer Perceptrons), CNNs (Convolutional Neural Networks), and RNNs (Recurrent Neural Networks): these three types of NN represent 95% of the NN inference workload in Google's datacenters. Therefore the TPU mainly focuses on them.
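As a rough sketch of what such a quantization step can look like (this is a generic affine scheme of my own, not the TPU's exact one), floats are mapped onto 8-bit integers with a scale and an offset:

```python
def quantize(xs, bits=8):
    """Map floats onto [0, 2**bits - 1] integers using a scale and offset."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2 ** bits - 1)
    q = [round((x - lo) / scale) for x in xs]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate floats from the integers."""
    return [v * scale + lo for v in q]

weights = [-0.51, -0.02, 0.33, 0.71]      # made-up float32 weights
q, scale, lo = quantize(weights)          # narrow 8-bit integers
approx = dequantize(q, scale, lo)
# Each recovered weight is within one quantization step of the original,
# which is usually good enough for inference.
```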

As we can see, CNNs are usually computation-dense NNs, which makes them a better fit for the TPU.

The TPU has 25 times as many MACs (Multiply-Accumulate units) and 3.5 times as much on-chip memory as the K80 GPU.

The TPU was designed to be a coprocessor on the PCIe I/O bus; it is more like an FPU (floating-point unit) than like a GPU.

The parameters of the NN model (the weights) come from off-chip memory (8 GiB of DDR3 DRAM) into the Weight FIFO, and then flow into the MMU (Matrix Multiply Unit). The requests (samples to run inference on) come from PCIe into the Unified Buffer, and finally also flow into the MMU.
Even the "Activation" and "Pooling" algorithms of CNNs have been fixed into the hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.

According to the floor plan, we can imagine that the UB (Unified Buffer) and the MMU take up most of the TPU's die area, and likely most of its energy.

TPU instructions follow the CISC tradition, and there are only about a dozen of them, including Read_Host_Memory, Read_Weights, MatrixMultiply, Activate, etc. Recalling how much code we need to write to implement an efficient activation function, we can imagine the speed of a single Activate instruction on the TPU.
The paper says the TPU is a type of systolic array. But what is a systolic array? Here is an explanation: a systolic array is a network of processors that rhythmically compute and pass data through the system.
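As a crude mental model (not the TPU's real pipeline), we can simulate a tiny weight-stationary MAC grid in which the weights stay pinned in the cells and the inputs stream past, one partial product per "beat":

```python
# Weight-stationary 2x2 grid: W is pinned in the cells, X streams through,
# and each cell performs one multiply-accumulate per beat. Computes Y = X @ W.
W = [[1, 2],
     [3, 4]]
X = [[5, 6],
     [7, 8]]

n = 2
acc = [[0] * n for _ in range(n)]      # one accumulator per output element
for beat in range(n):                  # one beat per term of each dot product
    for i in range(n):
        for j in range(n):
            # Cell (beat, j) holds weight W[beat][j]; the input value
            # X[i][beat] pulses past it and is accumulated into acc[i][j].
            acc[i][j] += X[i][beat] * W[beat][j]
# acc == [[5*1+6*3, 5*2+6*4], [7*1+8*3, 7*2+8*4]] == [[23, 34], [31, 46]]
```

The point of the real hardware layout is that each value is fetched once and reused as it flows through the grid, instead of being re-read from memory for every MAC.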

There are lots of tables and diagrams showing the top-rate performance of the TPU. But although the TPU is fast, the gain also depends on the computational density of the application. CNNs are the most computation-dense NNs, so they gain the most speed (in TeraOps per second) from the TPU:

The paper doesn't explain why the GPU is slower than the TPU at inference. The only sentence on this topic is in "8. Discussion": "GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals". Honestly, I don't think this is a serious explanation.
The interesting thing is that after Google published this paper, Nvidia's CEO, Jensen Huang, wrote a blog post to gently point out a fact: the state-of-the-art GPU (the Tesla P40) can run inference faster than the TPU. The war between the giants of deep learning is just beginning.

Read paper “A Column Store Engine for Real-Time Streaming Analytics” (MemSQL)

Paper reference: A Column Store Engine for Real-Time Streaming Analytics

According to the official website, MemSQL is "a high performance data warehouse designed for the cloud that delivers ultra fast insights of your live and historical data". It uses row storage with a lock-free engine for data in memory, and column storage for data on disk.
MemSQL can also persist all data to disk. In its most durable configuration, MemSQL will not lose any transaction that has been acknowledged. It implements the "Read Committed" isolation level for transactions.
I heard about MemSQL in early 2012, but didn't know it had become a combined OLTP-and-OLAP system until a few days ago.

What is the problem?
To fulfill OLAP jobs, MemSQL has to store data on disk in a columnar format. But MemSQL still needs to process OLTP requests such as random INSERTs and UPDATEs. Therefore it stores data in a structure named a "segment", and by connecting different segments into ordered segment lists (named "sorted runs"), MemSQL balances the needs of frequent INSERT/UPDATE operations against SCAN/GROUP BY operations.
For example:
1. Users INSERT three keys: 1, 92, 107. MemSQL will create a segment that contains the three keys:
        [1, 92, 107]
2. Users continue to INSERT two keys: 63, 84. The segment list is now:
        [1, 92, 107]
        [63, 84]
3. After many INSERT operations, the segments become:
        [1, 92, 107]
        [2, 17, 42]
        [63, 84]
        [110, 118, 172]

Now MemSQL organizes these segments into "sorted runs", which keep the keys in a basic order:

When a SELECT comes, MemSQL can find the row quickly by looking up just two ordered segment lists. Users can also SCAN the two segment lists efficiently for OLAP tasks, since all the data is stored in columnar format.
What happens if users INSERT more rows? MemSQL merges the old, big sorted runs and creates new segments for the freshly inserted data, which keeps the number of sorted runs acceptable.
In practice, the MemSQL column store engine uses a constant of 8, so that the biggest sorted run holds at least a fixed fraction of all the segments, the second biggest sorted run at least the same fraction of the remaining segments, and so forth. This strategy looks just like the LSM tree in Google's LevelDB. The difference is that LevelDB stores each key-value pair individually, while MemSQL stores a whole batch of rows in one segment.

If INSERT operations arrive while MemSQL is merging, small merge actions are aborted and relaunched. Big merge actions simply skip any missing or updated segments, since skipping some segments will not ruin the newly merged sorted runs.
As we can see, MemSQL endeavors to avoid locking for in-memory data operations, which makes its performance significantly better.
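The segment and sorted-run mechanics described above can be sketched in a much-simplified toy model (all function names and details here are hypothetical; the real engine is far more involved):

```python
import bisect
import heapq

segments = []   # each segment is a sorted list of keys; together, the sorted runs

def insert_batch(keys):
    # An INSERT batch becomes one new sorted segment
    segments.append(sorted(keys))

def contains(key):
    # SELECT: binary-search each sorted run instead of scanning every row
    for seg in segments:
        i = bisect.bisect_left(seg, key)
        if i < len(seg) and seg[i] == key:
            return True
    return False

def merge_runs():
    # The merger: fold all sorted runs into one big sorted run
    merged = list(heapq.merge(*segments))
    segments[:] = [merged]

# The example from the text:
insert_batch([1, 92, 107])
insert_batch([63, 84])
insert_batch([2, 17, 42])
insert_batch([110, 118, 172])
assert contains(92) and not contains(50)
merge_runs()   # one sorted run: [1, 2, 17, 42, 63, 84, 92, 107, 110, 118, 172]
```

Lookups stay cheap because each run is sorted, and merging keeps the number of runs that a SELECT must probe small.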

There are also some practical considerations for MemSQL.

  1. If only one merger worked on the big sorted runs, it would take too much time, and under intensive INSERTs the small sorted runs would grow tremendously. So MemSQL launches two mergers: a Fast Merger and a Slow Merger.
  2. MemSQL can accumulate some rows before batching them into a segment, which reduces data fragmentation.
  3. MemSQL also provides special commands that let users sort all the segments, or just reduce the number of sorted runs.
  4. The columnar storage format in memory makes it possible to use the CPU's SIMD instructions. This sheds some light for me: maybe one day we can run machine-learning jobs directly on MemSQL 🙂

My understanding of CNN (Convolutional Neural Network)

The classic neural networks of machine learning are usually fully connected, which costs too much computing resource to get a result when the inputs are high-resolution images. Hence the Convolutional Neural Network. A CNN splits the whole big image into small pieces (called receptive fields), applies some "convolutional operations" (actually image transformations, also called kernels) to each receptive field, and then performs a pooling operation (usually max-pooling, which simply keeps the biggest feature value in each 2×2 patch).

Receptive fields are easy to understand, but why apply different kinds of "convolutional operations" to them? In my opinion, "convolutional operations" means using different kernel functions to transform the same image (for example, sharpening it, or detecting the edges of objects in it), so they can reveal different views of the same image.
These different kernel functions reveal different "features" of an image, so we call their outputs "feature maps":
Convolutional Neural Network
(The light-yellow matrix is simply the transform of the light-gray matrix on its left)

By using receptive fields and max-pooling, the number of neurons gradually becomes very small, which makes the computation (or regression) much easier and faster:
Convolutional Neural Network

Therefore, I reckon the main purpose of using a CNN is to reduce the computational difficulty of a fully connected neural network.
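To make the convolution-then-pooling idea concrete, here is a toy sketch in plain Python. The 5×5 "image" and the 2×2 kernel are invented for illustration; this kernel responds to vertical edges where pixels go from 0 to 1.

```python
def conv2d(image, kernel):
    """Slide the kernel over every receptive field and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool_2x2(fm):
    """Keep only the biggest value in each 2x2 patch of the feature map."""
    return [[max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
             for j in range(0, len(fm[0]) - 1, 2)]
            for i in range(0, len(fm) - 1, 2)]

# A 5x5 image with a vertical 0->1 edge between columns 1 and 2
image = [[0, 0, 1, 1, 1]] * 5
kernel = [[-1, 1],
          [-1, 1]]              # fires on a left-to-right brightness increase

fm = conv2d(image, kernel)      # 4x4 feature map; every row is [0, 2, 0, 0]
pooled = max_pool_2x2(fm)       # 2x2; the strong edge response survives pooling
```

The 5×5 input shrinks to a 2×2 output while the kernel's edge response is preserved, which is exactly the "fewer neurons, same features" effect described above.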