TPU

Some tips about using Google’s TPU (Cont.)

Sometimes I get this error from TPUEstimator:

...
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded

After I stopped and restarted the TPU in the GCP console, the error disappeared. A TPU can’t be used directly the way a GPU can: you won’t see a device node such as ‘/dev/tpu’ in the VM. Google exposes the TPU as an RPC service, so DNN training can only run through that service. I suspect this RPC service is not stable enough yet, so it sometimes stops responding and leads to the ‘Deadline Exceeded’ error.
When I get this type of error from TPU:

2018-09-29 01:57:12.779430: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

The only solution I found is to create a new TPU instance and delete the old one in the GCP console. It seems Google needs to improve the robustness of its TPU RPC service.
Running 10000 steps and getting the ‘loss’ for every turn:

INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 2.3713436.
INFO:tensorflow:Loss for final step: 2.9957836.
INFO:tensorflow:Loss for final step: 1.3974692.
INFO:tensorflow:Loss for final step: 1.3933656.
INFO:tensorflow:Loss for final step: 2.3544135.
INFO:tensorflow:Loss for final step: 1.9383199.
INFO:tensorflow:Loss for final step: 2.0213509.
INFO:tensorflow:Loss for final step: 1.8641331.
INFO:tensorflow:Loss for final step: 1.6767861.
INFO:tensorflow:Loss for final step: 2.63849.
INFO:tensorflow:Loss for final step: 2.19468.
INFO:tensorflow:Loss for final step: 1.9854712.
INFO:tensorflow:Loss for final step: 1.9380764.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.089243.
INFO:tensorflow:Loss for final step: 2.1150723.
INFO:tensorflow:Loss for final step: 1.8242038.
INFO:tensorflow:Loss for final step: 2.8426473.

It’s quite strange that the ‘loss’ can’t go low enough. I still need to do more experiments.
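To judge whether the loss is really stuck or just noisy, it can help to pull the numbers out of the log and smooth them. The snippet below is only a sketch: it copies a handful of the log lines above, and the helper names `parse_losses` and `moving_average` are made up for illustration.

```python
import re
from statistics import mean

# A few of the log lines shown above, copied verbatim.
LOG = """\
INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.8426473.
"""

# Matches the number after "Loss for final step:" and ignores the trailing period.
LOSS_RE = re.compile(
    r"Loss for final step: ([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\.?\s*$"
)


def parse_losses(text):
    """Return the final-step losses found in the log text, in order."""
    losses = []
    for line in text.splitlines():
        match = LOSS_RE.search(line)
        if match:
            losses.append(float(match.group(1)))
    return losses


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


losses = parse_losses(LOG)
print("min:", min(losses), "max:", max(losses))
print("smoothed:", moving_average(losses, 3))
```

If the smoothed series is flat, the model is probably not learning any more; if it still drifts downward, the jumps between turns are mostly noise.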
Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process 100 samples per second. With 8 TPUs of version 2 (TPUv2-8), it processes about 500 samples per second. At first I was disappointed by the speedup of TPUv2, since it implies only about 1.4 TFLOPS for each TPU. But then I noticed that the bottleneck may not be the TPU itself, since IO usually limits training speed. Besides, my model is MobileNet_v2, which is too simple and light to exploit the full capability of the TPU.
Therefore I set ‘depth_multiplier=4’ for MobileNet_v2. With this model, the GTX 960 processes 21 samples per second and the TPUv2-8 processes 275 samples per second. This time, I estimate that each TPUv2 delivers about 4 TFLOPS. I know this looks far too low compared with Google’s official 45 TFLOPS. But considering the possible bottlenecks of storage IO and network bandwidth, it becomes understandable. There is also another possibility: Google’s 45 TFLOPS may refer to half-precision performance 🙂
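The per-TPU figures above can be reproduced with a back-of-the-envelope calculation: scale the GPU’s peak throughput by the measured speedup and divide by the number of TPUs. This is only a sketch of one way to arrive at such numbers. The GTX 960 peak of about 2.3 TFLOPS (FP32) is my assumption and is not measured here, and the function name is made up for illustration.

```python
# Assumption: approximate FP32 peak of a GeForce GTX 960. Not measured in this post.
GTX960_PEAK_TFLOPS = 2.3


def estimate_tflops_per_tpu(gpu_samples_per_sec, tpu_samples_per_sec,
                            num_tpus=8, gpu_tflops=GTX960_PEAK_TFLOPS):
    """Scale the GPU's peak by the observed speedup, then split it across the TPUs.

    This pretends the GPU runs at its peak and that throughput scales linearly
    with FLOPS, so the result is a rough estimate at best.
    """
    speedup = tpu_samples_per_sec / gpu_samples_per_sec
    return speedup * gpu_tflops / num_tpus


# MobileNet_v2 with the default depth multiplier: 100 vs. 500 samples per second.
print(round(estimate_tflops_per_tpu(100, 500), 2))   # about 1.4

# MobileNet_v2 with depth_multiplier=4: 21 vs. 275 samples per second.
print(round(estimate_tflops_per_tpu(21, 275), 2))    # a little under 4
```

Because the GPU almost never reaches its peak in practice, the real figure is likely even lower than this estimate, which again points to IO rather than compute as the bottleneck.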

Google has just released TensorFlow 1.11 for TPU clusters. At first, I thought I could use hooks in TPUEstimatorSpec now, but after adding
def model_fn():
    ...
    logging_hook = tf.train.LoggingTensorHook({'loss': loss}, every_n_iter = 100)
    return tf.contrib.tpu.TPUEstimatorSpec(mode, loss = loss, training_hooks = [logging_hook], train_op = train_op)
it reports
INFO:tensorflow:Error recorded from training_loop: Operation u'total_loss' has been marked as not fetchable.
Certainly, the TPU is much harder to use and debug than GPU/CPU.

Some tips about using Google’s TPU

About one month ago, I submitted a request to Google Research Cloud to use TPUs for free. Fortunately, I received the approval yesterday. The approval lets me use 5 regular Cloud TPUs and 100 preemptible Cloud TPUs for free for 30 days, and all I had to submit was my GCP project name.
Then I had to change my previous TensorFlow program so that it runs on TPUs. I can’t just change tf.device(‘/gpu:0’) to tf.device(‘/tpu:0’) in the code to run training on a Google TPU. Fortunately, there are many documents about how to modify the code for this, such as TPUEstimator, Using TPUs, etc.
Here are some tips about porting code for TPUs:
1. We can only use TPUEstimator for training

        classifier = tf.contrib.tpu.TPUEstimator(
                model_fn = model_wrapper,
                config = run_config,
                use_tpu = FLAGS.use_tpu,
                train_batch_size = 64,
                batch_axis = [0, 0],
                params = {'optimizer': opt})

Pay attention to ‘batch_axis’. It tells the TPU pod to split both the data and the labels along dimension 0, because I use the ‘NHWC’ data format.
2. model_fn and data_input_fn in TPUEstimator take more arguments than those of a regular tf.estimator.Estimator. We need to fetch some arguments (such as ‘batch_size’) from params.

def data_input_fn(params):
    batch = params['batch_size']
...
def model_fn(features, labels, mode, config, params):
...

3. TPU doesn’t support operations such as

images = tf.contrib.image.rotate(images, tf.random_uniform([1], minval = -math.pi / 4.0, maxval = math.pi / 4.0))

So try to avoid using them.
4. Use tf.data carefully, or else it will report a data shape error. The code below has run correctly so far

  dataset = files.apply(tf.contrib.data.parallel_interleave(tf.data.TFRecordDataset, sloppy = True, cycle_length = buff_size))
  dataset = dataset.map(_parse_function)
  dataset = dataset.repeat()
  dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
  dataset = dataset.shuffle(batch_size * buff_size)
  iterator = dataset.make_initializable_iterator()

5. Because we are using TPUEstimator, we can’t initialize the tf.data iterator with ‘session.run()’, so a little trick is needed:

def data_input_fn(params):
    ...
    # Register the iterator's initializer so it runs together with the table initializers
    tf.add_to_collection(tf.GraphKeys.TABLE_INITIALIZERS, iterator.initializer)
    ...

6. TensorFlow in a GCP VM instance only supports loading datasets from, and storing models into, GCP storage (‘gs://’ paths) when training on a TPU.

    run_config = tf.contrib.tpu.RunConfig(
            master = master,
            evaluation_master = master,
            model_dir = 'gs://my-project/models/',
            session_config = tf.ConfigProto(
                allow_soft_placement = True, log_device_placement = True),
            tpu_config = tf.contrib.tpu.TPUConfig(
                FLAGS.iterations, FLAGS.num_shards)
        )

7. There aren’t any hooks for TPUEstimator currently in TensorFlow 1.9, so I can’t see any report in the console after launching a TPU program. I hope Google improves this as soon as possible.
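Tips 1 and 4 are related: the batch is split along dimension 0 across the shards, so every batch must have exactly the same size, which is presumably why dropping the last incomplete batch avoids the data shape error. The sketch below shows the idea in plain Python instead of TensorFlow. The two helper functions are stand-ins I made up for illustration, not TensorFlow APIs, and 8 shards is an assumption matching a TPUv2-8.

```python
def batch_and_drop_remainder(samples, batch_size):
    """Group samples into full batches and drop a final short batch."""
    num_full = len(samples) // batch_size
    return [samples[i * batch_size:(i + 1) * batch_size] for i in range(num_full)]


def split_along_axis0(batch, num_shards):
    """Split one batch into equal slices along its first dimension."""
    if len(batch) % num_shards != 0:
        raise ValueError("batch size must be divisible by the number of shards")
    per_shard = len(batch) // num_shards
    return [batch[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]


samples = list(range(150))                       # 150 fake samples
batches = batch_and_drop_remainder(samples, 64)  # 2 full batches, 22 samples dropped
shards = split_along_axis0(batches[0], 8)        # 8 shards with 8 samples each

print(len(batches), [len(b) for b in batches])
print(len(shards), [len(s) for s in shards])
```

With train_batch_size = 64 and 8 shards, each shard always receives 8 samples, so the shapes never change between steps.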

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: “In-Datacenter Performance Analysis of a Tensor Processing Unit”
Application
NN (Neural Network) training uses floating point (16-bit or 32-bit). Then a step called quantization transforms the floating-point numbers into narrow integers, often just 8 bits, which are usually good enough for inference.
MLP (Multi-Layer Perceptrons), CNN (Convolutional Neural Networks), and RNN (Recurrent Neural Networks): these three types of NN represent 95% of the NN inference workload in Google’s datacenters. Therefore, the TPU mainly focuses on them.
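To make the quantization step concrete, here is a minimal sketch of one common scheme, symmetric linear quantization to signed 8-bit integers. The paper does not spell out the exact scheme Google uses, so treat this as an illustration only. The function names and the sample weights are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]             # hypothetical float weights
q, scale = quantize(weights)
restored = dequantize(q, scale)

print(q)
print([round(r, 4) for r in restored])
```

Each restored value differs from the original by at most half a quantization step, which is the small loss of precision that inference can usually tolerate.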

As we can see, CNNs are usually compute-dense NNs, which suit the TPU better.

The TPU has 25 times as many MACs (multiply-accumulate units) and 3.5 times as much on-chip memory as the K80 GPU.
Architecture
The TPU was designed to be a coprocessor on the PCIe I/O bus, more like an FPU (floating-point unit) than a GPU.

The parameters of the NN model (weights) come from off-chip memory (8G DDR3 DRAM) into the Weight FIFO and then flow into the MMU (Matrix Multiply Unit). The requests (samples that need inference) come over PCIe into the Unified Buffer and finally flow into the MMU as well.
Even the “Activation” and “Pooling” steps of a CNN are built into the hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.
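A quick calculation shows what 256×256 MACs mean for peak throughput. The sketch assumes a 700 MHz clock, which is the figure I recall from the paper and is not stated in this post, and it counts each multiply-accumulate as two operations.

```python
MAC_ROWS = 256
MAC_COLS = 256
CLOCK_HZ = 700e6   # assumption: clock rate as reported in the paper, not in this post


def peak_ops_per_second(rows, cols, clock_hz, ops_per_mac=2):
    """Peak throughput if every MAC does one multiply and one add per cycle."""
    return rows * cols * ops_per_mac * clock_hz


num_macs = MAC_ROWS * MAC_COLS
peak = peak_ops_per_second(MAC_ROWS, MAC_COLS, CLOCK_HZ)

print(num_macs)                 # number of MAC units
print(round(peak / 1e12, 2))    # peak in TeraOps per second
```

Under these assumptions the MMU peaks at roughly 92 TeraOps per second for 8-bit integers, which is only reachable when the array is kept fully busy.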

According to this floor plan, we can imagine that the UB (Unified Buffer) and the MMU might consume most of the TPU’s energy.

TPU instructions follow the CISC tradition, and there are only about a dozen of them, including “Read_Host_Memory”, “Read_Weights”, “MatrixMultiply”, “Activate”, etc. Recalling how much code we need to write to implement an efficient activation function, we can appreciate the speed of using a single “Activate” instruction on the TPU.
The paper says the TPU is a type of systolic array. But what is a systolic array? Here is the explanation: a systolic array is a network of processors that rhythmically compute and pass data through the system.
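To get a feeling for how data “rhythmically” flows through a systolic array, here is a small cycle-by-cycle simulation of a matrix multiply. It is a toy under my own assumptions: each cell keeps its own output sum while the inputs march right and down, whereas the paper describes the weights being preloaded into the array, so the real MMU differs in detail. The function name is made up.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of cells.

    Row i of `a` enters from the left, delayed by i cycles; column j of `b`
    enters from the top, delayed by j cycles. On every cycle each cell
    multiplies the two values arriving at it, adds the product to its own
    accumulator, and passes the values on to its right and lower neighbours.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]    # value travelling to the right
    b_reg = [[0] * m for _ in range(n)]    # value travelling downwards

    for t in range(n + m + k - 2):         # cycles until the last product is done
        # Walk from the far corner so each cell still sees its neighbours'
        # values from the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:
                    a_in = a[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:
                    b_in = b[t - j][j] if 0 <= t - j < k else 0
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

No cell ever reads from a shared memory during the computation; it only talks to its neighbours, which is what lets the hardware pack so many MACs into one chip.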
Performance
There are a lot of tables and diagrams that show the peak performance of the TPU. Although the TPU is fast, its speed also depends on the compute density of the application. CNNs are the most compute-dense NNs, so they gain the most speed (TeraOps per second) from the TPU.
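Why compute density matters can be sketched with a simple roofline-style bound: throughput is limited either by the peak compute rate or by how fast weights can be fetched from memory. The numbers below are assumptions on my part, namely a peak of about 92 TeraOps per second and about 34 GB/s of weight memory bandwidth as I recall them from the paper, and the two intensities are hypothetical workloads.

```python
PEAK_OPS = 92e12          # assumption: peak 8-bit ops per second, as I recall from the paper
MEM_BANDWIDTH = 34e9      # assumption: bytes per second of weight memory bandwidth


def attainable_ops_per_second(peak_ops, mem_bandwidth_bytes, ops_per_byte):
    """Upper bound on throughput for a workload doing `ops_per_byte` operations
    for every byte it has to fetch from memory."""
    return min(peak_ops, mem_bandwidth_bytes * ops_per_byte)


low_density = attainable_ops_per_second(PEAK_OPS, MEM_BANDWIDTH, 100)     # hypothetical workload
high_density = attainable_ops_per_second(PEAK_OPS, MEM_BANDWIDTH, 5000)   # hypothetical workload

print(low_density / 1e12, "TeraOps/s, limited by memory")
print(high_density / 1e12, "TeraOps/s, limited by compute")
```

A network that reuses each weight many times, as a CNN does, sits on the compute-limited side, while one that touches each weight only a few times is held back by memory bandwidth no matter how many MACs the chip has.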

The paper doesn’t explain why the GPU is slower than the TPU at inference. The only sentence about this topic is in “8 Discussion”: “GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”. Actually, I don’t think this is a serious explanation.
The interesting thing is that, after Google published this paper, the CEO of Nvidia, Jensen Huang, wrote a blog post to gently point out that the state-of-the-art GPU (Tesla P40) can run inference faster than the TPU. The war between the giants of deep learning is just beginning.

Google TPU from Hao(Robin) Dong

Robin on Linux
