The paper “Training Deep Nets with Sublinear Memory Cost” describes a practical method for training deep neural networks with far less memory. The mechanism is not hard to understand: when training a deep network (a computation graph), we have to store the intermediate output of every node for the backward pass, and this occupies extra memory. Instead, we can discard these intermediate results once the forward pass no longer needs them and recompute them during back-propagation. It is a tradeoff between computation time and memory.
The authors give an example in MXNet. The memory savings seem tremendous.
In versions above 1.3, TensorFlow also provides a similar module: the memory optimizer. We can use it like this:
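A minimal sketch of this recompute-in-backward idea, in plain Python rather than any framework. The chain of scalar `tanh` layers, the helper names, and the segment length `k` are all assumptions made for illustration; the paper itself works on general computation graphs.

```python
import math

def make_layer(w):
    # Hypothetical toy layer: f(v) = tanh(w * v), returned together with its derivative.
    return (lambda v: math.tanh(w * v),
            lambda v: w * (1.0 - math.tanh(w * v) ** 2))

def forward_full(x, layers):
    # Usual approach: keep every activation (n + 1 values for n layers).
    acts = [x]
    for f, _ in layers:
        acts.append(f(acts[-1]))
    return acts

def backward_full(acts, layers):
    # Chain rule, using the stored input of each layer.
    grad = 1.0
    for i in reversed(range(len(layers))):
        grad *= layers[i][1](acts[i])
    return grad

def forward_checkpointed(x, layers, k):
    # Keep only the input of every k-th layer; drop everything else.
    checkpoints = {0: x}
    a = x
    for i, (f, _) in enumerate(layers):
        a = f(a)
        if (i + 1) % k == 0 and (i + 1) < len(layers):
            checkpoints[i + 1] = a
    return a, checkpoints

def backward_checkpointed(checkpoints, layers, k):
    # Walk the segments from last to first, recomputing each segment's
    # activations from its checkpoint just before they are needed.
    grad = 1.0
    n = len(layers)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        seg = [checkpoints[start]]          # inputs of layers start .. end-1
        for i in range(start, end - 1):
            seg.append(layers[i][0](seg[-1]))
        for i in reversed(range(start, end)):
            grad *= layers[i][1](seg[i - start])
    return grad

layers = [make_layer(0.5 + 0.1 * i) for i in range(16)]
acts = forward_full(0.3, layers)
grad_full = backward_full(acts, layers)

out, checkpoints = forward_checkpointed(0.3, layers, k=4)
grad_ckpt = backward_checkpointed(checkpoints, layers, k=4)

print(len(acts), len(checkpoints), grad_full, grad_ckpt)
```

With n layers and k close to sqrt(n), both the number of checkpoints and the length of a recomputed segment are about sqrt(n), which is where the "sublinear" in the title comes from; the price is roughly one extra forward pass.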

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2
....
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.MANUAL
with tf.Session(config=config) as sess:
....

We still need to add an op to the ResNet graph and mark which ops may be recomputed:

# Add a 'reshape' with a special name at the end of every residual unit
dims = x.get_shape().as_list()
x = tf.reshape(x, dims, name='robin_137')
....
import re

def ignore(op):
    # Op types that are skipped when marking ops for recomputation
    return op.type in ['Const', 'VariableV2', 'Identity', 'Assign', 'Placeholder', 'RandomShuffleQueueV2', 'QueueEnqueueV2', 'QueueDequeueManyV2']

def checkpoint(op):
    # Ops whose name contains the special marker are kept as checkpoints
    return re.match(".*robin_137", op.name) is not None
# Set the '_recompute_hint' attribute on every op that is neither ignored nor a checkpoint
ops = tf.get_default_graph().get_operations()
mirrors = [op for op in ops if not ignore(op) and not checkpoint(op)]
for op in mirrors:
    op.node_def.attr['_recompute_hint'].i = 0
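The selection logic above can be checked without building a graph. The sketch below is only an illustration: `FakeOp` and the op names in the list are hypothetical stand-ins for real `tf.Operation` objects, and only the `ignore`/`checkpoint` filtering is reproduced.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for tf.Operation: only the two fields the filter reads.
FakeOp = namedtuple("FakeOp", ["name", "type"])

IGNORED_TYPES = {'Const', 'VariableV2', 'Identity', 'Assign', 'Placeholder',
                 'RandomShuffleQueueV2', 'QueueEnqueueV2', 'QueueDequeueManyV2'}

def ignore(op):
    return op.type in IGNORED_TYPES

def checkpoint(op):
    # re.match anchors only at the start, so a name with a suffix after
    # the marker (e.g. 'robin_137_1') still counts as a checkpoint.
    return re.match(".*robin_137", op.name) is not None

# Made-up op names, for illustration only.
ops = [
    FakeOp("unit_1/conv1/Conv2D", "Conv2D"),
    FakeOp("unit_1/conv1/weights", "VariableV2"),
    FakeOp("unit_1/robin_137", "Reshape"),
    FakeOp("unit_2/Relu", "Relu"),
    FakeOp("unit_2/robin_137_1", "Reshape"),
]

mirrors = [op.name for op in ops if not ignore(op) and not checkpoint(op)]
print(mirrors)
```

Because TensorFlow makes duplicate op names unique by appending suffixes such as `_1`, a prefix match rather than an exact match is what lets one marker name cover every residual unit.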

With this method, we can now increase the batch size even for deep networks (ResNet-101, etc.).