I am trying to write code for training on multiple GPUs. The code is mainly taken from the ‘Distributed TensorFlow’ example. I have changed it slightly to run on GPUs:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:%d" % (FLAGS.task_index, FLAGS.task_index),
        cluster=cluster)
...
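
For context, here is a minimal sketch of the between-graph replication setup this device setter belongs to; the cluster addresses, argument handling and the toy model below are assumptions of mine, not the contents of the original model.py:

import sys

import tensorflow as tf

# Hypothetical cluster layout; the real model.py builds its own ClusterSpec.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

job_name = sys.argv[1]            # "ps" or "worker"
task_index = int(sys.argv[2])

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()
else:
    # Variables go to the ps job, compute ops to this worker's GPU.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d/GPU:%d" % (task_index, task_index),
            cluster=cluster)):
        x = tf.random_normal([32, 10])
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)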

But after launching the scripts below:

python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
python model.py train 0.9 0.0001 0.53 worker 0 &> worker0.log &
python model.py train 0.9 0.0001 0.53 worker 1 &> worker1.log &
...

it reports:

Traceback (most recent call last):
  File "model.py", line 175, in 
    server = tf.train.Server(cluster, job_name = job_name, task_index = task_index)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816

It seems that each MonitoredTrainingSession tries to occupy the memory of every GPU, so once the first process has claimed it, the next one fails with CUDA_ERROR_OUT_OF_MEMORY. After searching on Google, I finally found a solution: ‘CUDA_VISIBLE_DEVICES’.
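
(As an aside: another way to keep a single process from grabbing every GPU's memory is the per-process session config. This is only a rough sketch of that option, which I did not use here, and the surrounding names such as ‘server’ are assumed:)

# Per-process GPU options (an alternative I did not use here).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True          # allocate memory lazily
# config.gpu_options.visible_device_list = "1"  # or expose only one GPU ordinal
with tf.train.MonitoredTrainingSession(master=server.target,
                                       config=config) as sess:
    pass
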
First, change the ‘replica_device_setter’ so that every worker addresses ‘GPU:0’, which is what its single visible GPU will be enumerated as:

...
tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d/GPU:0" % FLAGS.task_index,
        cluster=cluster)
...

and then use this shell script to launch training processes:

CUDA_VISIBLE_DEVICES=0 python model.py train 0.9 0.0001 0.53 ps 0 &> ps.log &
sleep 1
for i in `seq 0 2`; do
  dev=`expr ${i} + 1`
  CUDA_VISIBLE_DEVICES=${dev} stdbuf -o0 python model.py train 0.9 0.0001 0.53 worker ${i} &> worker_${i}.log &
  sleep 1
done
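
A quick way to confirm the remapping (this check is my own addition, not part of model.py): inside each process only one GPU is visible, and it is always enumerated as device 0, which is why the device string above now says ‘GPU:0’.

import os

from tensorflow.python.client import device_lib

# With CUDA_VISIBLE_DEVICES=1 this should list only the CPU and a single GPU
# device, because the one visible GPU is re-enumerated as ordinal 0.
print("CUDA_VISIBLE_DEVICES = %s" % os.environ.get("CUDA_VISIBLE_DEVICES"))
print([d.name for d in device_lib.list_local_devices()])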

The ‘ps’ process will only see GPU 0, ‘worker0’ will only see GPU 1, ‘worker1’ will only see GPU 2, and so on.
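
The same restriction can also be set from inside the Python script rather than the shell, as long as it happens before TensorFlow initializes CUDA. A small sketch (the argument parsing and the GPU mapping here are assumed, not taken from model.py):

import os
import sys

# Must run before TensorFlow touches the GPUs (i.e. before creating the
# tf.train.Server or any session). Assumed mapping: ps -> GPU 0, worker i -> GPU i+1.
job_name = sys.argv[1]
task_index = int(sys.argv[2])
os.environ["CUDA_VISIBLE_DEVICES"] = "0" if job_name == "ps" else str(task_index + 1)

import tensorflow as tf  # imported only after the variable is set, to be safe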