TensorFlow Prism

TFPrism is a library that transforms your TensorFlow graph so that training is automatically data-parallel. To run your single-CPU TensorFlow code on a cluster, all you need to do is pass your training op and feed_dict through the library.

Example code

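import tensorflow as tf
import tfprism

# loss, init_op and batches come from your own single-machine model code.
train_step = tf.train.GradientDescentOptimizer(0.9).minimize(loss)

with tf.Session('grpc://mycluster.example.com:5600') as sess:
    # Pass the training op through the library to distribute it over the cluster.
    train_step, node_copier = tfprism.distribute_graph_on_all_tasks(train_step, sess)
    sess.run(init_op)

    for batch in batches:
        # Each batch is a feed_dict; pass it through the library before feeding it.
        sess.run(
            train_step,
            feed_dict=node_copier.mangle_feed_dict(batch))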
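
For readers new to the term, here is a rough sketch of what synchronous data parallelism means in general. It is plain Python, not tfprism's actual implementation: the toy model (y = w * x), the helper names and the learning rate are all assumptions made for illustration. Each "worker" computes a gradient on its own shard of the batch, and the averaged gradient is applied once.

```python
# Conceptual sketch only; none of these helpers are part of tfprism.

def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def split_batch(batch, num_workers):
    """Split a batch into num_workers equally sized shards.

    Assumes len(batch) is divisible by num_workers; leftovers are dropped.
    """
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]

def data_parallel_step(w, batch, num_workers, learning_rate):
    """One training step: per-shard gradients, averaged, applied once."""
    shards = split_batch(batch, num_workers)
    # On a real cluster each of these would be computed on a different worker.
    grads = [gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - learning_rate * avg_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, num_workers=2, learning_rate=0.05)
```

With equally sized shards, the averaged gradient equals the gradient over the whole batch, so the result matches single-machine training while the per-shard work can be spread over several machines.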

Installation

pip install .

Training server / cluster management

The example code above assumes that a TensorFlow cluster is already running a set of worker tasks and parameter server tasks, named “/job:worker” and “/job:ps” respectively. Setting this up by hand can be tedious, so if all you want is to quickly get a cluster up and running and parallelize your code, you can use the cluster management tool provided with tfprism.
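
For reference, a manual setup needs a cluster description along the following lines. This is a minimal sketch in plain Python, not what the tfprism tool does internally: the helper names, the hostnames and port 5601 are hypothetical. The resulting dict has the shape that TensorFlow 1.x accepts in tf.train.ClusterSpec, and the device names follow the “/job:<name>/task:<index>” convention.

```python
def make_cluster_spec(worker_hosts, ps_hosts):
    """Return a dict mapping job names to lists of "host:port" addresses."""
    return {"worker": list(worker_hosts), "ps": list(ps_hosts)}

def device_names(cluster_spec):
    """List the device name of every task, e.g. "/job:worker/task:0"."""
    return ["/job:%s/task:%d" % (job, index)
            for job in sorted(cluster_spec)
            for index in range(len(cluster_spec[job]))]

# Hypothetical hosts and ports, for illustration only.
cluster = make_cluster_spec(
    worker_hosts=["server1:5600", "server2:5600"],
    ps_hosts=["server1:5601"])

for name in device_names(cluster):
    print(name)
```

Every task in the cluster would then have to be started by hand with the same description and its own job name and task index, which is the tedious part.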

To install the cluster management tools, run

apt install parallel
pip install .[server]

on each node in your cluster. Once you have done so you can run

tfprism cluster start server1,server2,...serverN

to start your cluster. You need to be able to ssh without a password (using public key authentication) to all of the listed servers. After this you can connect to grpc://server1:5600 using TensorFlow.
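
Before opening a session, it can be handy to check that the server is accepting connections. The following is a small sketch using only the Python standard library; the helper names grpc_target and is_reachable are made up for this example and are not part of tfprism. It only tests that the TCP port is open, not that a TensorFlow server is actually behind it.

```python
import socket

def grpc_target(host, port=5600):
    """Build the session target string for a given host."""
    return "grpc://%s:%d" % (host, port)

def is_reachable(host, port=5600, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

If is_reachable("server1") returns True, the string from grpc_target("server1") can be passed to tf.Session as in the example code at the top.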