Configuring a Distributed Worker Pool

Before your program can perform a distributed optimization task, you'll need to identify a set of machines to use as your distributed workers. Ideally these machines should give very similar performance. Identical performance is best, especially for distributed tuning, but small variations in performance won't hurt your overall results too much.

Setting Up the Distributed Workers

Once you've identified your distributed worker machines, you'll need to start Gurobi Remote Services on these machines. Instructions for setting up Gurobi Remote Services can be found in the Gurobi Quick Start Guide. As noted in the Quick Start Guide, run the following command to make sure a machine is available to be used as a distributed worker:

> gurobi_cl --server=machine --status
(replace machine with the name or IP address of your machine). If you see Distributed Worker listed among the set of available services...
Gurobi Remote Services (version 6.5.0) functioning normally
Available services: Distributed Worker
then that machine is good to go.
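
If you have many candidate machines, this check can be scripted. The following is a minimal sketch, assuming you have already captured the text printed by the status command above. The function name offers_distributed_worker is hypothetical, and the parsing relies only on the "Available services:" line shown in the sample output; the comma separator for multiple services is an assumption.

```python
def offers_distributed_worker(status_output: str) -> bool:
    """Return True if the status text lists the Distributed Worker service."""
    prefix = "Available services:"
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Assumption: multiple services are separated by commas.
            services = [s.strip() for s in line[len(prefix):].split(",")]
            return "Distributed Worker" in services
    # No "Available services:" line was found at all.
    return False


sample = (
    "Gurobi Remote Services (version 6.5.0) functioning normally\n"
    "Available services: Distributed Worker\n"
)
print(offers_distributed_worker(sample))
```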

We should reiterate a point that is raised in the Quick Start Guide: you do not need a Gurobi license to run Gurobi Remote Services on a machine. Some services are only available with a license (e.g., Compute Server). However, any machine that is running Gurobi Remote Services will provide the Distributed Worker service.

The Distributed Manager Machine

Once you have identified a set of distributed worker machines, you'll need to choose a manager machine. This is the machine where your application actually runs. In addition to building the optimization model, your manager machine will coordinate the efforts of the distributed workers during the execution of the distributed algorithm.


Note that once the distributed algorithm completes, only the manager retains any information about the solution. The distributed workers go off to work on other things.

You'll need to choose a manager machine that is licensed to run the distributed algorithms. You'll see a DISTRIBUTED= line in your license file if distributed algorithms are enabled.

Note that, by default, the manager does not participate in the distributed optimization. It simply coordinates the efforts of the distributed workers. If you would like the manager to also act as one of the workers, you'll need to start Gurobi Remote Services on the manager machine as well.


The workload associated with managing the distributed algorithm is quite light, so a machine can handle both the manager and worker role without degrading performance.

Note that we only allow a machine to act as manager for a single distributed job. If you want to run multiple distributed jobs simultaneously, you'll need multiple manager machines.

Specifying the Distributed Worker Pool

If you'd like to invoke a distributed algorithm from your application, you'll need to provide the names of the distributed worker machines. You do this by setting the WorkerPool parameter (refer to the Gurobi Parameter section for information on how to set a parameter). The parameter should be set to a string that contains a comma-separated list of either machine names or IP addresses. For example, you might use the following in your gurobi_cl command line:

> gurobi_cl WorkerPool=server1,server2,server3 ...

If you have set up an access password on the distributed worker machines, you'll need to provide it through the WorkerPassword parameter. All machines in the worker pool must have the same access password.
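
When the list of worker machines lives in a script or configuration file, the parameter values can be assembled programmatically. The sketch below is illustrative only: the helper name worker_pool_args is hypothetical, and it simply produces Parameter=value strings in the form used on the gurobi_cl command line above.

```python
def worker_pool_args(hosts, password=None):
    """Build Parameter=value arguments describing a distributed worker pool."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [h.strip() for h in hosts if h.strip()]
    if not cleaned:
        raise ValueError("at least one worker machine is required")
    # WorkerPool is a comma-separated list of machine names or IP addresses.
    args = ["WorkerPool=" + ",".join(cleaned)]
    if password is not None:
        # All machines in the pool must share this access password.
        args.append("WorkerPassword=" + password)
    return args


print(" ".join(worker_pool_args(["server1", "server2", "server3"])))
```

The resulting strings can be appended to a gurobi_cl or grbtune command line, followed by the job-count parameter and the model file.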

Note that providing a list of available workers is strictly a configuration step. Your program won't actually use any of the distributed algorithms unless it specifically requests them. Instructions for doing so are next.

Requesting a Distributed Algorithm

Once you've set the WorkerPool parameter to the appropriate value, your final step is to set the ConcurrentJobs, DistributedMIPJobs, or TuneJobs parameter. These parameters indicate how many distinct distributed worker jobs you would like to start. For example, if you set TuneJobs to 2 in grbtune...

> grbtune WorkerPool=server1,server2 TuneJobs=2 misc07.mps
...you should see the following output in the log...
Started distributed worker on server1
Started distributed worker on server2

Distributed tuning: launched 2 distributed worker jobs
This output indicates that two jobs have been launched, one on machine server1 and the other on machine server2. These two jobs will continue to run until your tuning run completes.

Similarly, if you launch distributed MIP...

> gurobi_cl WorkerPool=server1,server2 DistributedMIPJobs=2 misc07.mps
...you should see the following output in the log...
Started distributed worker on server1
Started distributed worker on server2

Distributed MIP job count: 2

Note that, in most cases, each machine runs one distributed worker job at a time. Distributed workers are allocated on a first-come, first-served basis, so if multiple users are sharing a set of distributed worker machines, you should be prepared for the possibility that some or all of them may be busy when the manager requests them. The manager will grab as many as it can, up to the requested count. If none are available, it will return an error.
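
The allocation rule described above can be pictured with a toy model. This is not Gurobi code: the function allocate_workers and its arguments are hypothetical, and it only mirrors the stated behavior (one job per machine, take as many idle machines as possible up to the requested count, fail if none are idle).

```python
def allocate_workers(requested, idle_machines):
    """Toy model: grant up to `requested` idle machines, one job per machine."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    # Take as many idle machines as possible, up to the requested count.
    granted = list(idle_machines)[:requested]
    if not granted:
        # Mirrors the documented behavior: no workers available is an error.
        raise RuntimeError("no distributed workers available")
    return granted


# Two jobs requested, but only server2 is idle: the manager proceeds with one.
print(allocate_workers(2, ["server2"]))
```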

Compute Server Considerations

If you have one or more Gurobi Compute Servers, you can use them for distributed optimization as well. Compute Servers offer a lot more flexibility than distributed workers, though, so they require a bit of additional explanation.

The first point you should be aware of is that one Compute Server can actually host multiple distributed worker jobs. Compute Servers allow you to set a limit on the number of jobs that can run simultaneously. Each of those jobs can be a distributed worker. For example, if you have a pair of Compute Servers, each with a job limit of 2, then issuing the command...

> gurobi_cl DistributedMIPJobs=3 WorkerPool=server1,server2 misc07.mps
...would produce the following output...
Started distributed worker on server1
Started distributed worker on server2
Started distributed worker on server1
Compute Server assigns a new job to the machine with the most available capacity, so assuming that the two servers are otherwise idle, the first distributed worker job would be assigned to server1, the second to server2, and the third to server1.
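
The assignment order in this example can be reproduced with a small simulation. This is a sketch of the "most available capacity" rule as described here, not Compute Server's actual scheduler. The function assign_jobs is hypothetical, and breaking ties in favor of the server listed first is an assumption chosen to match the output above.

```python
def assign_jobs(job_limits, num_jobs):
    """Assign each job to the server with the most free job slots."""
    free = dict(job_limits)  # remaining capacity per server
    order = []
    if not free:
        return order
    for _ in range(num_jobs):
        # max() returns the first maximal key, so ties go to the
        # server listed first (an assumption, see the lead-in).
        best = max(free, key=free.get)
        if free[best] == 0:
            break  # every server has reached its job limit
        free[best] -= 1
        order.append(best)
    return order


# Two otherwise idle servers, each with a job limit of 2, and three jobs.
print(assign_jobs({"server1": 2, "server2": 2}, 3))
```

Under these assumptions the sketch yields server1, server2, server1, matching the log lines shown above.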

Another point to note is that, if you are working in a Compute Server environment, it is often better to use the Compute Server itself as the distributed manager, rather than the client machine. This is particularly true if the Compute Server and the workers are physically close to each other, but physically distant from the client machine. In a typical environment, the client machine will offload the Gurobi computations onto the Compute Server, and the Compute Server will then act as the manager for the distributed computation.

To give an example, running the following command on machine client1:

> gurobi_cl --server=server1 WorkerPool=server1,server2 DistributedMIPJobs=2 misc07.mps
...will lead to the following sequence of events...
  • The model will be read from the disk on client1 and passed to Compute Server server1.
  • Machine server1 will act as the manager of the distributed optimization.
  • Machine server1 will start two distributed worker jobs, one that also runs on server1 and another that runs on server2.

Compute Server provides load balancing among multiple machines, so it is common for the user to provide a list of available servers when a Gurobi application starts. We'll automatically copy this list into the WorkerPool parameter. Of course, you can change the value of this parameter in your program, but the default behavior is to draw from the same set of machines for the distributed workers. Thus, the following command would be equivalent to the previous command:

> gurobi_cl --server=server1,server2 DistributedMIPJobs=2 misc07.mps
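
The defaulting rule can be expressed as a short sketch. The helper effective_worker_pool is hypothetical and only models the behavior described above: an explicit WorkerPool setting wins, and otherwise the server list is used.

```python
def effective_worker_pool(args):
    """Return the worker pool implied by a list of gurobi_cl arguments."""
    servers = None
    pool = None
    for arg in args:
        if arg.startswith("--server="):
            servers = arg[len("--server="):]
        elif arg.startswith("WorkerPool="):
            pool = arg[len("WorkerPool="):]
    # An explicit WorkerPool takes precedence over the copied server list.
    return pool if pool is not None else servers


explicit = ["--server=server1", "WorkerPool=server1,server2",
            "DistributedMIPJobs=2", "misc07.mps"]
implicit = ["--server=server1,server2", "DistributedMIPJobs=2", "misc07.mps"]
print(effective_worker_pool(explicit) == effective_worker_pool(implicit))
```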

Please refer to the next section for more information on using a Gurobi Compute Server.