Parallel Execution Utility - dsh 0.1.3

(C) A. Fachat <a.fachat@physik.tu-chemnitz.de>


When we got our computing clusters we found that we needed a tool to automatically distribute processes to the least-loaded machine. This is exactly what this tool does: it takes a command and tries to run it on the least-loaded machine in a Linux cluster.

It is a first (working) draft, and has some limitations.

The tool consists of three parts:

  1. A daemon (Python 1.5 script) that keeps track of the current load state of each cluster machine. The daemons across the network communicate with each other. They know of each other by means of a cluster description file, which holds one node name (+ relative load factor, + relative memory factor) per line. A daemon need not communicate with all nodes; with this file you can, for example, restrict the communication to the neighbouring nodes.
    When a client connects to one TCP port, this daemon returns a list of machines with their respective loads. On another TCP port it returns the best machine according to its internal evaluation function. At the same time this machine is given an additional, virtual load for a short time. This accounts for latency, i.e. the time from starting the process until the new load has been reported back to the daemon.
  2. A script (Python 1.5 script) that connects to either port of the daemon and uses the returned list for its own evaluation. The given command is then sent to the best machine, using rsh or ssh to connect to it.
  3. A helper script (/bin/sh script) that has to be present on the remote machine. It handles the file locking that is used to work around NFS latency: dsh creates a file, and the remote process is started only once this file exists. dsh also assumes that all files written by the remote process have arrived once the file no longer exists.
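The handshake in part 3 can be sketched in Python as follows (a minimal sketch; the function name, polling interval, and timeout are my own, not taken from the actual scripts):

```python
import os
import time

def wait_for_file(path, present=True, timeout=5.0, interval=0.05):
    """Poll until `path` exists (present=True) or is gone (present=False).

    Returns True on success, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path) == present:
            return True
        time.sleep(interval)
    return False

# Local side: create the lock file, then start the remote command.
# Remote side ("waitfile"): block until the lock file exists, run the
# command, then remove the file when done.
# Local side: wait for the file to disappear again before assuming that
# all of the remote process's writes have reached the NFS server.
```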
You need Python (1.5 or better) for dsh and dshd.

It uses the Linux /proc filesystem to determine the load and the free memory. It has been tested with (non-SMP) Linux 2.0.29 and 2.0.32. There is also the LSM entry.
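For illustration, reading those two /proc values can be sketched like this (the field layout shown is that of current kernels; 2.0-era /proc/meminfo looked slightly different, and the function names are my own):

```python
def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])

def parse_free_mem_kb(text):
    """Return MemFree (in kB) from the contents of /proc/meminfo.

    Assumes the modern 'MemFree:   12345 kB' line format.
    """
    for line in text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("no MemFree line found")

# On a Linux machine:
#   load = parse_loadavg(open("/proc/loadavg").read())
#   free = parse_free_mem_kb(open("/proc/meminfo").read())
```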

Furthermore it assumes that all binaries and files are at the same places on the remote machine. This means the directory where dsh is executed must exist at the same path on the remote machine.

It does not catch signals sent to the process. I/O redirection is also not explicitly handled, although it should work as usual, since the shell takes care of it before invoking dsh.

The evaluation function in the daemon (as well as in the script) can easily be changed to something else that fits your needs.
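A replacement evaluation function might look like the following (the formula is my own, for illustration; the one actually shipped in dshd may differ):

```python
def node_score(load, loadscale, free_mem_kb, memscale):
    """Lower is better: scaled load, minus a small bonus for free memory."""
    return load * loadscale - memscale * free_mem_kb / 1e6

def best_node(states):
    """Pick the node with the lowest score from a dict mapping
    name -> (load, loadscale, free_mem_kb, memscale)."""
    return min(states, key=lambda name: node_score(*states[name]))
```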


What's new?


Acknowledgements

Thanks go to Alexander Schreiber for explaining to me where the 'stale NFS file handle' comes from :-). Thanks go to Tino Schwarze for discussions and comments. He wrote a similar package, CLSH, where the daemon itself starts the remote process to avoid the rsh/ssh latency. This is a good point, but with my approach the daemons can run as nobody, and the users themselves set up the usual privileges for rsh/ssh. There is also the perfs Perl script (running on Suns, and I don't do Perl anyway...) and the Beowulf cluster procps utilities. The problem with the procps utilities is that they try to fetch the current values each time any of the cluster utilities is invoked. This is horribly slow, especially if one machine is down. In my approach the local daemon takes care of that with UDP packets; if it doesn't receive any from a machine, that machine simply does not exist for the daemon. The Mosix approach goes even further, but it requires a kernel patch etc., and it is not ready for Linux.


Usage

Usage: dshd [-f clusterdesc] [-b] [-p] [-t txport] [-r rxport]
  -f filename    = location of cluster file
  -b             = fork to background
  -p             = print own pid on startup (only if background)
  -t txport      = use other port than  8181  for full status
  -r rxport      = use other port than  8282  for best node report
dshd -b starts the daemon in the background. It reads the cluster description file specified in the script (or with the -f option) that defines the cluster. The cluster file looks like this:
newton.foo.bar 1.0 1.0
galileo.foo.bar 2.0 2.0
kepler.foo.bar 2.0
The first number behind the node name is the load weight. It means that galileo and kepler in this example are half as fast as the newton machine - they need double the time. The second number is a memory weight similar to the load weight, except that here a higher value means faster RAM. The weights are optional.
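Parsing this format is straightforward; the sketch below assumes that omitted weights default to 1.0 (the default used by the real scripts is not stated here):

```python
def parse_cluster_file(text):
    """Parse 'nodename [loadscale [memscale]]' lines into a dict."""
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        name = parts[0]
        loadscale = float(parts[1]) if len(parts) > 1 else 1.0
        memscale = float(parts[2]) if len(parts) > 2 else 1.0
        nodes[name] = (loadscale, memscale)
    return nodes
```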

The daemon sends its state to these machines and accepts state reports only from machines in this cluster.
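The state exchange can be sketched as a plain UDP send (the wire format and the UDP port below are assumptions of mine; the actual dshd protocol is not documented here):

```python
import socket

def send_state(peers, load, free_mem_kb, port=8383):
    """Send one 'load free_mem' datagram to every peer in the cluster.

    Port 8383 and the textual payload format are hypothetical.
    """
    payload = ("%.2f %d" % (load, free_mem_kb)).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for host in peers:
            sock.sendto(payload, (host, port))
    finally:
        sock.close()
```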

    dsh command
This connects to the local daemon and remote-executes the command on the least-loaded machine.
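What dsh does on the wire can be approximated like this (a sketch; it assumes the daemon answers with a single line holding the node name, which is my reading of the description, not a documented format):

```python
import socket

def query_best_node(host="localhost", port=8282, timeout=2.0):
    """Connect to dshd's best-node port and read one reply line."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        reply = sock.makefile("r").readline()
        return reply.strip()
    finally:
        sock.close()
```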


Copying Policy

This package is distributed under the GNU General Public License.


Download


Documentation and random comments

Here are some comments from the files:
# dshd v0.1.0 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This daemon runs in the background on each computer in a cluster.
# The cluster is defined in the file etcname (see below)
# The format is one machine per line with
#    machinename loadscale memscale
# where loadscale and memscale are multiplied with the respective
# load and mem values before evaluation.
# The daemon sends its state information (load, mem) to all machines
# in the cluster. Then it tries to receive the information
# from the other machines. If it does not receive a state info during
# maxloops loops it removes the machine from the list - it might be down.
#
# Telnetting to stport gives the state info of the complete cluster
# Telnetting to dport gives the state of the best machine. The load
# of this machine is locally increased (extraload) to handle the latency
# between starting and the new state info to be received.
#
# This is not a particularly good example of programming.
# I am especially inexperienced with socket programming, so this might
# be improvable.
# Also there may be memory leaks that I did not find.
#
# Further possible improvements:
# - cluster definition also by broadcast addresses
# - include memory value in evaluation
# - make evaluation function more flexible
#
from the daemon, and these from the script:
# dsh v0.1.1 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This handy script uses the dshd daemon to find the currently least
# loaded machine in a cluster. It then distributes the command given
# to this machine (via rsh or ssh). The directory where dsh is started must
# be at the same place on the remote machine.
# To avoid NFS problems a temporary file is created by the local
# process and the remote process waits for it to exist
# (needs the "waitfile" shell script). After completion the remote
# process removes the file and exits.
# The local process waits for the child to terminate and then waits
# for the temp file to disappear, to be sure all NFS stuff has been done.
#
# possible improvements:
# - catch SIGINT and send to remote process
# - own cmdline options for verbosity (print remote host name) etc
#

It is not (yet :-) perfect. Sometimes NFS seems to cause weird problems that have not been solved yet.

The scripts have been tested with Linux 2.0.29 as cluster machines and a Sun Ultra with Solaris 2.5 as NFS server.

One word about clustered compiling:

I tried to compile the Linux kernel on the cluster. However, one of
the machines already had a load of 0.9, and two seemed to have started
swapping due to other memory-intensive stuff...

I tried three runs on a single machine (with make, make -j2 and make -j4)
and three runs on the cluster, with 2, 4 and 8 parallel compilers.
I simply redefined MAKE and CC in the top-level Makefile to
"make -j2" and "dsh.py gcc" respectively.

make:                   5m47
make -j2:               5m13
make -j4:               5m12
dsh.py, make -j2        8m38  didn't do version.o
dsh.py, make -j4        5m44  worked with 0.1.0, didn't do version.o with 0.1.1
dsh.py, make -j8        4m37 and 4m48 on two runs with 0.1.1

The load on the NFS server increased (from practically 0.0 to 0.15...).
BTW, the network is 10 MBit Ethernet (10BaseT) via a hub.
xosview showed that the main machine was mostly doing network work
(the IRQ for the network card was almost always active and the CPU spent
a lot of time in the system).

I guess normal compiles are just too short and too NFS-dependent to
distribute them effectively.

However, I don't know where the version.o thing comes from...

Another word: the daemons communicate with each other approximately every 1.5 seconds. This might increase your network load! Currently there is no feature to use broadcasts.


Contents last modified 03 Aug 1998