When we got our computing clusters we found that we could use some
tool to
It is a first (working) draft, and has some limitations.
The tool consists of three parts:
rsh
or ssh
to connect to the
remote machine.
dsh creates a file, and the remote process is only started once this file
exists. dsh also assumes that all files written by the remote process have
arrived when the file no longer exists.
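The file handshake described above can be sketched roughly as follows. This is only an illustration in Python, not the actual "waitfile" shell script; the function name `wait_for_file`, the timeout and the polling interval are assumptions made for this example.

```python
import os
import time


def wait_for_file(path, present, timeout=30.0, poll=0.2):
    """Poll until `path` exists (present=True) or is gone (present=False).

    Returns True when the wanted state is reached, False on timeout.
    The timeout and poll values are illustrative, not taken from dsh.
    """
    deadline = time.monotonic() + timeout
    while os.path.exists(path) != present:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

In this sketch the remote side would call `wait_for_file(path, True)` before starting the command, and the local side would call `wait_for_file(path, False)` after the child has terminated.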
It uses the Linux /proc
filesystem to determine the load and the
free memory. It has been tested with (non-SMP) Linux 2.0.29 and 2.0.32.
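As a rough illustration of reading these values, the sketch below parses the contents of /proc/loadavg and /proc/meminfo. It assumes that /proc/meminfo contains a "MemFree:" line with a value in kB; the function names are made up for this example and the daemon itself may read these files differently.

```python
def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_free_kb(text):
    """Return free memory in kB from the contents of /proc/meminfo.

    Assumes a line of the form 'MemFree:   12345 kB'.
    """
    for line in text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("no MemFree line found")


def read_state():
    """Read the current load and free memory of the local (Linux) machine."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        free_kb = parse_free_kb(f.read())
    return load, free_kb
```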
And there is the LSM entry.
Furthermore it assumes that all binaries and files are at the same places on the remote machine. This means the directory where it is executed should be the same as on the local machine.
It does not catch signals sent to the process. I/O redirection is not explicitly handled either, although it should work as before, since the shell takes care of it before invoking dsh.
The evaluation function in the daemon (as well as in the script) can easily be set to something else fitting your needs.
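To give an idea of what such an evaluation function could look like, here is a minimal sketch. It assumes that the reported load is multiplied by the machine's load weight from the cluster file and that the lowest product wins. The names `score` and `best_node` and the sample load values are invented for this example; the real function in the daemon may differ.

```python
def score(load, loadscale=1.0):
    """Weighted load: a machine with loadscale 2.0 counts as twice as loaded."""
    return load * loadscale


def best_node(nodes):
    """Pick the machine with the lowest weighted load.

    `nodes` maps a machine name to a (load, loadscale) tuple.
    """
    return min(nodes, key=lambda name: score(*nodes[name]))


# Hypothetical state reports: (current load, load weight)
cluster = {
    "newton.foo.bar": (0.9, 1.0),
    "galileo.foo.bar": (0.3, 2.0),
    "kepler.foo.bar": (0.5, 2.0),
}
chosen = best_node(cluster)
```

Swapping in another criterion, for example one that also looks at free memory, would only require changing `score`.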
This script can be tailored to run the daemon as user "nobody".
Thanks go to Alexander Schreiber for explaining to me where the 'stale NFS
file handle' comes from :-).
Thanks go to Tino Schwarze for discussions and comments. He wrote a similar
package, CLSH. There the daemon itself starts the remote process to avoid
rsh/ssh latency. This is a good point, but with my approach the daemons
can be run as nobody
and the users themselves have to set up the
usual privileges for rsh/ssh.
Also there is the
perfs perl script
(running on Suns, and I don't do perl anyway...) and the
beowulf cluster procps utilities.
The problem is that the procps utilities try
to get the current values each time any of the cluster utilities is
invoked. This is horribly slow, esp. if one machine is down.
In my approach the local daemon takes care of that with UDP packets.
If it doesn't receive any, the machine does not exist for the daemon.
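The bookkeeping behind this can be sketched as a small table that forgets machines which stay silent for too long. The class below is only an illustration: the name `NodeTable`, the method `tick` and the default value of `maxloops` are assumptions, and the actual UDP receiving is left out.

```python
class NodeTable:
    """Track the last reported load per machine and drop silent machines."""

    def __init__(self, maxloops=3):
        self.maxloops = maxloops  # illustrative default, not the daemon's value
        self.load = {}            # machine name -> last reported load
        self.missed = {}          # machine name -> loops without a report

    def tick(self, reports):
        """Process one loop; `reports` maps name -> load received this loop."""
        for name, load in reports.items():
            self.load[name] = load
            self.missed[name] = 0
        for name in list(self.missed):
            if name not in reports:
                self.missed[name] += 1
                if self.missed[name] >= self.maxloops:
                    # No report for maxloops loops: the machine might be down.
                    del self.load[name]
                    del self.missed[name]
```

A machine that starts reporting again simply reappears in the table on the next call to `tick`.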
The Mosix approach goes even
further, but there you have a kernel patch etc., and it is not ready for Linux.
Usage: dshd [-f clusterdesc] [-b] [-p] [-t txport] [-r rxport]
  -f filename = location of cluster file
  -b          = fork to background
  -p          = print own pid on startup (only if background)
  -t txport   = use other port than 8181 for full status
  -r rxport   = use other port than 8282 for best node report
dshd -b
starts the daemon in the background. It reads the
etcfile specified in the
script (or with the -f etcfile
option) that defines the cluster.
The cluster file looks like:

newton.foo.bar 1.0 1.0
galileo.foo.bar 2.0 2.0
kepler.foo.bar 2.0

The first number after the node name is the load weight. In this example it means that galileo and kepler are half as fast as the newton machine, i.e. they need double the time. The second number is a memory weight similar to the load weight, except that a higher value means faster RAM. The weights are optional.
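A parser for this format could look roughly like the sketch below. The function name `parse_cluster` and the default weight of 1.0 for omitted values are assumptions made for this example; the daemon may handle missing weights differently.

```python
def parse_cluster(text):
    """Parse a cluster description, one machine per line:

        machinename [loadscale [memscale]]

    Returns a dict mapping machine name -> (loadscale, memscale).
    Missing weights default to 1.0 (an assumption of this sketch).
    """
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        loadscale = float(fields[1]) if len(fields) > 1 else 1.0
        memscale = float(fields[2]) if len(fields) > 2 else 1.0
        nodes[fields[0]] = (loadscale, memscale)
    return nodes


example = """newton.foo.bar 1.0 1.0
galileo.foo.bar 2.0 2.0
kepler.foo.bar 2.0
"""
cluster = parse_cluster(example)
```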
The daemon sends its state to those machines and accepts state reports from this cluster only.
dsh command
This connects to the local daemon and remote-executes the command on the least-loaded machine.
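The two steps, asking the daemon and building the remote call, might be sketched as follows. Several things here are assumptions: that the daemon answers on TCP port 8282 and then closes the connection, that the first whitespace-separated word of its reply is the machine name, and the helper names `ask_best_node` and `build_remote_command`. The real reply format is not described on this page.

```python
import shlex
import socket


def ask_best_node(host="localhost", port=8282, timeout=5.0):
    """Connect to the daemon and return the first word of its reply.

    Assumes the daemon sends its answer and then closes the connection.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    words = b"".join(chunks).decode("ascii", "replace").split()
    if not words:
        raise RuntimeError("empty reply from daemon")
    return words[0]


def build_remote_command(node, cwd, argv, remote_shell="rsh"):
    """Build the argument list that runs `argv` in directory `cwd` on `node`."""
    remote = "cd " + shlex.quote(cwd) + " && " + shlex.join(argv)
    return [remote_shell, node, remote]
```

The resulting list could then be handed to `subprocess.call`; passing `remote_shell="ssh"` would switch to ssh.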
This package is distributed under the GNU General Public License.
# dshd v0.1.0 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This daemon runs in the background on each computer in a cluster.
# The cluster is defined in the file etcname (see below)
# The format is one machine per line with
# machinename loadscale memscale
# where loadscale and memscale are multiplied with the respective
# load and mem values before evaluation.
# The daemon sends its state information (load, mem) to all machines
# in the cluster. Then it tries to receive the information
# from the other machines. If it does not receive a state info during
# maxloops loops it removed the machine from the list - it might be down.
#
# Telnetting to stport gives the state info of the complete cluster
# Telnetting to dport gives the state of the best machine. The load
# of this machine is locally increased (extraload) to handle the latency
# between starting and the new state info to be received.
#
# This is not particular an example of good programming.
# I am especially unexperienced with socket programming, so this might
# be improvable.
# Also there may be memory leaks that I did not find.
#
# Further possible improvements:
# - cluster definition also by broadcast addresses
# - include memory value in evaluation
# - make evaluation function more flexible
#from the daemon and from the script
# dsh v0.1.1 (c) Andre Fachat
# distributed under GPL (this is too small to include a copy, go to
# www.gnu.org to get a copy or refer to your favorite GNU program for the
# file COPYING)
#
# This handy script uses the dshd daemon to find the currently least
# loaded machine in a cluster. It then distributes the command given
# to this machine (via rsh or ssh). The directory where dsh is started must
# be at the same place on the remote machine.
# To avoid NFS problems a temporary file is created by the local
# process and the remote process waits for it to exist
# (needs the "waitfile" shell script). After completion the remote
# process removes the file and exits.
# The local process waits for the child to terminate and then waits
# for the temp file to disappear, to be sure all NFS stuff has been done.
#
# possible improvements:
# - catch SIGINT and send to remote process
# - own cmdline options for verbosity (print remote host name) etc
#
It is not (yet :-) perfect. Sometimes NFS seems to cause weird problems that have not yet been solved.
The scripts have been tested with Linux 2.0.29 as cluster machines and a Sun Ultra with Solaris 2.5 as NFS server.
One word about clustered compiling:
I tried to compile the Linux kernel on the cluster. However, one of the machines was preloaded with 0.9 load already, and two seemed to have gone swapping due to other memory-intensive stuff... I tried three runs on a single machine (with make, make -j2 and make -j4) and three runs on the cluster, with 2, 4 and 8 parallel compilers. I simply redefined MAKE and CC in the toplevel Makefile to "make -j2" and "dsh.py gcc" respectively.

make:               5m47
make -j2:           5m13
make -j4:           5m12
dsh.py, make -j2:   8m38  (didn't do version.o)
dsh.py, make -j4:   5m44  (worked with 0.1.0, didn't do version.o with 0.1.1)
dsh.py, make -j8:   4m37 and 4m48 on two runs with 0.1.1

The load on the NFS server increased (from practically 0.0 to 0.15...). BTW, the network is 10 MBit ethernet (10BaseT) via a hub. xosview showed that the main machine was mostly doing network stuff (the IRQ for the network card was almost always on and the CPU was doing a lot in the system). I guess normal compiles are just too short and too NFS-dependent to distribute them effectively. However, I don't know where the version.o thing comes from...
Another word: the daemons communicate with each other approximately every 1.5 seconds. This might increase your network load! Currently there is no feature to use broadcasts.
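To get a feeling for the numbers, the following back-of-the-envelope sketch estimates how many state reports cross the network per second. It assumes that every daemon sends exactly one report to every other machine per interval, which is a simplification; the function name is made up for this example.

```python
def reports_per_second(n_nodes, interval=1.5):
    """Estimated state reports per second in a cluster of n_nodes machines.

    Assumes each daemon sends one report to each of the other
    n_nodes - 1 machines every `interval` seconds (no broadcasts).
    """
    return n_nodes * (n_nodes - 1) / interval


# Example: a cluster of 8 machines
rate = reports_per_second(8)
```

The count grows with the square of the cluster size, which is why broadcasts would help on larger clusters.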
Contents last modified 03 Aug 1998