
LAM FAQ: Typical Setup of LAM

Table of contents:
  1. Can LAM be used with multi-threaded user code?
  2. What setup does each LAM user need?
  3. How does one typically use LAM?
  4. Does LAM need to be booted to compile LAM programs?
  5. Do I need to install LAM on all nodes in my cluster?
  6. Do I have to use the same version of LAM/MPI everywhere?
  7. Should I run LAM as a root-level service for all my users to access?
  8. How should I set up LAM for multiple users?
  9. Do I need a common filesystem on all my nodes?
  10. Why isn't LAM_SESSION_PREFIX distributed to all nodes?
  11. Can LAM be used with AFS?
  12. Can LAM be used with ssh? More to the point -- does LAM have to use rsh?
  13. What directory do I install LAM to?



1. Can LAM be used with multi-threaded user code?

Yes, but LAM is not thread safe. Unfortunately, making LAM thread safe will require a major redesign and overhaul of the implementation. We're working on it, but it will be a long time before it is ready.

But LAM can be used in multi-threaded applications. The general rule of thumb for using a non-thread-safe library in a multi-threaded application is to restrict all calls to the library to a single thread. That is, create an "MPI thread" that is the only thread that interacts with LAM. This approach has been shown to work adequately well; it can be implemented with incoming and outgoing queues for the MPI thread -- other threads place messages on (and remove messages from) these queues while the MPI thread performs the actual message passing "in the background."

An alternative approach is a "global MPI mutex": a single mutex that locks access to LAM. Any thread that wishes to access LAM/MPI must first obtain this lock, which ensures that only one thread accesses LAM at a time. This is effectively what setting the MPI thread level to MPI_THREAD_SERIALIZED does; LAM/MPI itself will ensure that if multiple threads invoke MPI function calls simultaneously, only one is actually allowed in the MPI library at a time. Be aware, however, that this technique can lead to deadlock if you are not careful -- putting a single lock around all MPI calls specifically precludes any concurrency within the LAM library. The programmer must be aware of this, and must realize that multiple blocking calls to MPI can still cause deadlock.

This approach has received mixed reviews on the LAM mailing list.
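The following is a minimal C sketch of the serialized/mutex approach. It is illustrative only: the wrapper function, lock variable, and message parameters are not part of LAM/MPI, and error handling is largely omitted.

    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    /* One global lock guards every entry into the MPI library */
    static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Example wrapper: any thread may call this, but only one at a
       time will actually be inside MPI_Send() */
    static void locked_send(void *buf, int count, int dest, int tag)
    {
        pthread_mutex_lock(&mpi_lock);
        MPI_Send(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
        pthread_mutex_unlock(&mpi_lock);
    }

    int main(int argc, char *argv[])
    {
        int provided;

        /* Request serialized access; check what was actually granted */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        if (provided < MPI_THREAD_SERIALIZED) {
            fprintf(stderr, "MPI_THREAD_SERIALIZED not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... spawn threads that use locked_send() and similar wrappers;
           remember that two threads blocked in matching MPI calls can
           still deadlock, because the lock prevents concurrent progress */

        MPI_Finalize();
        return 0;
    }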

LAM/MPI must be compiled with the same threading flags as your user application. These vary from compiler to compiler; for example, on Solaris with the Sun Workshop/Forte compilers, the -mt flag is needed to build multi-threaded applications.



2. What setup does each LAM user need?

Most boot modules require two main things to be set up in each LAM user's environment (check the LAM/MPI User's Guide for the specific requirements of each boot module):

  1. The directory for the LAM executables needs to be in their path on all machines that LAM is to be used on. This is typically set in the user's $HOME/.cshrc, $HOME/.profile, $HOME/.bashrc, or other shell startup file.

    The path must be set before the startup script exits for non-interactive shells. For example, if your startup script has a line similar to the following:

    if ($?USER == 0 || $?prompt == 0) exit
    

    then you need to ensure that the LAM binary directory is added to your path before this line.
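    For example (assuming a hypothetical installation prefix of /usr/local/lam), the relevant portion of a .cshrc might read:

    # Add the LAM binaries to the path for ALL shells -- including
    # non-interactive ones -- BEFORE the early-exit test:
    set path = (/usr/local/lam/bin $path)

    if ($?USER == 0 || $?prompt == 0) exit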

  2. The user needs to be able to execute commands on remote nodes without being prompted for a password, and with no extraneous output on stderr.

    LAM uses rsh by default (although this can be overridden, we will use it here for example purposes) -- the user must be able to run programs on remote nodes via rsh:

    shell$ rsh othernode.example.com uptime
    3:45pm up 133 day(s), load average: 0.05, 0.12, 0.25
    

    Notice that the user was not prompted for a password on the machine othernode.example.com, nor did any extraneous output appear (particularly extraneous stderr output).

    Specifically, the user must be able to run LAM executables on remote nodes via rsh (i.e., not just uptime), but if item #1 is taken care of properly, this will work as well.

The recon tool is good for checking that the user's environment is set up properly; it checks both of these items as well as a few other things.
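For example, with the rsh boot module and a hypothetical boot schema file named my_hostfile, a verbose check might look like:

shell$ recon -v my_hostfile

If recon completes without errors, lamboot is very likely to succeed as well.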



3. How does one typically use LAM?

LAM is a daemon-based implementation of MPI. This means that a daemon process is launched on each machine that will be in the parallel environment. Once the daemons have been launched, LAM is ready to be used. A typical usage scenario is as follows:

  • Boot LAM on all the nodes (with the lamboot command)
  • Run MPI programs (with the mpirun command)
  • Optionally "clean" the LAM/MPI environment (with the lamclean command)
  • Shut down LAM (with the lamhalt command)

LAM is a user-based MPI environment; each user who wishes to use LAM must boot their own LAM environment. LAM is not a client-server environment where a single LAM daemon can service all LAM users on a given machine. There are no future plans to make LAM client-server oriented.

As a side-effect of this design, each user must have an account on each machine that they wish to use LAM on.

It is a common misconception that you need to lamboot/mpirun/lamhalt for every program. This is not true.

You only need to lamboot once. You can then mpirun/lamclean as often as you wish. When you are finished with MPI, you can lamhalt once to remove LAM from all machines.

Use of the lamhalt command is preferred over the older wipe command; lamhalt is considerably faster and does not require the user to specify a hostfile.
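A complete session might therefore look like the following (the hostfile and program names are hypothetical):

shell$ lamboot my_hostfile
shell$ mpirun -np 4 my_program
shell$ mpirun -np 4 my_other_program
shell$ lamclean
shell$ lamhalt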



4. Does LAM need to be booted to compile LAM programs?

No. The compilation of LAM programs is completely independent of the run-time environment necessary for running LAM/MPI programs.



5. Do I need to install LAM on all nodes in my cluster?

Short answer: yes.

More complicated answer: LAM needs to be available on all nodes in your cluster. You can do this by physically installing LAM on all nodes in your cluster or by using a networked filesystem to make a single LAM installation available on all nodes in the cluster.

See the FAQ question "Do I need a common filesystem on all my nodes?" for more details on the networked filesystem approach.



6. Do I have to use the same version of LAM/MPI everywhere?

YES!!

Things change between versions of LAM/MPI -- different versions of LAM/MPI do not play well with each other.

LAM/MPI is intended to be source code compatible with a user application between multiple versions of LAM/MPI -- no other guarantees are provided. That is, the MPI API is fixed and will not change; the actual implementation of the MPI API is not. Specifically, the back-end implementation is free to completely change between versions (and frequently does). As such, you must absolutely guarantee that the same version of LAM is being used on all nodes when trying to lamboot, mpicc, mpirun, etc.

A not-so-obvious side effect of this is that user applications must be recompiled and relinked when a new version of LAM/MPI is installed. Hence, not only do LAM commands have to be consistent (in terms of version), user applications must also be compiled for a specific version of LAM. For example, if a user application is compiled with mpicc from LAM version a.b.c, attempting to run that user application with the mpirun from LAM version d.e.f will fail.

Note, however, that this does not preclude having multiple versions of LAM/MPI installed on a single cluster. Each user must simply ensure that their $PATH is set consistently across the cluster to access a single version at any given time, such that all the LAM/MPI commands used (lamboot, mpicc, mpirun, etc.) come from that same version.

It is not uncommon for advanced users to have multiple versions of LAM/MPI installed, and change their $PATH accordingly to access the different versions.
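For example, with two hypothetical installations under /usr/local, a user could select one per login session (Bourne-shell syntax; the same value must be set on every node):

shell$ export PATH=/usr/local/lam-7.1.1/bin:$PATH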



7. Should I run LAM as a root-level service for all my users to access?

No. It is a Very Bad Idea to run the LAM executables as root.

LAM was designed to be run by individual users; it was not designed to be run as a root-level service where multiple users use the same LAM daemons in a client-server fashion. LAM should be booted by each individual user who wishes to run MPI programs. There are a wide array of security issues when root runs a service-level daemon; LAM does not even attempt to address any of these issues.

Especially with today's propensity for hackers to scan for root-owned network daemons, it could be tragic to run this program as root. While LAM is known to be quite stable, and LAM does not leave network sockets open for random connections after the initial setup, several factors should strike fear into system administrators' hearts if LAM were constantly running for all users to utilize:

  1. LAM leaves a Unix domain socket open on each machine (usually under the /tmp directory). Hence, if root is compromised on one machine, root is effectively compromised on all machines that are connected via LAM.

  2. There must be a .rhosts file (or some other trust mechanism) to allow root to run LAM on remote nodes. Depending on your local setup, this may not be safe.

  3. LAM has never been checked for buffer overflows and other malicious-input errors. LAM is tested heavily before release, but never from a root-level security perspective.

  4. LAM programs are not audited or tracked in any way. This could present a sneaky way to execute binaries without log trails (especially as root).

Hence, it's a Very Bad Idea to run LAM as root. LAM binaries will quit immediately if root runs them. Log in as a different user to run LAM.

The one exception to this is the recon tool -- root is allowed to run recon because it is typical for system administrators to want to verify a LAM installation; there is little harm in allowing root-level access to this tool.



8. How should I set up LAM for multiple users?

As stated above, there are two main factors in getting LAM to work for most users:

  1. Having the LAM binaries in their path
  2. Having the ability to launch programs (including the LAM binaries) on remote nodes

As a system administrator, the following suggestions will make it easier for your users to run LAM/MPI programs:

  • Install LAM on all the machines that it will be used on. You can do this by manually installing on each machine, or by making LAM available to all the machines on a common filesystem (such as NFS). See the question "Do I need a common filesystem on all my nodes?".
  • If you control a site-wide shell startup file (such as /etc/Cshrc or another global startup file), you can place the LAM binaries in users' paths without them having to do anything. You can also set environment variables to override LAM defaults; see the LAM/MPI User's Guide for more details.
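For example, a single line such as the following in a site-wide csh startup file puts the LAM binaries in every user's path (the installation prefix is hypothetical):

set path = ($path /usr/local/lam/bin)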



9. Do I need a common filesystem on all my nodes?

No, but it certainly makes life easier if you do.

A common environment for running LAM is a Beowulf-class or other workstation cluster; simply stated, LAM can run on a group of workstations connected by a network. As mentioned above, there are several prerequisites: for example, the rsh boot module requires that the user have an account on all the machines and be able to rsh (or ssh, or whatever other remote shell transport is desired -- see above for how to change the underlying remote shell transport) to all the machines.

This raises a question for LAM system administrators: where should the LAM binaries, header files, etc. be installed? This discussion mainly addresses homogeneous clusters (i.e., where all nodes and operating systems are the same), although elements of it apply to heterogeneous clusters as well. Administrators of heterogeneous clusters are encouraged to read this discussion and then see the heterogeneous section of this FAQ.

There are two main choices:

  1. Have a common filesystem, such as NFS, between all the machines to be used. Install the LAM files such that the installation directory has the same path on each node. This greatly simplifies users' .bashrc/.cshrc/.profile scripts -- the PATH can be set without checking which machine the user is on. It also simplifies the system administrator's job; when the time comes to patch or otherwise upgrade LAM, only one copy needs to be modified.

    For example, consider a cluster of four machines: inky, blinky, pinky, and clyde.

    • Install LAM on inky's local hard drive in the directory /home/lam. The system administrator then mounts inky:/home/lam on the remaining three machines, such that /home/lam on all machines is effectively "the same". That is, the following directories all contain the LAM installation:

      inky:/home/lam
      blinky:/home/lam
      pinky:/home/lam
      clyde:/home/lam
      

    • Install LAM on inky's local hard drive in the directory /usr/local/lam-7.1.1. The system administrator then mounts inky:/usr/local/lam-7.1.1 on all four machines in some other common location, such as /home/lam (a symbolic link can be installed on inky instead of a mount point for efficiency; see the sketch after this list). This strategy is typically used in environments where one tree is NFS-exported but a different tree holds the actual installation. For example, the following directories all contain the LAM installation:

      inky:/home/lam
      blinky:/home/lam
      pinky:/home/lam
      clyde:/home/lam
      

      Notice that these are the same four directories as in the previous example, but on inky the directory is actually located in /usr/local/lam-7.1.1.

    There is a bit of a disadvantage to this approach: each of the remote nodes has to incur NFS (or whatever filesystem is used) delays to access the LAM directory tree. However, the ease of administration and the (relatively) low cost of using a networked filesystem usually far outweigh this penalty. Indeed, once an MPI application is running, it does not use the LAM binaries very much.

  2. If you are concerned with networked filesystem costs of accessing the LAM binaries, you can install LAM on the local hard drive of each node in your system. Again, it is highly advisable to install LAM in the same directory on each node so that each user's PATH can be set to the same value, regardless of the node that a user has logged on to.

    This approach saves some network latency when accessing the LAM binaries, but it is typically only used where users are very concerned about squeezing every spare cycle out of their machines.
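As a concrete sketch of the second mounting strategy from option 1 above (the exact mount invocation varies by operating system; run as root on each client):

# on blinky, pinky, and clyde:
shell# mount inky:/usr/local/lam-7.1.1 /home/lam

# on inky itself, a symbolic link avoids a self-mount:
shell# ln -s /usr/local/lam-7.1.1 /home/lam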



10. Why isn't LAM_SESSION_PREFIX distributed to all nodes?

Unlike LAM_SESSION_SUFFIX, LAM_SESSION_PREFIX is specifically not distributed to all nodes by the LAM runtime environment. It is treated as a local value -- a LAM-specific $TMPDIR. POSIX conventions dictate that $TMPDIR is relevant and specific to a given node and/or environment.

As such, since LAM_SESSION_PREFIX is the LAM-specific equivalent of $TMPDIR, it is left up to the user and/or back-end run-time environment to provide a relevant value for LAM_SESSION_PREFIX (but only if it is necessary!) on each node. For example, in rsh/ssh-based environments, this is typically accomplished by setting LAM_SESSION_PREFIX in users' shell startup files (e.g., .profile, .tcshrc, etc.).
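For example, to direct LAM's session directory to a node-local scratch area (the path shown is hypothetical), a Bourne-shell startup file could contain:

LAM_SESSION_PREFIX=/scratch/$USER
export LAM_SESSION_PREFIX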

Also remember that LAM_SESSION_PREFIX is typically not necessary -- see the section entitled "LAM's Session Directory" in the User's Documentation for more details.



11. Can LAM be used with AFS?

Yes.

Many sites tend to install the AFS rsh replacement that passes tokens to the remote machine as the default rsh. Similarly, most modern versions of ssh have the ability to pass AFS tokens. Hence, if you are using the rsh boot module with recon or lamboot, your AFS token will be passed to the remote LAM daemon automatically. If your site does not install the AFS replacement rsh as the default, consult the documentation on --with-rsh to see how to set the path to the rsh that LAM will use.

Once you use the replacement rsh or an AFS-capable ssh, you should get a token on the target node when using the rsh boot module. This means that your LAM daemons are running with your AFS token, and you should be able to run any program that you wish, including those that are not system:anyuser accessible. You will even be able to write into AFS directories where you have write permission (as you would expect).

NOTE: If you are using a different boot module, you may experience problems with obtaining AFS tokens on remote nodes.

Keep in mind, however, that AFS tokens have limited lifetimes and will eventually expire. This means that your LAM daemons (and user MPI programs) will lose their AFS permissions after some specified time unless you renew your token (with the klog command, for example) on the originating machine before the token runs out. This can play havoc with long-running MPI programs that periodically write out file results: if you lose your AFS token in the middle of a run and your program tries to write out to a file, it will not have permission to do so, which may cause Bad Things to happen.
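For example, renewing the token on the originating machine before it expires is typically as simple as the following (exact AFS commands and token lifetimes vary from site to site):

shell$ klog
Password: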

If you need to run long MPI jobs with LAM on AFS, it is usually advisable to ask your AFS administrator to increase your default token life time to a large value, such as 2 weeks.



12. Can LAM be used with ssh? More to the point -- does LAM have to use rsh?

Actually, starting with LAM v7.0, lamboot can use a variety of different methods to start the LAM run-time environment; specifically, LAM offers multiple boot SSI modules. The rsh boot module can use rsh, ssh, or any other rsh-like remote agent program, as long as the following conditions are met:

  • Users can launch programs on remote nodes without being prompted for a password
  • No information is output to stderr before the command is executed

Note that the 1.x series of ssh clients may print a standard xauth message to stderr. This message must be suppressed with the "-x" option to ssh, or LAM will interpret the stderr output as a failure to launch a remote program. The "-q" option may also be necessary to squelch other stderr warnings.

Note that you can specify "ssh -x" either with the --with-rsh option to configure, or via the LAMRSH environment variable at run time. For example:

shell$ ./configure --with-rsh="ssh -x"
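Or, set the LAMRSH environment variable at run time (csh syntax shown):

shell$ setenv LAMRSH "ssh -x"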

Or, specify the boot_rsh_agent SSI parameter at run time:

shell$ lamboot -ssi boot_rsh_agent "ssh -x" my_hostfile



13. What directory do I install LAM to?

See the question "Do I need a common filesystem on all my nodes?" -- it also addresses the issue of which directory to install the LAM binaries into (the two issues are directly related).
