
LAM FAQ: Running LAM/MPI applications

Table of contents:
  1. How do I compile my LAM/MPI program?
  2. How do I change the compilers that mpicc, mpic++/mpiCC, and mpif77 use?
  3. My Fortran MPI program fails to link! Why?
  4. My C++ MPI program fails to link! Why?
  5. Can MPI jobs be checkpointed and restarted?
  6. Does LAM/MPI support Myrinet?
  7. Does LAM/MPI support Infiniband?
  8. Can I run multi-process MPI applications on a single machine?
  9. How do I measure the performance of my parallel program?
  10. What directory does my LAM/MPI program run in on the remote nodes?
  11. How does LAM find binaries that are invoked from mpirun?
  12. Why doesn't "mpirun -np 4 test" work?
  13. Can I run multiple LAM/MPI programs simultaneously?
  14. Can I pass environment variables to my LAM/MPI processes on the remote nodes upon invocation?
  15. mpirun -c and mpirun -np -- what's the difference?
  16. What is "pseudo-tty support"? Do I want that?
  17. Why can't my process read stdin?
  18. Why can only rank 0 read from stdin?
  19. What is the lamd RPI module?
  20. Why would I use the lamd RPI module (vs. other RPI modules)?
  21. How do I run LAM/MPI user programs on multi-processor machines?
  22. Can I mix multi-processor machines with uni-processor machines in a single LAM/MPI user program run?
  23. How do I run an MPMD program? More specifically -- how do I start different binaries on each node?
  24. How do I mpirun across a heterogeneous cluster?
  25. My LAM/MPI process doesn't seem to reach MPI_INIT. Why?
  26. My LAM/MPI process seems to get "stuck" -- it runs for a while and then just hangs. Why?
  27. TCP performance under Linux 2.2.0-2.2.9 just plain sucks! Why?

[ Return to FAQ ]


1. How do I compile my LAM/MPI program?

The mpicc, mpic++/mpiCC, and mpif77 "wrapper" compilers are provided to compile C, C++, and Fortran LAM/MPI programs (respectively).

These so-called "wrapper" compilers insert all the relevant compiler and linker flags -- that is, the directories where the LAM include files and libraries reside, which are required to compile and link LAM/MPI programs. Rather than forcing the user to supply these flags manually, the wrapper compilers simply take all user arguments, pass them through to the underlying compiler, add flags indicating the location of LAM's include files and libraries, and link in the relevant libraries.

What this all means is that compiling LAM/MPI programs is very simple:

shell$ mpicc myprogram.c -o myprogram

will compile a C program.

shell$ mpiCC myprogram.cc -o myCPPprogram
shell$ mpif77 myprogram.f -o myFprogram

will compile a C++ and Fortran LAM/MPI program, respectively.

Additionally, the wrapper compilers can be used to produce object files, which can be linked later:

shell$ mpicc -c foo.c
shell$ mpicc -c bar.c
shell$ mpicc -c baz.c
shell$ mpicc foo.o bar.o baz.o -o myprogram

It is not necessary to add -lmpi to any of the wrapper compiler commands; this is implicit in all the wrapper compilers when an executable is to be linked.
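
If you are curious about exactly what a wrapper compiler does, many LAM installations support a -showme option that prints the underlying compiler command line without executing it (availability may depend on your LAM version; check the mpicc(1) man page):

shell$ mpicc -showme myprogram.c -o myprogram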

[ Top of page | Return to FAQ ]


2. How do I change the compilers that mpicc, mpic++/mpiCC, and mpif77 use?

The mpicc, mpic++/mpiCC, and mpif77 compilers are really "wrapper" compilers to an underlying compiler. That is, they only add several command line switches to the underlying compiler for the convenience of the user. These switches include the relevant LAM directories where the include and library files reside, the relevant LAM libraries to link in, etc.

As such, the underlying compiler is selected when LAM itself is configured and compiled, but it can be overridden via environment variables at run time. The environment variables LAMHCC, LAMHCP, and LAMHF77, if defined, override the underlying compiler that mpicc, mpic++/mpiCC, and mpif77 (respectively) will invoke.

For example, to override the default C compiler when using a Bourne shell (or sh-derived shell):

shell$ LAMHCC=some_other_cc_compiler
shell$ export LAMHCC
shell$ mpicc myfile.c -o myfile

or, when using a C shell (or csh derivative):

shell% setenv LAMHCC some_other_cc_compiler
shell% mpicc myfile.c -o myfile

A common use for this feature is to change the underlying Fortran 77 compiler to a Fortran 90 compiler for the mpif77 wrapper compiler.
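
For example, using a C shell, mpif77 can be pointed at a Fortran 90 compiler like this (the compiler name f90 is only a placeholder for whatever Fortran 90 compiler is installed on your system):

shell% setenv LAMHF77 f90
shell% mpif77 myprogram.f -o myF90program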

Note that starting with LAM 6.3, it is not necessary to specify -lmpi at the end of the compile line. It is still necessary with previous versions of LAM.

WARNING: It may not be a Good Idea to change the default compiler set from the one with which LAM was compiled, particularly the Fortran and C++ compilers (for Fortran/C++ programs, or user programs that use the Fortran or MPI-2 C++ bindings). This is because different compilers may use different internal linkage conventions and/or have conflicts with header files and other system-level muckety-muck.

If you need to change the default compiler, you may wish to first ensure that you can link .o files from the two compilers into a single executable that works properly.

[ Top of page | Return to FAQ ]


3. My Fortran MPI program fails to link! Why?

When an MPI Fortran program fails to link, it is usually due to one of two common problems:

  • Not using the LAM mpif77 wrapper compiler.
  • Using a different underlying Fortran compiler than LAM was compiled with.

The LAM Team strongly recommends using the wrapper compiler mpif77 to compile and link all Fortran MPI programs. mpif77 adds in any relevant compiler and/or linker flags to compile and link MPI programs. Note that this list of flags may be different depending on how LAM/MPI was configured, so it is not always safe to figure out what mpif77 is adding and then add those flags to your own compile/link command line manually (and not use mpif77).

Additionally, it is almost always important to use the same underlying Fortran compiler that LAM was compiled with. Although mpif77 allows the user to change the underlying Fortran compiler that is invoked, it is typically not a good idea to do this because different Fortran compilers use different "name-mangling" schemes for their link-time symbols. A common symptom is changing the underlying Fortran compiler that mpif77 uses and then seeing link-time error messages similar to the following:

pi.o(.text+0x59): undefined reference to `MPI_INIT'
pi.o(.text+0x8a): undefined reference to `MPI_COMM_RANK'
pi.o(.text+0xb6): undefined reference to `MPI_COMM_SIZE'
pi.o(.text+0x5ba): undefined reference to `MPI_FINALIZE'

This typically indicates that LAM was configured and compiled with one Fortran compiler (that uses one particular name-mangling scheme), and the user's program was compiled with another underlying Fortran compiler (that uses a different name-mangling scheme).
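
One way to see the mismatch is to inspect the Fortran symbols in your object files with the standard nm utility and compare them against the symbols LAM's Fortran library exports (the object file name pi.o here is just an example). If one side shows names like mpi_init_ and the other shows MPI_INIT, the two compilers are using different name-mangling schemes:

shell$ nm pi.o | grep -i mpi_init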

The solution is to use the same Fortran compiler that LAM was configured with. If you need to use a different Fortran compiler, you will need to re-configure and re-install LAM to use that Fortran compiler. Use the --with-fc switch to configure.

[ Top of page | Return to FAQ ]


4. My C++ MPI program fails to link! Why?

The common problems here are almost identical to when Fortran MPI programs fail to link:

  • Be sure to use the mpiCC (or mpic++) wrapper compiler.
  • Use the same underlying C++ compiler that LAM was configured/compiled with.

See the question "My Fortran MPI program fails to link! Why?" for more details.

[ Top of page | Return to FAQ ]


5. Can MPI jobs be checkpointed and restarted?
Applies to LAM 7.0 and above

Yes. Generally, for an MPI job to be checkpointable:

  • The same checkpoint/restart SSI module must be selected on all MPI processes in the MPI job.
  • All SSI modules selected for use in the MPI job must include support for checkpoint/restart. At the time of this writing, crtcp is the only RPI SSI module that includes support for checkpoint/restart. All collective SSI modules support checkpoint/restart.
  • Currently, only MPI-1 jobs can be checkpointed. The behavior of jobs performing non-local MPI-2 functions (e.g., dynamic functions to launch new MPI processes) in the presence of checkpoint and restart is undefined.
  • Checkpoints can only occur after all processes in the job invoke MPI_INIT and before any process invokes MPI_FINALIZE.

LAM/MPI currently only supports the Berkeley Lab Checkpoint-Restart (BLCR) system. Support for BLCR must be available in the LAM/MPI installation (this can be checked with the laminfo command).

Unfortunately, at the time of LAM/MPI's initial 7.0 release, the BLCR software was not yet available to the general public. Keep checking the BLCR web page for updates.

See the LAM/MPI User's Guide for more details about checkpointing and restarting MPI jobs.
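
For example, the checkpoint/restart-capable RPI module and the BLCR checkpoint/restart module can be requested explicitly on the mpirun command line. The following is only a sketch -- the exact module names available in your installation (and whether they were compiled in at all) should be verified with laminfo:

shell$ mpirun -ssi rpi crtcp -ssi cr blcr C my_mpi_program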

[ Top of page | Return to FAQ ]


6. Does LAM/MPI support Myrinet?
Applies to LAM 7.0 and above

Yes. The gm RPI SSI module provides low latency, high bandwidth using the native Myrinet GM message passing library. You can check to see if your LAM/MPI installation has support for native GM message passing by running the laminfo command.

Unless some other module was selected as the default, the gm RPI SSI module should select itself as the RPI to be used if Myrinet hardware is available.
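
If you want to force the selection (for example, to verify that GM is really being used instead of TCP), the gm module can be requested explicitly. This is a sketch assuming the LAM 7.x -ssi mpirun syntax:

shell$ mpirun -ssi rpi gm C my_mpi_program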

There is no need to specify to LAM which port to use; in most cases, the gm module will search and find an available port to use on every node in the MPI job.

Be sure to see the LAM/MPI User's Guide for more details about the gm RPI SSI module.

[ Top of page | Return to FAQ ]


7. Does LAM/MPI support Infiniband?

Yes. The ib RPI SSI module provides low latency and high bandwidth on Infiniband networks using the Mellanox Verbs Interface (VAPI). You can check to see if your LAM/MPI installation has support for IB message passing by running the laminfo command.
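
As with the other RPI modules, the ib module can be requested explicitly on the mpirun command line. This is a sketch assuming the LAM 7.x -ssi mpirun syntax:

shell$ mpirun -ssi rpi ib C my_mpi_program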

Be sure to see the LAM/MPI User's Guide for more details about the ib RPI SSI module.

[ Top of page | Return to FAQ ]


8. Can I run multi-process MPI applications on a single machine?

Yes. This is actually a common way to test parallel applications.

You can run all the processes of your parallel application on a single machine. LAM/MPI allows you to launch multiple processes on a single machine, regardless of how many CPUs are actually present.

A common way of doing this is by using the default boot schema that is installed by LAM/MPI -- it contains a single node: the localhost. If you run lamboot with no arguments, the default boot schema will be used, and (assuming it hasn't been replaced), will launch a LAM universe consisting of just your local machine.

Then use the -np option to mpirun to specify the desired number of processes to launch. For example:

shell$ mpirun -np 4 my_mpi_application

will start 4 instances of my_mpi_application on the local machine. For more information, see the LAM/MPI User's Guide (including the Quick Start Tutorial), and the lamboot(1), mpirun(1), and bhost(5) man pages.

[ Top of page | Return to FAQ ]


9. How do I measure the performance of my parallel program?

In short, the only real meaningful metric of the performance of a parallel application is the wall clock execution time.

"User", "System", and "CPU" times are generally not useful because they only contain portions of the overall run-time, and have little meaning in a parallel application that spans multiple nodes (especially in heterogeneous situations). The use of wall-clock time encompasses the entirety of the performance of the parallel application -- all processes, all I/O, all message passing, etc. Trying to measure single components of this overall time is difficult (and usually impossible) since each system has many different sources of overhead (some less obvious than others).

[ Top of page | Return to FAQ ]


10. What directory does my LAM/MPI program run in on the remote nodes?

The default behavior for mpirun is to set the present working directory on all nodes to the directory from which mpirun was launched. If this directory does not exist on the remote nodes, the present working directory is set to $HOME.

This behavior can be overridden with the -D or -wd command line switches to mpirun.

-wd can be used to set an arbitrary working directory. For example:

shell$ mpirun -wd /home/jshmo/mpi N my_mpi_program

will change the present working directory to /home/jshmo/mpi (on all nodes), and then attempt to run the my_mpi_program program. my_mpi_program must be in the $PATH (which may include ".", i.e., /home/jshmo/mpi).

A popular shortcut for mpirun is:

shell$ cd /home/jshmo/mpi
shell$ mpirun N `pwd`/my_mpi_program

although this assumes that /home/jshmo/mpi/my_mpi_program exists on all nodes.

[ Top of page | Return to FAQ ]


11. How does LAM find binaries that are invoked from mpirun?

When you mpirun a relative file name (say, foo), LAM tries to find foo in your $PATH on all nodes and execute it. This follows the Unix/shell model of execution. If you mpirun an absolute filename, LAM simply tries to execute that absolute filename on all nodes. That is:

% mpirun C foo

will depend on the user's $PATH on each machine to find foo.

% mpirun C /home/jshmo/mpi/foo

will simply execute /home/jshmo/mpi/foo on all CPUs. The $PATH environment variable is not used in this case.

This model allows users to set the $PATH environment variable properly in their .cshrc, .profile, or other shell startup script to find the right executables for their architecture. That is, when running LAM in a heterogeneous situation, if the user's shell startup script sets the $PATH appropriately on each node, mpirun foo may find different foo executables on each node (which is probably what you want).

For example, if running on a cluster of Sun and HP workstations, if the user's .cshrc sets /home/jshmo/mpi/SUN in the $PATH on Sun machines, and sets /home/jshmo/mpi/HP in the $PATH on HP machines, mpirun foo will find the foo in the SUN directory on the Sun workstations, and find the foo in the HP directory on the HP workstations.

LAM attempts to change to the directory (on the remote nodes) of the same name as the pwd from where mpirun was invoked (unless overridden with the -wd or -D command line options to mpirun -- see the manual page for mpirun(1) for more details). This can affect the $PATH search if "." is in the $PATH.

[ Top of page | Return to FAQ ]


12. Why doesn't "mpirun -np 4 test" work?

If attempting to run a test program named "test" with a command similar to "mpirun -np 4 test" fails with an error message similar to "It seems that [at least] one of processes that was started with mpirun did not invoke MPI_INIT before quitting...", then you've run into a well-known problem that is not really an MPI issue.

More often than not, mpirun will find the unix utility "test" before it finds your MPI program named "test". This is typically because the unix utility test can be found early in your path, such as in /bin/test or /usr/bin/test. See the FAQ question "How does LAM find binaries that are invoked from mpirun?"
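
You can confirm that this is what is happening with the standard which utility (the path printed will vary from system to system):

shell$ which test
/bin/test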

There are some easy solutions to this problem:

  • Rename your program to something other than test
  • Use the full pathname in the mpirun command line, such as "mpirun -np 4 /home/jshmo/mpi/test" (assuming that /home/jshmo/mpi/test is a valid executable on all nodes)

[ Top of page | Return to FAQ ]


13. Can I run multiple LAM/MPI programs simultaneously?

Yes. Once you lamboot, you can run as many processes as you wish. For example, if you wish to run two different applications on a group of nodes:

% mpirun c0-3 program1
% mpirun c4-7 program2

program1 will be run on the first four CPUs, and program2 will be run on the last four CPUs. The two programs will not interfere with each other; LAM guarantees that no messages from either application will overlap.

There is no need to issue a second lamboot.

[ Top of page | Return to FAQ ]


14. Can I pass environment variables to my LAM/MPI processes on the remote nodes upon invocation?
Applies to LAM 6.3 and above

Yes. The -x option to mpirun will explicitly pass environment variables to remote processes, and instantiate them before the user program is invoked (i.e., before main()). Multiple environment variables may be listed with the -x option, separated by commas:

% mpirun C -x DISPLAY,ALPHA_VALUE,BETA_VALUE myprogram

Additionally, all environment variables that have names that begin with LAM_MPI_ will automatically be exported to remote processes. The -nx option to mpirun will prevent this behavior. -x and -nx can be used together:

% mpirun C -nx -x DISPLAY,ALPHA_VALUE,BETA_VALUE myprogram

[ Top of page | Return to FAQ ]


15. mpirun -c and mpirun -np -- what's the difference?

They are very similar -- you can almost think of -c as a synonym for -np.

The only difference is that you still need to specify a set of LAM nodes with the -c option:

shell$ mpirun N -c 4 myprogram

will launch a total of 4 copies of myprogram, potentially using all nodes available in LAM. For example, if there are 4 nodes, then each node would get one process. If there are only 2 nodes, each node would get two processes. If there are 6 nodes, only the first four nodes would each get a single process. More to the point:

shell$ mpirun n0-1 -c 4 myprogram

will launch a total of 4 processes on the first two nodes in LAM (i.e., 2 processes per node).

shell$ mpirun -np 4 myprogram

implies N (or C) -- a total of 4 processes will be launched, potentially using all nodes in LAM.

[ Top of page | Return to FAQ ]


16. What is "psuedo-tty support"? Do I want that?

Pseudo-tty support enables, among other things, line-buffered output from the remote nodes. This is usually a Good Thing -- the stdout and stderr from multiple nodes will not overlap each other on the same line. This is probably what you want -- orderly output from all your nodes, as opposed to jumbled and potentially overlapping output.

Starting with LAM 6.5, pseudo-tty support is enabled by default. It can be turned off with the -npty command line option to mpirun.

[ Top of page | Return to FAQ ]


17. Why can't my process read stdin?

When I execute the following code fragment:

int x;
printf("enter x =");
scanf("%d", &x );

and enter the value of x as, say, 5, I get the following:

5 command not found.

My application does not seem to be reading standard input.

The solution to this is to use the -w option to mpirun (or don't use the -nw option). This makes mpirun wait for your MPI application to terminate. If you use -nw, mpirun terminates after starting the application, and you return to the shell with your MPI application running in the background and competing with the shell for input.

In the example above, the shell (rather than the application) received the input "5" and could not find any command named 5.
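
A sketch of an invocation that keeps mpirun attached to the terminal, so that your application (and not the shell) receives the input (my_mpi_application is a placeholder name):

shell$ mpirun -w C my_mpi_application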

[ Top of page | Return to FAQ ]


18. Why can only rank 0 read from stdin?

LAM connects the stdin on all other ranks to /dev/null. There simply is no better way to route the standard input to all the different ranks.

If you need to use stdin on all of your ranks, you may wish to write a shell script that executes an xterm (or some other graphic command shell window) and then runs your MPI application. Take the following shell script as an example:

#!/bin/csh -f
echo "MPI app on `hostname`: $DISPLAY"
xterm -e my_mpi_application
exit 0

If you mpirun this shell script (and export the DISPLAY environment variable properly), an xterm window will pop up on your display for each MPI rank with your MPI application running in it.

Note that you will need to set up your environment to allow remote X requests to your DISPLAY. This is typically achieved with the xauth and/or xhost commands (not discussed here).

[ Top of page | Return to FAQ ]


19. What is the lamd RPI module?

In the lamd RPI module, all MPI messages are passed between ranks via the LAM daemons that are launched at lamboot. That is, for a message between process A and process B, the message actually follows the route:

Process A ---> LAM daemon on node where process A resides
                               |
                               |
Process B <--- LAM daemon on node where process B resides

Note that the message actually takes 3 hops before it reaches its destination. Also note that the LAM daemon where process A resides may be the same as the LAM daemon where process B resides -- if process A and process B reside on the same node, they share a common LAM daemon. In this case, there is only a total of two hops for the message to go from process A to process B.

All other RPI modules generally send messages directly from one MPI process to the target process. For example, a message from process A to process B traverses the following path:

Process A ---> Process B

That is, the LAM daemons are not involved in the communication at all. All MPI messages take 1 hop to end up on the receiving side.

This raises the obvious question: why would you choose the lamd RPI module, given that it is definitely slower than most other RPI modules? See the next FAQ question.

[ Top of page | Return to FAQ ]


20. Why would I use the lamd RPI module (vs. other RPI modules)?

Although the lamd RPI module is typically slower than other RPI modules (because MPI messages generally must take two extra hops before ending up at their destination), the lamd RPI has the following advantages over its peer RPI modules:

  • Third party applications such as XMPI can monitor message passing, and create reports on patterns and behavior of your MPI program.

  • The LAM daemon can exhibit true asynchronous message passing behavior. That is, the LAM daemon is effectively a separate thread of execution, and can therefore make progress on message passing even while the MPI application is not in an MPI function call. Since LAM/MPI is currently a single-threaded MPI implementation, most other RPI modules will only make progress on message passing while in MPI function calls.

    Therefore, MPI applications that can use latency-hiding techniques (a brief sketch is shown below) can actually achieve good performance from the lamd RPI module, even though the latency is higher than other RPI modules. This strategy has been discussed on the LAM mailing list.
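
As a sketch of the latency-hiding technique mentioned above (the buffer, peer rank, and message size are placeholders, and the commented region stands in for your application's computation): post a nonblocking send, do useful work, and only then wait for completion. With the lamd RPI -- selected, for example, with the -lamd option to mpirun -- the LAM daemon can be moving the message across the network while this rank is busy computing between MPI_Isend and MPI_Wait.

#include <mpi.h>

#define MSG_SIZE 65536

/* Overlap communication with computation (latency hiding).        */
/* With the lamd RPI, the daemon can make progress on the message  */
/* while this process computes between MPI_Isend and MPI_Wait.     */
void send_with_overlap(char *buf, int peer)
{
  MPI_Request request;
  MPI_Status  status;

  MPI_Isend(buf, MSG_SIZE, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &request);

  /* ... computation that does not touch buf goes here ... */

  MPI_Wait(&request, &status);
}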

[ Top of page | Return to FAQ ]


21. How do I run LAM/MPI user programs on multi-processor machines?

There are two options:

  • New "C" syntax has been added to mpirun (note that this section also applies to the "lamexec" command). When running on SMP machines, it is frequently desirable to group as many adjoining ranks as possible on a single node in order to maximize shared memory message passing. When used in conjunction with the extended bootschema syntax (that allows the specification of number of CPUs available on each host), the mpirun "C" syntax will run one executable on each available CPU, and will group adjoining MPI_COMM_WORLD ranks on the same nodes. For example, when running on two SMPs, the first having four CPUs, and the second having two CPUs, the following command:

    shell$ mpirun C my_mpi_program
    
    will run four copies of my_mpi_program on the four-way SMP (MPI_COMM_WORLD ranks 0 through 3), and will run two copies of my_mpi_program on the two-way SMP (MPI_COMM_WORLD ranks 4 and 5).

    Just like the "N" syntax in mpirun, the "C" syntax can also be used to indicate specific CPUs. For example:

    shell$ mpirun c4,5 my_mpi_program
    
    runs my_mpi_program on the fourth and fifth CPUs (i.e., the two-way SMP from the previous examples). "C" and "cX" syntax can also be combined:

    shell% mpirun c0 C master-slave
    
    could be used to launch a "master" process (i.e., rank 0 in MPI_COMM_WORLD) on CPU zero, and a slave on every CPU (including CPU zero; this may be desirable, for example, in situations where the master rank does very little computation).

    The behavior of "-np" has been altered to match the "C" semantics. "-np" now schedules across CPUs, not nodes. Using "-np 6" in the previous example would be the same as "C"; using "-np 4" would run one four copies of "foo" on the four-way SMP.

    Also note that "N", "nX", C", and "cX" syntax can all be used simultaneously, although it is not clear that this is really useful.

  • An application schema file can be used to specify exactly what is launched on each node. See the question "How do I run an MPMD program?"

[ Top of page | Return to FAQ ]


22. Can I mix multi-processor machines with uni-processor machines in a single LAM/MPI user program run?

Yes. LAM makes no restriction on what machines you can run on. LAM also allows you to specify which binaries (and how many) to run on each node. There are three ways to launch different numbers of jobs on nodes in a LAM cluster:

  • lamboot has been extended to understand multiple CPUs on a single host, and is intended to be used in conjunction with the new "C" mpirun syntax for running on SMP machines (see the section on mpirun). Multiple CPUs can be indicated in two ways: list a hostname multiple times, or add a "cpu=N" phrase to the host line (where "N" is the number of CPUs available on that host). For example, the following hostfile:

            blinky
            blinky
            blinky
            blinky
            pinky cpu=2
    
    indicates that there are four CPUs available on the "blinky" host, and that there are two CPUs available on the "pinky" host. Note that this works nicely in a PBS environment, because PBS will list a host multiple times when multiple vnodes on a single node have been allocated by the scheduler.

    After this boot schema has been successfully booted, you can run on all CPUs with:

           mpirun C foo
    

    This will run four copies of foo on blinky and two copies of foo on pinky.

  • Order the nodes in your boot schema such that your SMPs are grouped together. For example:

    uniprocessor1
    uniprocessor2
    smp2way1
    smp2way2
    smp2way3
    smp2way4
    smp4way1
    smp4way2
    

    In the above boot schema, the first two machines are uniprocessors, the next four are 2-way SMPs, and the last two are 4-way SMPs. Launching an SPMD MPI program with the "correct" number of processes on each node can be accomplished with the following mpirun command:

    % mpirun n0-1 n2-5 n2-5 n6-7 n6-7 n6-7 n6-7 myprogram
    

    While not elegant, it will definitely work. The following table lists the nodes on which each rank will be launched:

    Rank  Node name      Node ID
      0   uniprocessor1  n0
      1   uniprocessor2  n1
      2   smp2way1       n2
      3   smp2way2       n3
      4   smp2way3       n4
      5   smp2way4       n5
      6   smp2way1       n2
      7   smp2way2       n3
      8   smp2way3       n4
      9   smp2way4       n5
     10   smp4way1       n6
     11   smp4way2       n7
     12   smp4way1       n6
     13   smp4way2       n7
     14   smp4way1       n6
     15   smp4way2       n7

  • Use an application schema file. Especially with N-way SMPs (where N>2), or for a large number of non-uniform SMPs, it can be an easier method of launching than passing many command line parameters to mpirun. See the question "How do I run an MPMD program?"

[ Top of page | Return to FAQ ]


23. How do I run an MPMD program? More specifically -- how do I start different binaries on each node?

The easiest method is with the mpiexec command (v7.0 and above). Other cases can use an application schema file. There are two common scenarios where launching different executables on different nodes is necessary: MPMD jobs and heterogeneous jobs.

  • mpiexec examples:

    For MPMD jobs, multiple executables can be listed on the same mpiexec command line:

    shell$ mpiexec c0 manager : C worker
    

    will launch manager on CPU 0, and launch worker everywhere else. Heterogeneous environments can benefit from this behavior as well; since different executables need to be created for each architecture, it may be desirable to place them in the same directory and name them differently. For example:

    shell$ mpiexec -arch linux my_mpi_program.linux : \
        -arch solaris my_mpi_program.solaris
    

    This will launch my_mpi_program.linux on every Linux node found in the current universe, and my_mpi_program.solaris on every Solaris node in the LAM universe. Note that the default scheduling in this case is by node, not by CPU.

    See the "Hetrogeneity" section of this FAQ for more details, as well as that mpiexec(1) man page and the LAM/MPI User's Guide.

  • Alternatively, an application schema file (frequently abbreviated "app schema"; see appschema(5)) can be used to specify the exact binaries and run-time options that are started on each node in the LAM system. An app schema can be used to start different binaries on each node, specify different run-time options to different nodes, and/or start different numbers of binaries on each node.

    An app schema is an ASCII file that lists, one per line, a node ID (or group of node IDs), the binary to be run, and all run time options. For example:

    c0 manager
    C worker
    

    This application schema starts a manager process on c0, and also starts a worker process on all CPUs that LAM was booted on. Note that this puts two processes on c0 -- manager and the first of the worker processes. To avoid this overlap, the following app schema can be used:

    c0 manager
    c1-8 worker
    

    Note that all LAM options must come before the binary file name (this is new starting with LAM 6.2b). User-specific command line arguments can come after the binary name.

    See the appschema(5) manual page for more information.
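
    Once the app schema has been written to a file (the file name my_appschema below is only an example), it is given to mpirun in place of a program name; see the mpirun(1) man page for the exact syntax supported by your version of LAM:

    shell$ mpirun my_appschema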

Note that LAM has no concept of scheduling on CPUs -- this is the responsibility of the operating system. The "C" notation is simply a convenient representation of how many jobs should be launched on each node. LAM will launch that many jobs and let the operating system handle all scheduling/CPU issues. So the prior example would not necessarily have manager and the first worker competing over the first CPU (assuming that c0 is located on an SMP); it means that LAM would schedule (M+1) programs on a machine with (M) processors.

[ Top of page | Return to FAQ ]


24. How do I mpirun across a heterogeneous cluster?

This question is discussed in detail in the Heterogeneous section of this FAQ.

[ Top of page | Return to FAQ ]


25. My LAM/MPI process doesn't seem to reach MPI_INIT. Why?

This can happen for many reasons. Among the most common things to check are:

  • Ensure that you are running the same version of LAM executables on all nodes.

  • Ensure that your application has been compiled/linked with the same version of LAM you are trying to run under.

  • Ensure that mpirun is able to find your application; ensure that some other application of the same name is not being found in your $PATH before the one that you expect to run.

  • Ensure that cruft from previous LAM/MPI runs is not lingering around on the network; use the lamclean command.

[ Top of page | Return to FAQ ]


26. My LAM/MPI process seems to get "stuck" -- it runs for a while and then just hangs. Why?

This typically indicates either an error in communication patterns, or code that assumes a large degree of message buffering on LAM's part (which can result in deadlock).

The first case (an error in communication patterns) is usually fairly easy to find: use the daemon mode of communication (i.e., specify the -lamd option to mpirun), and use the mpitask command to check the state of the running program. If the program does become deadlocked due to incorrect communication patterns, mpitask will show the messages that are queued up within LAM, as well as the MPI function that is blocked in each process.

The second case is typically due to sending a large number of messages (or a small number of large messages) without matching receives. More to the point, in poorly ordered message passing sequences, LAM's message queues (or the underlying operating system or native message passing system's queues) can fill up with pending messages that have not yet been received. Consider the following code snippet:

for (i = 0; i < size; i++)
  if (i != my_rank) 
    MPI_Send(buf[i], MSG_SIZE, MPI_BYTE, i, tag, 
             MPI_COMM_WORLD);

for (i = 0; i < size; i++)
  if (i != my_rank)
    MPI_Recv(buf[i], MSG_SIZE, MPI_BYTE, i, tag, 
             MPI_COMM_WORLD, &status);

Notice that all the sends must complete before any of the receives will be posted. For small values of MSG_SIZE, LAM will execute this program correctly. For larger values of MSG_SIZE, the program will "hang": LAM's internal message buffers/queues become exhausted, and the program deadlocks while waiting for them to drain. (This is actually an issue for all MPI implementations -- it is not unique to LAM; LAM actually has more buffering capacity than most other MPI implementations.)
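
One way to make this kind of exchange safe regardless of message size is to post nonblocking receives before the sends, and then wait for everything to complete. The following is a sketch that assumes the same i, size, my_rank, MSG_SIZE, tag, and buf variables as the fragment above; rbuf (a separate set of receive buffers) and MAX_PROCS (a constant at least as large as the number of processes) are placeholders added for the example:

MPI_Request requests[2 * MAX_PROCS];
MPI_Status  statuses[2 * MAX_PROCS];
int nreqs = 0;

/* post all receives first so that incoming messages always have  */
/* somewhere to land, no matter what order the sends complete in  */
for (i = 0; i < size; i++)
  if (i != my_rank)
    MPI_Irecv(rbuf[i], MSG_SIZE, MPI_BYTE, i, tag,
              MPI_COMM_WORLD, &requests[nreqs++]);

for (i = 0; i < size; i++)
  if (i != my_rank)
    MPI_Isend(buf[i], MSG_SIZE, MPI_BYTE, i, tag,
              MPI_COMM_WORLD, &requests[nreqs++]);

MPI_Waitall(nreqs, requests, statuses);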

[ Top of page | Return to FAQ ]


27. TCP performance under Linux 2.2.0-2.2.9 just plain sucks! Why?

This is a problem in the 2.2.0-2.2.9 series of Linux kernels. There is a specific message size at which LAM's performance drops off dramatically. There has been considerable discussion on the Linux kernel newsgroups about whose fault this is -- the kernel, or the application.

The problem appears to have been fixed starting with Linux 2.2.10. If you have a kernel before this version, you should probably upgrade.

A much more comprehensive discussion of the problem is available here.

[ Top of page | Return to FAQ ]