pMatlab Job Problems
This level of troubleshooting assumes that you are able to connect to the SuperCloud system and submit a job. If you are unable to connect, please email supercloud@mit.edu with the error that you are experiencing.
Below, we list some steps to help you sanity check your SuperCloud configuration and debug the problem. We also list some common errors and the steps to resolve the problem, and where to find your output and results when running pMatlab jobs. If you still run into problems, email supercloud@mit.edu with the error that you are experiencing.
Steps to Sanity Check and Debug
If you are able to submit a job but receive errors that do not seem related to your application, we recommend the following strategy:
1st Step - Is my Configuration Correct?
An easy way to check on potential configuration issues is to run the
Param_Sweep example. There are instructions for running Param_Sweep
on the Verifying your pMatlab setup page.
2nd Step - You can Run Param_Sweep, but your Code fails
Your configuration is correct, so the next step is to look at the output
from your job. In this case, we are not looking at application-specific
files but rather at the output that is normally sent to the command
window. The compute nodes produce output, which is then directed to a
.out file in the MatMPI directory (or a subdirectory within MatMPI) of
your working MATLAB® directory.
The first task is to confirm that all of the remote processors started a
valid MATLAB® session and created the .out file. To check this you can
read the page on where to find pMatlab output files, or briefly:
- Go to the MatMPI directory in your working directory
- To find your output log files, look for subdirectories within the
MatMPI directory with names like p<start-pid>-p<end-pid>_<compute-node-name>, where:
  - p<start-pid> is the id of the first process running on the compute node whose name is <compute-node-name>
  - p<end-pid> is the id of the last process running on the compute node whose name is <compute-node-name>
  If your job ran on multiple compute nodes, there will be several of these subdirectories.
- Look at the list of files in those subdirectories: across all of them there should be
a total of n files with the .out extension, one per process
- The name of each .out file should consist of your filename, the processor id and the .out
extension, e.g. for Param_Sweep on 4 processors you would see:
param_sweep_parallel_v2.0.out, param_sweep_parallel_v2.1.out,
param_sweep_parallel_v2.2.out and param_sweep_parallel_v2.3.out.
If any .out file is missing, this is a problem and should be
reported to the SuperCloud Team by sending email to supercloud@mit.edu.
Often the visible symptom is that the application hangs on an agg
command, and a bit of inspection reveals that one remote
processor didn't properly start its MATLAB® session.
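The file check described above can be scripted. The following is a minimal sketch, not part of pMatlab or the SuperCloud tools: `check_out_files` is a hypothetical helper name, and the sketch assumes the `<filename>.<pid>.out` naming shown in the Param_Sweep example, with process ids running from 0 to n-1.

```shell
# check_out_files DIR BASENAME NP
# Hypothetical helper (not a SuperCloud command): verifies that each of the
# NP processes left a BASENAME.<pid>.out file somewhere under DIR.
check_out_files() {
    cof_dir=$1
    cof_base=$2
    cof_np=$3
    cof_missing=0
    cof_pid=0
    while [ "$cof_pid" -lt "$cof_np" ]; do
        # Search DIR and its per-node subdirectories for this process's log.
        if ! find "$cof_dir" -type f -name "${cof_base}.${cof_pid}.out" | grep -q .; then
            echo "missing: ${cof_base}.${cof_pid}.out"
            cof_missing=$((cof_missing + 1))
        fi
        cof_pid=$((cof_pid + 1))
    done
    echo "${cof_missing} of ${cof_np} .out files missing"
    # Succeed only when every process produced a log.
    [ "$cof_missing" -eq 0 ]
}

# Example, run from your working MATLAB directory after a 4-process job:
#   check_out_files MatMPI param_sweep_parallel_v2 4
```

Any file name reported as missing identifies the process id whose MATLAB® session most likely failed to start, which is useful information to include when you email supercloud@mit.edu.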
3rd Step - Errors, Warnings, and Messages from Remote MATLAB® Sessions
If all of the remote MATLAB® sessions started properly, the next step is
to check each of the .out files. Each .out file contains all of the
diary or screen output from a remote node. In most cases these files
should contain the same information as the command window on your local
machine, with the exception of numerical data, and should provide
some indication of a warning or error related to the failure of the
compute job. If the error message is not clear to you, or you are unsure
how to correct the error, send email with the error information to
supercloud@mit.edu.
If error messages were sent to STDERR by any of the remote MATLAB®
sessions, these error messages will appear in the .out files. These
error messages will also be written to the .err file (located in the
MatMPI directory). The process id (pid) of the process that generated
the error will be prepended to the error message, so you can look in
that process's .out file for additional output that may help you debug
the cause of the error.
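One way to scan all of the logs at once is sketched below. This is an illustration only: `scan_matmpi_logs` is a hypothetical helper name, and the search pattern assumes that the messages of interest contain the word "error" or "warning", which may not hold for every failure.

```shell
# scan_matmpi_logs DIR
# Hypothetical helper (not a SuperCloud command): prints every line in the
# .out files under DIR that mentions an error or warning, prefixed with the
# file name and line number, then prints any .err files found directly in DIR.
scan_matmpi_logs() {
    sml_dir=$1
    # -H always prints the file name, -n adds line numbers, -i ignores case.
    # grep exits non-zero when nothing matches, so that status is ignored.
    find "$sml_dir" -type f -name '*.out' \
        -exec grep -H -n -i -E 'error|warning' {} + || true
    for sml_err in "$sml_dir"/*.err; do
        if [ -f "$sml_err" ]; then
            echo "== $sml_err =="
            cat "$sml_err"
        fi
    done
    return 0
}

# Example, run from your working MATLAB directory:
#   scan_matmpi_logs MatMPI
```

The file name on each matching line tells you which process's .out file to open for the surrounding context.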
Problems Starting MATLAB®
If you run into problems starting MATLAB® (e.g., you never get the MATLAB®
prompt >>), try deleting the .matlab directory (note the leading .
before "matlab") in your SuperCloud home directory, then restart
MATLAB®.
Common Problems Launching or Running Your pMatlab Job
In this section we provide solutions to common problems and errors you might encounter when running your pMatlab jobs. If you don't see a solution for your problem below, the Getting Help page may point you in the right direction.
fl:filesystem:NotDirectoryError
MATLAB® often runs into errors with its cache directory when it is
created (by default) in your home directory, as ~/.matlab. You may see
an error from MATLAB® that includes the text
fl:filesystem:notdirectoryerror.
In order to avoid these potential issues, you can point MATLAB® to a
different location for creating its cache directory by setting the
MATLAB_PREFDIR environment variable.
To change the MATLAB® cache directory, first create a directory with your username on the local file system that MATLAB® can use as its cache directory.
Then set the MATLAB_PREFDIR environment variable to the new directory. It's best to add the export line that sets this variable to your ~/.bashrc file, so that it applies to every new session.
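The two steps above might look like the following sketch. MATLAB_PREFDIR is the environment variable named above; everything else is an assumption made for illustration. In particular, `LOCAL_SCRATCH` is a hypothetical placeholder, and `/tmp` is used only so that the sketch runs anywhere, so substitute the local file system path recommended for your system.

```shell
# ASSUMPTION: LOCAL_SCRATCH is a placeholder for a directory on the local
# file system. /tmp is only a stand-in; replace it with the path recommended
# for your system.
LOCAL_SCRATCH="${LOCAL_SCRATCH:-/tmp}"

# Step 1: create a per-user directory for MATLAB to use as its cache directory.
prefdir="${LOCAL_SCRATCH}/${USER:-$(id -un)}"
mkdir -p "$prefdir"

# Step 2: point MATLAB at the new directory. Adding an equivalent export line
# (with the real path filled in) to ~/.bashrc makes the setting persistent.
export MATLAB_PREFDIR="$prefdir"
echo "MATLAB_PREFDIR=${MATLAB_PREFDIR}"
```

The variable must be set before MATLAB® starts, which is why placing the export line in ~/.bashrc is convenient.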
Undefined function or variable 'MatMPIdefs1'
The most likely reason for this error is that your path is not set
properly and you are missing the ./MatMPI path. Send email to
supercloud@mit.edu and a member of the SuperCloud team will help you
resolve the path issue.
OUT OF MEMORY Errors
There are two primary options for enabling large compute jobs: (1) requesting more memory per process by requesting more slots, and (2) dividing and distributing the workload. Each core comes with a set amount of memory (see the Systems and Software page for up-to-date numbers), so asking for more slots will get you more memory for each pMatlab process. If you have given each process an entire node and you are still running out of memory, you will need to either reduce the amount of memory used by your application or divide and distribute your workload.
Dividing and distributing the workload
If you don't need to aggregate a large data structure, you can create a
distributed matrix using more processors so that each processor is
working on a smaller portion of the data. The Param_Sweep example in
your SuperCloud home directory (~/examples/Param_Sweep) shows
how to do this.
We recommend our online course, Practical HPC. The PGAS Example: pMatlab Implementation section, within the Distributed Applications module, provides a detailed introduction, working step by step through the Parameter Sweep application that is in the examples directory in your home directory. It should take approximately 30-45 minutes to work through the section.
Your pMatlab program hangs
If your pMatlab program seems to be hung (the output logs are empty or
have not had additional data written to them, or the LLstat command shows
the job is not in the RUNNING state), you should delete the job.
The preferred method for deleting a job is to use LLkill <jobID>.
You obtain the <jobID> by using the command LLstat.
These commands are described on the General Job Management Commands
page.
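Before deleting a job, you can check the "logs are empty or have stopped growing" symptom with a sketch like the one below. `stale_out_files` is a hypothetical helper name, not a SuperCloud command, and the 30-minute threshold in the example is an arbitrary assumption. A quiet log does not by itself prove that a job is hung, since a long computation may simply print nothing for a while.

```shell
# stale_out_files DIR MINUTES
# Hypothetical helper (not a SuperCloud command): lists .out files under DIR
# that are empty or were last modified more than MINUTES minutes ago.
# Note: -mmin is a GNU/BSD find extension rather than POSIX.
stale_out_files() {
    find "$1" -type f -name '*.out' \( -size 0 -o -mmin "+$2" \)
}

# Example, run from your working MATLAB directory:
#   stale_out_files MatMPI 30
```

If every .out file is listed and the job is not making progress, that supports the decision to delete the job.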
Finding Your Output and Results
Please see the page Finding pMatlab Output for details on how to find your output and results.