
pMatlab Job Problems

This level of troubleshooting assumes that you are able to connect to the SuperCloud system and submit a job. If you are unable to connect, please email supercloud@mit.edu with the error you are experiencing.

Below, we list steps to help you sanity check your SuperCloud configuration and debug the problem, some common errors and the steps to resolve them, and where to find your output and results when running pMatlab jobs. If you still run into problems, email supercloud@mit.edu with the error you are experiencing.

Steps to Sanity Check and Debug

If you are able to submit a job but receive errors that do not seem related to your application, we recommend the following strategy:

1st Step - Is my Configuration Correct?

An easy way to check for potential configuration issues is to run the Param_Sweep example. There are instructions for running Param_Sweep on the Verifying your pMatlab setup page.

2nd Step - You can Run Param_Sweep, but your Code fails

Your configuration is correct, so the next step is to look at the output from your job. Here we are not looking at application-specific files, but rather at the output that would normally be sent to the command window. Output from the remote MATLAB® processes is redirected to .out files in the MatMPI directory (or a subdirectory within MatMPI) of your working MATLAB® directory.

The first task is to verify that every remote processor started a valid MATLAB® session and created its .out file. To check this, you can read the page on where to find pMatlab output files, or briefly:

  • Go to the MatMPI directory in your working directory
  • To find your output log files, look for subdirectories within the MatMPI directory with names of the form p<start-pid>-p<end-pid>_<compute-node-name>, where:
    • p<start-pid> is the id of the first process running on the compute node whose name is <compute-node-name>
    • p<end-pid> is the id of the last process running on that compute node
  • If your job ran on multiple compute nodes, there will be several of these subdirectories.
  • Look at the list of files in those subdirectories - there should be one file with the .out extension per process, n files in total for an n-process job
  • Each .out file name includes your script's filename, the process id, and the .out extension - e.g., for Param_Sweep on 4 processors you would see: param_sweep_parallel_v2.0.out, param_sweep_parallel_v2.1.out, param_sweep_parallel_v2.2.out, and param_sweep_parallel_v2.3.out. A quick way to verify this from the shell is shown below.
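
The following is a minimal sketch, assuming the subdirectory naming described above; your file names will vary with your script name and process count. Run it from a shell in your working MATLAB® directory:

# List the per-node subdirectories created under MatMPI
ls -d MatMPI/p*-p*_*/

# Count the .out files; the total should equal the number of processes in your job
ls MatMPI/p*-p*_*/*.out | wc -l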

If any .out file is missing, this is a problem and should be reported to the SuperCloud Team by sending email to supercloud@mit.edu. Often what the user sees is that their application hangs on an agg command, and with a bit of inspection they discover that one remote processor never properly started its MATLAB® session.

3rd Step - Errors, Warnings, and Messages from Remote MATLAB® Sessions

If all of the remote MATLAB® sessions started properly, the next step is to check each of the .out files. Each .out file contains all of the diary or screen output from one remote process. In most cases these files contain the same information (aside from numerical data) that you would see in the command window on your local machine, and they should provide some indication of the warning or error behind the failure of the compute job. If the error message is not clear to you, or you are unsure how to correct it, send email with the error information to supercloud@mit.edu.

If error messages were sent to STDERR by any of the remote MATLAB® sessions, they will appear in the .out files. They will also be written to the .err file (located in the MatMPI directory), with the process id (pid) of the process that generated the error prepended to each message, so you can look in that process's .out file for additional output that may help you debug the cause of the error.
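
For example, assuming an illustrative script named param_sweep_parallel_v2 and an error message prepended with pid 2, you might inspect the files like this:

# Show the collected error messages; each is prefixed with the pid that produced it
cat MatMPI/*.err

# Then open that process's .out file for the surrounding output, e.g. for pid 2:
less MatMPI/p*-p*_*/param_sweep_parallel_v2.2.out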

Problems Starting MATLAB®

If you run into problems starting MATLAB® (e.g., you never get the MATLAB® prompt >>), try deleting the .matlab directory (note the leading . before "matlab") in your SuperCloud home directory, then restart MATLAB®.
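
For example, from a shell on the login node (this removes only MATLAB®'s cached preferences, not your code or data):

rm -rf ~/.matlab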

Common Problems Launching or Running Your pMatlab Job

In this section we provide solutions to common problems and errors you might encounter when running your pMatlab jobs. If you don't see a solution for your problem below, the Getting Help page may point you in the right direction.

fl:filesystem:NotDirectoryError

MATLAB® often runs into errors with its cache directory when it is created (by default) in your home directory, as ~/.matlab. You may see an error from MATLAB® that includes the text fl:filesystem:NotDirectoryError.

In order to avoid these potential issues, you can point MATLAB® to a different location for creating its cache directory by setting the MATLAB_PREFDIR environment variable.

To change the MATLAB® cache directory, create a directory with your username on the local file system that MATLAB® can use as its cache directory:

mkdir /state/partition1/user/$USER/

Set the MATLAB_PREFDIR environment variable to the new directory. It's best to add this line to your ~/.bashrc file.

export MATLAB_PREFDIR=/state/partition1/user/$USER/
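
After adding the line to your ~/.bashrc, you can apply it to your current shell and confirm it is set:

source ~/.bashrc
echo $MATLAB_PREFDIR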

Undefined function or variable 'MatMPIdefs1'

The most likely reason for this error is that your path is not set properly and you are missing the ./MatMPI path. Send email to supercloud@mit.edu and a member of the SuperCloud team will help you resolve the path issue.
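
Before emailing, you can check whether the per-process MatMPIdefs files were generated; assuming (as the error above suggests) they are written into ./MatMPI, a listing like this should show one per process:

ls MatMPI/MatMPIdefs*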

OUT OF MEMORY Errors

There are two primary options for enabling large compute jobs: requesting more memory per process (by requesting more slots) and dividing and distributing the workload. Each core comes with a fixed amount of memory (see the Systems and Software page for up-to-date numbers), so asking for more slots gets each pMatlab process more memory. If you have given each process an entire node and you are still running out of memory, you will need to either reduce the amount of memory used by your application or divide and distribute your workload.

Dividing and distributing the workload

If you don't need to aggregate a large data structure, you can create a distributed matrix using more processors so that each processor works on a smaller portion of the data. The Param_Sweep example in your SuperCloud home directory, ~/examples/Param_Sweep, provides an example of how to do this.

We recommend our online course, Practical HPC. The PGAS Example: pMatlab Implementation section, within the Distributed Applications module, provides a detailed introduction, working step-by-step through the Parameter Sweep application in the examples directory in your home directory. It should take approximately 30-45 minutes to work through the section.

Your pMatlab program hangs

If your pMatlab program seems to be hung (the output logs are empty or have not had additional data written to them, or the LLstat command shows the job is not in the RUNNING state), you should delete the job.

The preferred method for deleting a job is to use LLkill <jobID>. You can obtain the <jobID> with the LLstat command. These commands are described on the General Job Management Commands page.
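
For example (the job ID below is a placeholder; use the one reported by LLstat):

# Find the ID of the hung job
LLstat

# Delete it by ID
LLkill 12345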

Finding Your Output and Results

Please see the page Finding pMatlab Output for details on how to find your output and results.