File Locking
The SuperCloud Lustre network file system, where home directories and shared directories reside, does not support file locking.
In general, there are two ways to work around this limitation:
- Disable file locking by setting an environment variable that the package uses
- Have your code or the package use the local disk on the compute node, where file locking is permitted
On this page, we'll provide instructions on how to fix this problem for various applications. Please email us at supercloud@mit.edu if you encounter a file locking issue with an application that isn't included here.
HDF5
Here is an example error message that you might see from HDF5 when it can't lock a file:
IOError: Unable to create file (file locking disabled on this file system (use HDF5_USE_FILE_LOCKING environment variable to override), errno = 38, error message = 'Function not implemented')
You can disable file locking in HDF5 by setting the HDF5_USE_FILE_LOCKING environment variable to FALSE. This variable can be set in various places.
In your .bashrc or .bash_profile
To disable file locking in every session and job, add this line to your ~/.bashrc or ~/.bash_profile file:
export HDF5_USE_FILE_LOCKING='FALSE'
Jupyter Notebook Jobs
If you are running a Jupyter Notebook, you can add the line below to the file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if it isn't there). This file is loaded at the start of Jupyter jobs, much like a .bashrc file when you log into the terminal.
export HDF5_USE_FILE_LOCKING=FALSE
For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
Python Jobs
If you are running Python code, you can set the environment variable at the beginning of your Python script, before any package that uses HDF5 is imported.
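For example (a minimal sketch; h5py here stands in for whatever HDF5-backed package you use, and the variable must be set before that package opens any files):

import os

# Disable HDF5 file locking before any HDF5-backed package touches a file
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # imported after the variable is set

with h5py.File("example.h5", "w") as f:
    f.create_dataset("x", data=[1, 2, 3])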
Use the Local Disk
Each compute node has a local disk where file locking is permitted. You can use the $TMPDIR area or the /state/partition1/user/<username> area for files that need file locking capability.
$TMPDIR
The $TMPDIR environment variable points to a temporary directory on the local disk of the compute node. Note that $TMPDIR is created by the scheduler, and the directory it points to will not exist after the job completes.
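For example, a Python job could write a lock-requiring HDF5 file under $TMPDIR and copy anything worth keeping back to the home directory before the job ends (a sketch; the file names are placeholders):

import os
import shutil

import h5py

# $TMPDIR is set by the scheduler inside a job
scratch = os.environ["TMPDIR"]
local_file = os.path.join(scratch, "results.h5")

# File locking works here because the file lives on the node's local disk
with h5py.File(local_file, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# $TMPDIR disappears when the job ends, so copy anything worth keeping
shutil.copy(local_file, os.path.expanduser("~/results.h5"))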
/state/partition1
If you would like your files to persist after the job completes, you can create your own subdirectory in the /state/partition1/user area on the local disk. However, since the /state/partition1 directory is on the local disk of each compute node (each compute node will have different files in its /state/partition1 directory), as a final step of your job you may want to copy the files from your /state/partition1 subdirectory to a shared directory, or to your home directory.
If you use the /state/partition1 directory for your files, your code should create the directory /state/partition1/user/$USER and create any desired subdirectories within that directory. Please do not write your files in the /state/partition1 directory itself: create a subdirectory with your username and save your files there.
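Here is a sketch of this pattern in Python (the myjob directory name and the destination under the home directory are placeholders; dirs_exist_ok requires Python 3.8 or newer):

import getpass
import os
import shutil

# Work in a per-user subdirectory; never write into /state/partition1 itself
local_dir = os.path.join("/state/partition1/user", getpass.getuser(), "myjob")
os.makedirs(local_dir, exist_ok=True)

# ... write the files that need locking under local_dir ...

# Final step: copy results to the home directory, since each compute node
# has its own local /state/partition1
shutil.copytree(local_dir, os.path.expanduser("~/myjob"), dirs_exist_ok=True)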
Hugging Face
You can direct Hugging Face to use local storage for files that need file locking capability by setting the HF_HOME environment variable to point to the local disk. This variable can be set in various places:
- In your ~/.bashrc or ~/.bash_profile file - the variable will always be set (this is a "set and forget" approach). You should also create the directory; see the example export and mkdir statements after this list.
- In ~/.jupyter/llsc_notebook_bashrc - the variable will always be set when you run a Jupyter Notebook. You can add the export and mkdir statements from the first bullet to the file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if it isn't there). This file is loaded at the start of Jupyter jobs, much like a .bashrc file when you log into the terminal. For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
- At the beginning of your Python code - the variable would be set only when your Python code is running; see the Python sketch after this list.
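For the .bashrc/.bash_profile approach, the export and mkdir statements might look like this (a sketch; hf_home is just an example subdirectory name under the local disk):

export HF_HOME=/state/partition1/user/$USER/hf_home
mkdir -p "$HF_HOME"

For the in-code approach, here is a minimal Python sketch; the variable must be set before any Hugging Face library is imported, and the path is again an example:

import os

# Point Hugging Face at the node's local disk before importing its libraries
hf_home = os.path.join("/state/partition1/user", os.environ["USER"], "hf_home")
os.environ["HF_HOME"] = hf_home
os.makedirs(hf_home, exist_ok=True)

import datasets  # or transformers; imported after HF_HOME is set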
We have an example that downloads the Hugging Face data to the local disk of the login node, where file locking is enabled, and then copies it back to your home directory. At the start of each job, you'd then copy it to the local disk of the compute node you are on. You can see the example by clicking here.
The run.sh script loads the model/dataset and then launches the job, batch_bert_v0.sh. The relevant lines are 9-14, 24, and 25 in run.sh, and 28-32 in batch_bert_v0.sh.