File Locking
The SuperCloud Lustre network file system, where home directories and shared directories reside, does not support file locking.
In general, there are two ways to work around this problem:
- Disable file locking by setting an environment variable that the package uses
- Have your code or the package use the local disk on the compute node, where file locking is permitted
On this page, we'll provide instructions on how to fix this problem for various applications. Please email us at supercloud@mit.edu if you encounter a file locking issue with an application that isn't included here.
HDF5
Here is an example error message that you might see from HDF5 when it can't lock a file:
IOError: Unable to create file (file locking disabled on this file system (use HDF5_USE_FILE_LOCKING environment variable to override), errno = 38, error message = 'Function not implemented')
You can disable file locking in HDF5 by setting the
HDF5_USE_FILE_LOCKING environment variable to FALSE. This variable can
be set in various places.
In your .bashrc or .bash_profile
To set the variable for all of your sessions, add this line to your
~/.bashrc or ~/.bash_profile file:
export HDF5_USE_FILE_LOCKING='FALSE'
Jupyter Notebook Jobs
If you are running a Jupyter Notebook, you can add the line below to the
file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if
it isn't there). This file is loaded at the start of Jupyter jobs, much
like a bashrc file when you log into the terminal.
export HDF5_USE_FILE_LOCKING=FALSE
For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
Python Jobs
If you are running Python code, you can set the environment variable at the beginning of your Python code, as shown below.
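For example (a minimal sketch; results.h5 is a placeholder file name, and the variable should be set before any HDF5 files are opened, so set it before importing h5py):

import os

# Disable HDF5 file locking before any HDF5 files are opened
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py

# Placeholder example: write a small dataset to an HDF5 file
with h5py.File("results.h5", "w") as f:
    f.create_dataset("x", data=[1, 2, 3])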
Use the Local Disk
Each compute node contains a local disk where file locking is
permitted. You can use the $TMPDIR area or the
/state/partition1/user/<username> area for files that need file locking
capability.
$TMPDIR
The $TMPDIR environment variable points to a temporary directory on the
local disk of the compute node. Note that $TMPDIR is created by the
scheduler, and the directory it points to will not exist after the job
completes.
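For example, in Python you could write lock-sensitive files under $TMPDIR and copy anything you want to keep back to your home directory before the job ends. This is a minimal sketch; the file name results.h5 is a placeholder:

import os
import shutil

# $TMPDIR is set by the scheduler and points to node-local scratch space
tmpdir = os.environ["TMPDIR"]
local_file = os.path.join(tmpdir, "results.h5")  # placeholder file name

# ... write local_file here with a package that needs file locking ...

# Copy the result back before the job ends; $TMPDIR is removed afterwards
shutil.copy(local_file, os.path.expanduser("~/results.h5"))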
/state/partition1
If you would like your files to persist after the job completes, you can
create your own subdirectory in the /state/partition1/user area on the local
disk. However, /state/partition1 is on the local disk of each compute
node, so each node has its own, different set of files in that directory.
As a final step of your job, you may want to copy the files from your
/state/partition1 subdirectory to a shared directory or to your home
directory.
If you use the /state/partition1 directory for your files, your code
should create the directory /state/partition1/user/$USER and create
any desired subdirectories within that directory. Please do not write
your files in the /state/partition1 directory itself - create a
subdirectory with your username and save your files there.
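Here is a minimal Python sketch of that pattern (the subdirectory name my_job and the destination path are placeholders):

import getpass
import os
import shutil

# Create your own area under /state/partition1/user, not in /state/partition1 itself
user_dir = os.path.join("/state/partition1/user", getpass.getuser())
work_dir = os.path.join(user_dir, "my_job")  # placeholder subdirectory name
os.makedirs(work_dir, exist_ok=True)

# ... write files that need file locking into work_dir ...

# As a final step, copy anything you want to keep to a shared or home directory
shutil.copytree(work_dir, os.path.expanduser("~/my_job"), dirs_exist_ok=True)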
Hugging Face
You can direct Hugging Face to use local storage for files that need file
locking capability by setting the HF_HOME environment variable to point
to the local disk. This variable can be set in various places:
- In your ~/.bashrc or ~/.bash_profile file - the variable will always be set (this is a "set and forget" approach). You should also create the directory; see the example after this list.
- In ~/.jupyter/llsc_notebook_bashrc - the variable will always be set when you run a Jupyter Notebook. You can add the export and mkdir statements shown after this list to the file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if it isn't there). This file is loaded at the start of Jupyter jobs, much like a .bashrc file when you log into the terminal. For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
- At the beginning of your Python code - the variable will be set only while your Python code is running; see the Python sketch after this list.
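For the ~/.bashrc or ~/.jupyter/llsc_notebook_bashrc approach, the lines would look something like the following (the hf_home directory name under /state/partition1/user/$USER is just an example; you can choose any path on the local disk):

export HF_HOME=/state/partition1/user/$USER/hf_home
mkdir -p $HF_HOME

For the Python approach, here is a minimal sketch; again, the path is an example, and HF_HOME should be set before you import any Hugging Face libraries so that they pick it up:

import getpass
import os

# Point the Hugging Face cache at the node-local disk, where file locking works
# (the hf_home directory name is just an example)
hf_home = os.path.join("/state/partition1/user", getpass.getuser(), "hf_home")
os.makedirs(hf_home, exist_ok=True)
os.environ["HF_HOME"] = hf_home

# Import transformers/datasets and load your models only after HF_HOME is set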
We have an example that downloads the Hugging Face data to the local disk of the login node, where file locking is enabled, and then copies it back to your home directory. At the start of each job, you'd then copy it to the local disk of the compute node you are on. You can find the example by clicking here.
The run.sh script loads the model/dataset and then launches the job,
batch_bert_v0.sh. The relevant lines are 9-14, 24, and 25 in run.sh, and
28-32 in batch_bert_v0.sh.