File Locking
The SuperCloud Lustre network file system, where home directories and shared directories reside, does not support file locking.
In general, there are two ways to work around this limitation:
- Disable file locking by setting an environment variable that the package uses
- Have your code or the package use the local disk on the compute node, where file locking is permitted
On this page, we'll provide instructions on how to fix this problem for various applications. Please email us at supercloud@mit.edu if you encounter a file locking issue with an application that isn't included here.
HDF5
Here is an example error message that you might see from HDF5 when it can't lock a file:
IOError: Unable to create file (file locking disabled on this file system (use HDF5_USE_FILE_LOCKING environment variable to override), errno = 38, error message = 'Function not implemented')
You can disable file locking in HDF5 by setting the HDF5_USE_FILE_LOCKING environment variable to FALSE. This variable can be set in various places.
In your .bashrc or .bash_profile
To disable file locking in every session and job, add this line to your ~/.bashrc or ~/.bash_profile file:
export HDF5_USE_FILE_LOCKING='FALSE'
Jupyter Notebook Jobs
If you are running a Jupyter Notebook, you can add the line below to the file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if it isn't there). This file is loaded at the start of Jupyter jobs, much like a .bashrc file when you log into the terminal.
export HDF5_USE_FILE_LOCKING=FALSE
For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
Python Jobs
If you are running Python code, you can set the environment variable at the beginning of your Python script, before any package that uses HDF5 is imported.
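For example (a minimal sketch; h5py here stands in for whatever HDF5-backed package you use, and the variable must be set before that package opens any files):

import os

# Disable HDF5 file locking before any HDF5-backed package touches a file
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # imported after the variable is set

with h5py.File("example.h5", "w") as f:
    f.create_dataset("x", data=[1, 2, 3])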
Use the Local Disk
Each compute node has a local disk where file locking is permitted. You can use the $TMPDIR area or the /state/partition1/user/<username> area for files that need file locking capability.
$TMPDIR
The $TMPDIR environment variable points to a temporary directory on the local disk of the compute node. Note that $TMPDIR is created by the scheduler, and the directory it points to will not exist after the job completes.
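For example, a Python job could write a lock-requiring HDF5 file under $TMPDIR and copy anything worth keeping back to the home directory before the job ends (a sketch; the file names are placeholders):

import os
import shutil

import h5py

# $TMPDIR is set by the scheduler inside a job
scratch = os.environ["TMPDIR"]
local_file = os.path.join(scratch, "results.h5")

# File locking works here because the file lives on the node's local disk
with h5py.File(local_file, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# $TMPDIR disappears when the job ends, so copy anything worth keeping
shutil.copy(local_file, os.path.expanduser("~/results.h5"))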
/state/partition1
If you would like your files to persist after the job completes, you can create your own subdirectory in the /state/partition1/user area on the local disk. However, since the /state/partition1 directory is on the local disk of each compute node (each compute node will have different files in its /state/partition1 directory), as a final step of your job you may want to copy the files from your /state/partition1 subdirectory to a shared directory, or to your home directory.
If you use the /state/partition1 directory for your files, your code should create the directory /state/partition1/user/$USER and create any desired subdirectories within that directory. Please do not write your files in the /state/partition1 directory itself: create a subdirectory with your username and save your files there.
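Here is a sketch of this pattern in Python (the myjob directory name and the destination under the home directory are placeholders; dirs_exist_ok requires Python 3.8 or newer):

import getpass
import os
import shutil

# Work in a per-user subdirectory; never write into /state/partition1 itself
local_dir = os.path.join("/state/partition1/user", getpass.getuser(), "myjob")
os.makedirs(local_dir, exist_ok=True)

# ... write the files that need locking under local_dir ...

# Final step: copy results to the home directory, since each compute node
# has its own local /state/partition1
shutil.copytree(local_dir, os.path.expanduser("~/myjob"), dirs_exist_ok=True)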
Hugging Face
You can direct Hugging Face to use local storage for files that need file locking capability by setting the HF_HOME environment variable to point to the local disk. This variable can be set in various places:
- In your ~/.bashrc or ~/.bash_profile file - the variable will always be set (this is a "set and forget" approach). You should also create the directory; see the example export and mkdir statements after this list.
- In ~/.jupyter/llsc_notebook_bashrc - the variable will always be set when you run a Jupyter Notebook. You can add the export and mkdir statements from the first bullet to the file ~/.jupyter/llsc_notebook_bashrc (you'll have to create the file if it isn't there). This file is loaded at the start of Jupyter jobs, much like a .bashrc file when you log into the terminal. For more information about using environment variables in Jupyter Notebooks, see the note on our Jupyter Notebooks page.
- At the beginning of your Python code - the variable would be set only when your Python code is running; see the Python sketch after this list.
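For the .bashrc/.bash_profile approach, the export and mkdir statements might look like this (a sketch; hf_home is just an example subdirectory name under the local disk):

export HF_HOME=/state/partition1/user/$USER/hf_home
mkdir -p "$HF_HOME"

For the in-code approach, here is a minimal Python sketch; the variable must be set before any Hugging Face library is imported, and the path is again an example:

import os

# Point Hugging Face at the node's local disk before importing its libraries
hf_home = os.path.join("/state/partition1/user", os.environ["USER"], "hf_home")
os.environ["HF_HOME"] = hf_home
os.makedirs(hf_home, exist_ok=True)

import datasets  # or transformers; imported after HF_HOME is set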
We have an example that downloads the Hugging Face data to the local disk of the login node, where file locking is enabled, and then copies it back to your home directory. At the start of each job, you'd then copy it to the local disk of the compute node you are on. You can see the example by clicking here.
The run.sh script loads the model/dataset and then launches the job, batch_bert_v0.sh. The relevant lines are 9-14, 24, and 25 in run.sh, and 28-32 in batch_bert_v0.sh.