Tips for Analyzing Large Datasets
If you are analyzing a dataset with a large number of files, avoid using dir, ls, and any other commands that scan the file system. Running thousands of file system scans can overload the file system. Here are some tips for working with large datasets.
Construct a filename
Instead of getting a list of all the files, it is better to construct a filename and check to see if that filename exists before reading or creating the file.
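The idea above can be sketched in shell as follows. This is a minimal illustration, not a prescribed method: the naming scheme data_0001.txt, data_0002.txt, ... and the function names build_filename and process_range are hypothetical, so adapt them to however your own files are named. The -f test looks up one path at a time instead of listing a whole directory.

```shell
# Hypothetical naming scheme: <dataset dir>/data_0001.txt, data_0002.txt, ...

# Build the path for one index without listing the directory.
# $1 = dataset directory, $2 = numeric index
build_filename() {
    printf '%s/data_%04d.txt' "$1" "$2"
}

# Walk an index range and act only on the files that exist.
# $1 = dataset directory, $2 = first index, $3 = last index
process_range() {
    i=$2
    while [ "$i" -le "$3" ]; do
        path=$(build_filename "$1" "$i")
        if [ -f "$path" ]; then
            echo "processing $path"              # replace with the real work
        else
            echo "skipping $path (not found)" >&2
        fi
        i=$((i + 1))
    done
}
```

For example, process_range /path/to/dataset 1 1000 would visit data_0001.txt through data_1000.txt and skip any that are missing.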
Create another file that lists the full file paths of each file in the dataset
Before launching your job, create a text file that lists the full path of each file in the dataset. Each processor that is processing the data can read in this single file and won't need to execute ls or any other metadata-intensive operations to get the filenames. This also makes it easy to divide up the files amongst processors in a "mimo" (multiple input, multiple output) manner that minimizes the number of starts and stops, compared to a "siso" (single input, single output) model.
"siso" vs "mimo"
"siso" is an acronym for single input, single output. In a siso model, a single process calls your application multiple times to process all of the input files that are assigned to it. Each time your application is called, the program is loaded, it processes a single input file, and then the program is unloaded.
"mimo" is an acronym for multiple input, multiple output. The advantage of using a mimo model is that your application is called and loaded just once to process all of its assigned input files. When your application starts, it should read the text file containing the list of the data files and process the appropriate subset of files before being unloaded.
See the LLMapReduce page, particularly the section discussing the --apptype=APPLICATION-TYPE option, for a more detailed explanation of the mimo model and the changes to your application that are required in order to operate in mimo mode.
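As a rough sketch of the "process the appropriate subset" step, the functions below hand out the lines of the file list round-robin by task number. This is only one way to split the list and is not how LLMapReduce does it internally. The function names and the task-id and task-count arguments are hypothetical; your launcher would need to supply those two values in some form.

```shell
# Print the entries of the file list that belong to one task.
# $1 = list file, $2 = task id (0 .. number of tasks - 1), $3 = number of tasks
select_my_files() {
    awk -v id="$2" -v n="$3" '(NR - 1) % n == id' "$1"
}

# mimo-style loop: started once, then walks its whole share of the list.
process_my_files() {
    select_my_files "$1" "$2" "$3" | while IFS= read -r path; do
        echo "task $2: $path"                    # replace with the real per-file work
    done
}
```

With a five-line list and two tasks, task 0 receives lines 1, 3 and 5, and task 1 receives lines 2 and 4.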
Generating the list of files
To generate the text file that will contain the list of files to process, log into the login node via ssh, or open a Jupyter Notebook Terminal window, and issue the following commands at the Linux command line:
$ cd datasetPath
$ lfs find "$(pwd -P)" | grep "fileSearchPattern" | sort > ./inputFilename
where:
- datasetPath is the path to your dataset,
- fileSearchPattern is a Linux regular expression that the grep command uses to find filenames that you want to include,
- inputFilename is the name of the text file that will contain a list of all your data filenames
Note that the lfs command in the example above can only be used if your files reside on the Lustre filesystem (somewhere within /home/gridsan). If your data files are located on a node's local disk (in $TMPDIR or /state/partition1), drop lfs from the beginning of the command above.
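For local-disk data, the same pipeline with lfs dropped might be wrapped as shown below. The function name make_file_list and its three arguments are hypothetical; the body is the command above with plain find in place of lfs find.

```shell
# Build a sorted list of full paths that match a regular expression.
# $1 = dataset directory, $2 = grep regular expression, $3 = output list file
make_file_list() {
    # The subshell keeps the cd from changing the caller's working directory.
    ( cd "$1" && find "$(pwd -P)" | grep "$2" | sort ) > "$3"
}
```

A call such as make_file_list "$TMPDIR/mydata" "\.txt$" ./txt-files (with mydata as a placeholder directory) would write the list to the directory you ran it from.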
Unfortunately, you can't use regular wildcards (for example: *.txt) to identify the files that you want to match because the grep command expects a regular expression. You can find information on how to use regular expressions in Linux from these external websites:
- Using Grep & Regular Expressions to Search for Text Patterns in Linux
- Regex tutorial - A quick cheatsheet by examples
- Linux Tutorial - Cheat Sheet
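To make the wildcard-versus-regular-expression difference concrete, here is a small sketch that runs grep over a few made-up paths (sample_paths and the paths it prints are invented for illustration). It shows why the dot has to be escaped and why the pattern is anchored with $.

```shell
# Rough translations from shell wildcards to grep regular expressions:
#   *.txt       ->   \.txt$
#   file?.csv   ->   /file[^/]\.csv$

# Made-up paths used only for this demonstration.
sample_paths() {
    printf '%s\n' /data/run1.txt /data/run1.txt.bak /data/run1_txt /data/run2.csv
}

# Escaped dot, anchored at the end: behaves like *.txt
sample_paths | grep '\.txt$'     # prints /data/run1.txt

# Unescaped dot matches any character, so run1_txt slips through as well
sample_paths | grep '.txt$'      # prints /data/run1.txt and /data/run1_txt
```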
Example 1
Generate a file containing a list of files within ~/examples/LLGrid_MapReduce/MultiLevelData/data that end in .txt and save the list to a file called txt-files.
From the Linux command line, execute the following commands:
- Command 1:
$ cd ~/examples/LLGrid_MapReduce/MultiLevelData/data
- Command 2:
$ lfs find "$(pwd -P)" | grep "\.txt$" | sort > ./txt-files
The contents of the file txt-files are shown below:
$ cat ./txt-files
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a33.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b33.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c33.txt
Example 2
Generate a file containing a list of files within ~/examples/LLGrid_MapReduce/MultiLevelData/data that begin with a lowercase letter, followed by the number 2, and end in .txt and save the list to a file called az-2-txt-files.
From the Linux command line, execute the following commands:
- Command 1:
$ cd ~/examples/LLGrid_MapReduce/MultiLevelData/data
- Command 2:
$ lfs find "$(pwd -P)" | grep "[a-z]2.*\.txt$" | sort > ./az-2-txt-files
The contents of the file az-2-txt-files are shown below:
$ cat ./az-2-txt-files
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c23.txt