Tips for Analyzing Large Datasets

If you are analyzing a dataset with a large number of files, avoid using dir, ls, and any other commands that scan the file system. Running thousands of file system scans can overload the file system. Here are some tips for working with large datasets.

Construct a filename

Instead of getting a list of all the files, it is better to construct a filename and check to see if that filename exists before reading or creating the file.
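A minimal shell sketch of this idea follows. It assumes a hypothetical naming scheme (run_0001.dat, run_0002.dat, and so on) and uses a throwaway directory in place of a real dataset; adapt the names and index range to your own data.

```shell
#!/bin/sh
# Hypothetical layout: files named run_0001.dat, run_0002.dat, ... in $data_dir.
data_dir=$(mktemp -d)           # stand-in for a real dataset directory
: > "$data_dir/run_0001.dat"    # create two sample files for the demo
: > "$data_dir/run_0003.dat"

found=""
missing=""
i=1
while [ "$i" -le 3 ]; do
    # Build the expected filename from the index instead of listing the directory.
    file=$(printf '%s/run_%04d.dat' "$data_dir" "$i")
    if [ -f "$file" ]; then
        found="$found $i"       # the file exists: safe to read it
    else
        missing="$missing $i"   # the file does not exist: skip it or create it
    fi
    i=$((i + 1))
done
echo "found:$found"
echo "missing:$missing"
rm -rf "$data_dir"
```

Each existence check touches only one known path, so the directory itself is never scanned.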

Create another file that lists the full file paths of each file in the dataset

Before launching your job, create a text file that lists the full path of each file in the dataset. Each processor working on the data can read this single file and won't need to execute ls or any other metadata-intensive operation to get the filenames. This also makes it easy to divide up the files amongst processors in a "mimo" (multiple input, multiple output) manner that minimizes the number of application starts and stops, compared to a "siso" (single input, single output) model.

"siso" vs "mimo"

"siso" is an acronym for single input, single output. In a siso model, a single process calls your application multiple times to process all of the input files that are assigned to it. Each time your application is called, the program is loaded, it processes a single input file, and then the program is unloaded.

"mimo" is an acronym for multiple input, multiple output. The advantage of using a mimo model is that your application is called and loaded just once to process all of its assigned input files. When your application starts, it should read the text file containing the list of the data files and process the appropriate subset of files before being unloaded.

See the LLMapReduce page, particularly the section discussing the --apptype=APPLICATION-TYPE option, for a more detailed explanation of the mimo model and the changes to your application that are required in order to operate in mimo mode.
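The file-selection step of the mimo model described above can be sketched in shell as follows. This is only an illustration: task_id, num_tasks, and the sample paths are hypothetical, and the round-robin split is one possible way to assign files, not necessarily the one LLMapReduce uses.

```shell
#!/bin/sh
# Hypothetical inputs: a list file with one path per line, and this
# process's position (task_id) among num_tasks cooperating processes.
list_file=$(mktemp)
printf '%s\n' /data/a.txt /data/b.txt /data/c.txt /data/d.txt /data/e.txt > "$list_file"
task_id=1       # 0-based index of this process (assumed)
num_tasks=2     # total number of processes (assumed)

# Round-robin split: keep every num_tasks-th line, starting at line task_id + 1.
subset=$(mktemp)
awk -v id="$task_id" -v n="$num_tasks" '(NR - 1) % n == id' "$list_file" > "$subset"

# The application starts once and loops over its whole subset.
count=0
last=""
while IFS= read -r f; do
    echo "processing $f"    # placeholder for the real per-file work
    count=$((count + 1))
    last=$f
done < "$subset"
echo "processed $count files"
rm -f "$list_file" "$subset"
```

The list file is read once per process, and no process needs to scan the file system to find its inputs.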

Generating the list of files

To generate the text file that will contain the list of files to process, log in to the login node via ssh, or open a Jupyter Notebook Terminal window, and issue the following commands at the Linux command line:

$ cd datasetPath
$ lfs find "$(pwd -P)" | grep "fileSearchPattern" | sort > ./inputFilename

where:

datasetPath is the path to your dataset
fileSearchPattern is a Linux regular expression that the grep command uses to find filenames that you want to include
inputFilename is the name of the text file that will contain a list of all your data filenames

Note that the lfs command in the example above can only be used if your files reside on the Lustre filesystem (somewhere within /home/gridsan). If your data files are located on a node's local disk (in $TMPDIR or /state/partition1), drop lfs from the beginning of the command above.
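For data on a node's local disk, the same pipeline with plain find might look like the sketch below. It builds a small throwaway directory so that it can be run anywhere; the directory layout and the output name file-list are made up for illustration. The -type f test is an optional addition that restricts the output to regular files.

```shell
#!/bin/sh
# Throwaway stand-in for a dataset on local disk (hypothetical layout).
dataset=$(mktemp -d)
mkdir -p "$dataset/a" "$dataset/b"
: > "$dataset/a/a1.txt"
: > "$dataset/b/b1.txt"
: > "$dataset/b/notes.log"

cd "$dataset" || exit 1
# Same pipeline as above, without the leading lfs.
# -type f (optional) lists regular files only, not directories.
find "$(pwd -P)" -type f | grep "\.txt$" | sort > ./file-list

num_listed=$(wc -l < ./file-list | tr -d ' ')
cat ./file-list
echo "listed $num_listed files"
cd / && rm -rf "$dataset"
```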

Unfortunately, you can't use regular wildcards (for example: *.txt) to identify the files that you want to match because the grep command expects a regular expression. Information on how to use regular expressions in Linux is available in the grep manual page (man grep).
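As a rough guide, a few common wildcards translate to regular expressions as sketched below; the sample paths are invented for illustration. In a regular expression, . matches any single character, so a literal dot is written \., and $ anchors the match to the end of the line. Because grep sees the full path, the patterns that target the filename start at the last / in the path.

```shell
#!/bin/sh
# Invented sample paths, one per line, standing in for the output of find.
paths=$(printf '%s\n' /d/a1.txt /d/a2.txt /d/b2.csv /d/txt-notes)

# Wildcard *.txt   ->  regex \.txt$      (ends in a literal ".txt")
ends_txt=$(printf '%s\n' "$paths" | grep '\.txt$')

# Wildcard a*      ->  regex /a[^/]*$    (filename starts with "a")
starts_a=$(printf '%s\n' "$paths" | grep '/a[^/]*$')

# Wildcard ?2.*    ->  regex /.2\.[^/]*$ (any one character, then "2.")
second_2=$(printf '%s\n' "$paths" | grep '/.2\.[^/]*$')

printf 'ends in .txt:\n%s\n' "$ends_txt"
printf 'starts with a:\n%s\n' "$starts_a"
printf 'second character 2:\n%s\n' "$second_2"
```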

Example 1

Generate a file containing a list of files within ~/examples/LLGrid_MapReduce/MultiLevelData/data that end in .txt and save the list to a file called txt-files

From the Linux command line, execute the following commands

Command 1:
- $ cd ~/examples/LLGrid_MapReduce/MultiLevelData/data
Command 2:
- $ lfs find "$(pwd -P)" | grep "\.txt$" | sort > ./txt-files

The contents of the file txt-files are shown below

$ cat ./txt-files

/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a1/a13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a3/a33.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b1/b13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b3/b33.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c11.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c12.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c1/c13.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c31.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c32.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c3/c33.txt

Example 2

Generate a file containing a list of files within ~/examples/LLGrid_MapReduce/MultiLevelData/data that begin with a lowercase letter, followed by the number 2, and end in .txt and save the list to a file called az-2-txt-files

From the Linux command line, execute the following commands

Command 1:
- $ cd ~/examples/LLGrid_MapReduce/MultiLevelData/data
Command 2:
- $ lfs find "$(pwd -P)" | grep "[a-z]2.*\.txt$" | sort > ./az-2-txt-files

The contents of the file az-2-txt-files are shown below

$ cat ./az-2-txt-files

/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/a/a2/a23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/b/b2/b23.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c21.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c22.txt
/home/gridsan/AN23082/examples/LLGrid_MapReduce/MultiLevelData/data/c/c2/c23.txt