Datasets

SuperCloud hosts a variety of large datasets that can be valuable for your research. These datasets are shared by program groups and/or organizational groups so that the data does not have to be duplicated and so that data and code can be shared.

You can find and access the datasets that are available on the SuperCloud system by looking in /home/gridsan/groups/datasets.
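As a minimal sketch of browsing that directory from Python, the snippet below lists the subdirectories found under the path given above. The function name `list_datasets` is made up for illustration, and the sketch assumes that each dataset lives in its own subdirectory, which this page does not state.

```python
from pathlib import Path

# Path taken from the text above; it only exists on the SuperCloud system.
DATASETS_ROOT = Path("/home/gridsan/groups/datasets")


def list_datasets(root=DATASETS_ROOT):
    """Return the sorted names of the subdirectories directly under root.

    Hypothetical helper: assumes one subdirectory per dataset.
    Returns an empty list if root is missing or is not a directory.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_datasets():
        print(name)
```

Run off the system, the script prints nothing, because the directory is not present there.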

If your project needs to share data, see our page on Group Shared Directories for directions on requesting a Group Shared Directory and how to use Group Shared Directories safely.

Public Datasets

Datacenter

MIT SuperCloud Dataset

This dataset consists of the labelled parts of the data described in the paper The MIT SuperCloud Dataset. The archive contains compressed CSV files of monitoring data from the MIT SuperCloud system. For details on the capabilities offered by the MIT SuperCloud cluster, see Reuther et al., IEEE HPEC 2018.
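One way to take a first look at a compressed CSV file is to read only its header and first few rows. The sketch below assumes the files are gzip-compressed, which this page does not confirm; other compression formats would need a different reader. The function name `preview_csv_gz` and any file path passed to it are hypothetical.

```python
import csv
import gzip
from itertools import islice


def preview_csv_gz(path, n_rows=5):
    """Return (header, rows) for a gzip-compressed CSV file.

    Assumption: the file is gzip-compressed and its first line is a header.
    Only the first n_rows data rows are read, so large files stay cheap to inspect.
    """
    # Text mode with newline="" is what the csv module expects.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows
```

Reading a bounded number of rows keeps memory use small, which matters when the monitoring files are large.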

Citation

If you use this data in your work, please cite the following paper:

@misc{,
      title={The MIT SuperCloud Dataset},
      author={Siddharth Samsi and Matthew L Weiss and David Bestor and Baolin Li and Michael Jones and Albert Reuther and
      Daniel Edelman and William Arcand and Chansup Byun and John Holodnack and Matthew Hubbell and Jeremy Kepner and
      Anna Klein and Joseph McDonald and Adam Michaleas and Peter Michaleas and Lauren Milechin and Julia Mullen and
      Charles Yee and Benjamin Price and Andrew Prout and Antonio Rosa and Allan Vanterpool and Lindsey McEvoy and
      Anson Cheng and Devesh Tiwari and Vijay Gadepally},
      year={2021},
      eprint={2108.02037},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}

ImageNet Data

http://www.image-net.org/challenges/LSVRC/2012/

ImageNet Large Scale Visual Recognition Competition 2012 validation, test and training data.

LADI

https://github.com/LADI-Dataset/

A dataset of images of various disasters collected by the Civil Air Patrol. Two key distinctions are the low-altitude, oblique perspective of the imagery and its disaster-related features. The dataset currently employs a hierarchical labeling scheme of five coarse categories, with more specific annotations for each category. The initial dataset focuses on the Atlantic hurricane and spring flooding seasons since 2015. Annotations produced by the commercial Google Cloud Vision service and the open-source Places365 benchmark are also provided.

Microsoft COCO

http://cocodataset.org

COCO is a large-scale object detection, segmentation, and captioning dataset, with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context.