Overview
Written by Mike Smith
Astronomical datasets are growing exponentially in size and complexity, with modern surveys like JWST, LSST, and others capturing multi-wavelength observations of billions of celestial objects, creating rich multimodal datasets that are invaluable for training machine learning models (and for discovering new astronomy!). The Multimodal Universe (MMU) is a ~100TB dataset of cross-matchable astronomical objects compiled by astronomers working across different astronomical disciplines. It is one of the most comprehensive multimodal astronomical datasets available, but its size creates a significant barrier to entry: downloading and hosting 100TB locally is not feasible for many researchers.
While Hugging Face’s dataset streaming service is excellent (and we have used it to host some pre-crossmatched MMU data), the current implementation does not allow for server-side cross-matching. This feature is essential as it makes it possible to use the MMU to study the same astronomical object across different wavelengths, modalities, and surveys.
What the astronomical ML community needs is a way to remotely cross-match and stream data directly from a remote server. The goal is to enable researchers to write something along the lines of:
from datasets import load_dataset
mmu = load_dataset("path/to/remote/MMU", streaming=True, crossmatch=("jwst", "hsc"))
And have the server handle cross-matching on the fly, streaming only the relevant paired observations. In a nutshell, this project aims to:
- Extend HF Datasets to support crossmatching for scientific data formats (starting with the MMU)
- Implement efficient spatial indexing and streaming using pre-computed HEALPix chunks for fast seeking across matched datasets
- Contribute this functionality upstream to the main HF Datasets branch, benefiting not just astronomy but any field requiring spatial or relational matching (climate science, particle physics, remote sensing, etc.)
- Make the MMU accessible to the broader ML community without requiring massive local storage
Technical Background
About the MMU Dataset
- Natively in HDF5 format hosted at the Flatiron Institute.
- For convenience and fast streaming, a copy of the dataset in parquet format is on Hugging Face as individual datasets within the Multimodal Universe org
What is Crossmatching?
Crossmatching is the process of identifying the same astronomical object across observations from different instruments and surveys. Objects are matched by their sky coordinates, right ascension (ra) and declination (dec), within a tolerance, usually ~1 arcsecond. This allows researchers to study the same object across different modalities.
What’s the Problem?
Currently, as mentioned in the overview, if a researcher wants crossmatched JWST and HSC data, they must first download both the JWST and HSC catalogs, then run the crossmatching locally. The disk space required to do all this is prohibitive for many researchers.
Data Format Deliberation
The project contributors deliberated on data format options, which boil down to: HDF5 vs. HATS. HDF5, as mentioned before, is the MMU's native format. Ultimately, we decided to convert the MMU to HATS. Before diving into the technical details of how we would do that, it is worth taking a closer look at both HDF5 and HATS to understand why.
HDF5
HDF5 is a hierarchical data format for working with large, complex, and heterogeneous data, usually scientific data.
HDF5 comprises three parts:
- An actual file format for storing HDF5 data (.h5)
- A data model to locally organize and access HDF5 data from an application
- The software (libraries, language interfaces, and tools) to work with the format
The data model is really the core so we will go into this first. The two main objects in the data model are:
- Groups
- Datasets
You can think of groups as analogous to UNIX directories. Every HDF5 file must contain a root group, / (looks familiar, right?). In the same spirit, objects in an HDF5 file are addressed by their full path names, e.g.:
- / = the root group
- /foo = foo is within /
- /foo/boo = boo is within foo, which is within /
To recap, an HDF5 file is basically a UNIX filesystem contained within a single file.
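The group/path analogy can be seen directly with h5py, the standard Python interface to HDF5. A minimal sketch (the file here lives in memory via the core driver with no backing store, purely for illustration; the names are made up):

```python
import h5py

# backing_store=False keeps everything in memory; no "demo.h5" is written.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    foo = f.create_group("foo")        # lives at /foo
    boo = foo.create_group("boo")      # lives at /foo/boo

    root_name = f.name                 # "/"
    foo_name = foo.name                # "/foo"
    boo_name = boo.name                # "/foo/boo"

    # Objects are addressable by full path, just like a filesystem.
    same = f["/foo/boo"] == boo
```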
Moving on to datasets: continuing our UNIX filesystem analogy, we can think of datasets as files. HDF5 datasets contain and organize the raw data. Datasets are usually contained in groups, but don't need to be (e.g. a dataset can live directly in the root group).
The cool thing about datasets is that they are self-describing, i.e. you do not need external metadata. A dataset consists of two things: the metadata plus the actual raw data.
The raw data is pretty straightforward, so we will go in-depth on the metadata. The metadata consists of four objects:
- Datatypes
- Dataspaces
- Properties
- Attributes (optional)
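These pieces are visible from h5py: the datatype and the dataspace (shape) come back from the dataset object itself, and attributes are attached via .attrs. A minimal sketch with an in-memory file and made-up data:

```python
import numpy as np
import h5py

with h5py.File("meta.h5", "w", driver="core", backing_store=False) as f:
    # Creating a dataset fixes its datatype (float64 here) and its
    # dataspace (a 1-D shape of 3 elements), inferred from the data.
    dset = f.create_dataset("flux", data=np.array([1.0, 2.0, 3.0]))

    # Attributes are optional, user-defined metadata on the dataset.
    dset.attrs["units"] = "Jy"

    dtype = dset.dtype            # datatype
    shape = dset.shape            # dataspace
    units = dset.attrs["units"]   # attribute, read back
```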