What’s new in HDF5 1.10 – New Frontiers Initiative

Presenter: Gerd Heber, Applications Architect, HDF Group

Date: March 2018

Video: https://www.youtube.com/watch?v=5jjWCXSBYLc

Abstract

HDF5 is one of the most widely used I/O libraries and self-describing portable binary file formats for managing data on HPC systems. Introduced more than 19 years ago, it is still under active development. In this talk, we will give an overview of the HDF5 development and release cycle in general, and of the new features in the current family of HDF5 releases.

We will describe a new storage layout called Virtual Dataset Layout, which allows one to access data stored in multiple HDF5 datasets across HDF5 files as a single (logical) HDF5 dataset. We will also present a new way of reading data while it is being written to an HDF5 file (“Single Writer/Multiple Reader” or SWMR feature). Both features were released in HDF5 version 1.10.0 in March 2016.
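The idea behind a virtual dataset can be pictured as an index mapping from one logical dataset onto several source datasets. The following is only a toy model in plain Python, not the HDF5 API: the class name VirtualMapping, the method locate, and the file and dataset names are all hypothetical, and it assumes one-dimensional sources concatenated end to end, whereas the real feature maps selections of arbitrary dimensionality.

```python
from bisect import bisect_right


class VirtualMapping:
    """Toy model of a 1-D logical dataset stitched from several sources.

    Hypothetical illustration only; this is not how the HDF5 library
    stores or resolves virtual dataset mappings.
    """

    def __init__(self, sources):
        # sources: list of (file_name, dataset_name, length) tuples
        self.sources = list(sources)
        self.starts = []  # logical start index of each source
        total = 0
        for _, _, length in self.sources:
            self.starts.append(total)
            total += length
        self.length = total

    def locate(self, index):
        """Map a logical index to (file_name, dataset_name, local_index)."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        # Find the last source whose logical start is <= index.
        i = bisect_right(self.starts, index) - 1
        file_name, dataset_name, _ = self.sources[i]
        return file_name, dataset_name, index - self.starts[i]


# Three hypothetical source files, each holding a dataset named "data".
vds = VirtualMapping([("a.h5", "data", 100),
                      ("b.h5", "data", 50),
                      ("c.h5", "data", 25)])
print(vds.length)       # 175
print(vds.locate(100))  # ('b.h5', 'data', 0)
```

A reader of the logical dataset only sees indices 0 to 174; which file actually holds an element is resolved by the mapping.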

Application memory footprint and efficient I/O are the focus of the imminent HDF5 1.10.1 release (April 2017).

We will explain how one can reduce application memory usage by taking control of the HDF5 metadata cache (“evict on close” feature) and how to accelerate I/O for applications that use HDF5 files as restart files by invoking the “cache image” feature.

Small and random I/O accesses cause poor performance on many HPC systems. HDF5 release 1.10.1 introduces a new file space management strategy (“paged aggregation” feature) and an additional caching layer (“page buffering” feature) to mitigate the problem. If configured, the HDF5 library aggregates small metadata and raw data allocations into fixed-size, well-aligned pages, which are suitable for page caching. Each page in memory corresponds to a page allocated in the file. Access to the file system is then performed on a single page, or on multiple pages if they are contiguous, so that small accesses to the file system are avoided.
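The effect of paging on access patterns can be sketched with a little arithmetic. The following is a simplified model in plain Python, not the library's implementation: the function names and the 4096-byte page size are assumptions made for illustration. It turns a set of small byte-range accesses into whole-page reads and merges contiguous pages into a single request.

```python
def pages_for_access(offset, length, page_size):
    """Return the page indices touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


def coalesce_page_reads(accesses, page_size):
    """Convert small (offset, length) accesses into page-aligned reads.

    Returns a list of (file_offset, byte_count) pairs, one per run of
    contiguous pages. Illustrative model only.
    """
    pages = sorted({p
                    for offset, length in accesses
                    for p in pages_for_access(offset, length, page_size)})
    runs = []  # each entry is [first_page, last_page]
    for p in pages:
        if runs and runs[-1][1] == p - 1:
            runs[-1][1] = p        # extend the current contiguous run
        else:
            runs.append([p, p])    # start a new run
    return [(first * page_size, (last - first + 1) * page_size)
            for first, last in runs]


PAGE_SIZE = 4096  # assumed page size for this example
small_accesses = [(10, 8), (100, 16), (4090, 20), (20000, 4)]
print(coalesce_page_reads(small_accesses, PAGE_SIZE))
# [(0, 8192), (16384, 4096)]
```

Four small accesses become two page-aligned requests: pages 0 and 1 are contiguous and are fetched together, while page 4 is fetched on its own. In the HDF5 C API, the corresponding settings are made through property lists, with H5Pset_file_space_strategy and H5Pset_file_space_page_size on the file creation property list and H5Pset_page_buffer_size on the file access property list.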

We will also give an overview of new features beyond the HDF5 1.10.1 release, including the forthcoming support for parallel compression and other I/O optimizations, which will be included in future releases.

Target Audience: Researchers and developers, data scientists

Prerequisites: General scientific/high-performance computing background

Training and Reference Materials: