HDF5 Notebook

COMPUTING-BLOG
Scientific Computing Databases

Notebook

Introduction to HDF terms and definitions

Still under construction… mainly a collection of information of the manual at this stage…

Table of Contents

An HDF5 file is a container for storing a variety of scientific data and is composed of two primary types of objects: groups and datasets.

  • HDF5 group: a grouping structure containing zero or more HDF5 objects, together with supporting metadata
  • HDF5 dataset: a multidimensional array of data elements, together with supporting metadata

Any HDF5 group or dataset may have an associated attribute list. An HDF5 attribute is a user-defined HDF5 structure that provides extra information about an HDF5 object.

Working with groups and datasets is similar in many ways to working with directories and files in UNIX. As with UNIX directories and files, an HDF5 object in an HDF5 file is often referred to by its full path name (also called an absolute path name).

  • / means the root group.
  • /foo means a member of the root group called foo.
  • /foo/zoo means a member of the group foo, which in turn is a member of the root group.

HDF5 Groups

An HDF5 group is a structure containing zero or more HDF5 objects. A group has two parts:

  • A group header, which contains a group name and a list of group attributes.
  • A group symbol table, which is a list of the HDF5 objects that belong to the group.

HDF5 Datasets

A dataset is stored in a file in two parts: a header and a data array.

The header contains information that is needed to interpret the array portion of the dataset, as well as metadata (or pointers to metadata) that describes or annotates the dataset. Header information includes the name of the object, its dimensions, its number-type, information about how the data itself is stored on disk, and other information used by the library to speed up access to the dataset or maintain the file’s integrity.

There are four essential classes of information in any header: name, datatype, dataspace, and storage layout:

  • _Name_. A dataset name is a sequence of alphanumeric ASCII characters.
  • _Datatype_. HDF5 allows one to define many different kinds of datatypes. There are two categories of datatypes: atomic datatypes and compound datatypes. Atomic datatypes can also be system-specific, or NATIVE, and all datatypes can be names:

Atomic datatypes include integers and floating-point numbers. Each atomic type belongs to a particular class and has several properties: size, order, precision, and offset. In this introduction, we consider only a few of these properties.

Atomic classes include integer, float, string, bit field, and opaque. (Note: Only integer, float and string classes are available in the current implementation.)

Properties of integer types include size, order (endian-ness), and signed-ness (signed/unsigned).

Properties of float types include the size and location of the exponent and mantissa, and the location of the sign bit.

The datatypes that are supported in the current implementation are:

NATIVE datatypes. Although it is possible to describe nearly any kind of atomic datatype, most applications will use predefined datatypes that are supported by their compiler. In HDF5 these are called native datatypes. NATIVE datatypes are C-like datatypes that are generally supported by the hardware of the machine on which the library was compiled. In order to be portable, applications should almost always use the NATIVE designation to describe data values in memory.

(See Datatypes in the HDF User’s Guide for further information.)

A compound datatype is one in which a collection of several datatypes are represented as a single unit, a compound datatype, similar to a struct in C. The parts of a compound datatype are called members. The members of a compound datatype may be of any datatype, including another compound datatype. It is possible to read members from a compound type without reading the whole type.

Named datatypes. Normally each dataset has its own datatype, but sometimes we may want to share a datatype among several datasets. This can be done using a named datatype. A named datatype is stored in the file independently of any dataset, and referenced by all datasets that have that datatype. Named datatypes may have an associated attributes list. See Datatypes in the HDF User’s Guide for further information.

Dataspace. A dataset dataspace describes the dimensionality of the dataset. The dimensions of a dataset can be fixed (unchanging), or they may be unlimited, which means that they are extendible (i.e. they can grow larger).

Properties of a dataspace consist of the rank (number of dimensions) of the data array, the actual sizes of the dimensions of the array, and the maximum sizes of the dimensions of the array. For a fixed-dimension dataset, the actual size is the same as the maximum size of a dimension. When a dimension is unlimited, the maximum size is set to the value H5P_UNLIMITED. (An example below shows how to create extendible datasets.)

A dataspace can also describe portions of a dataset, making it possible to do partial I/O operations on selections. Selection is supported by the dataspace interface (H5S). Given an n-dimensional dataset, there are currently four ways to do partial selection:

Since I/O operations have two end-points, the raw data transfer functions require two dataspace arguments: one describes the application memory dataspace or subset thereof, and the other describes the file dataspace or subset thereof.

(See Dataspaces and Partial I/O in the HDF User’s Guide for further information.)

Storage layout. The HDF5 format makes it possible to store data in a variety of ways. The default storage layout format is contiguous, meaning that data is stored in the same linear way that it is organized in memory. Two other storage layout formats are currently defined for HDF5: compact, and chunked. In the future, other storage layouts may be added.

Compact storage is used when the amount of data is small and can be stored directly in the object header. (Note: Compact storage is not supported in this release.)

Chunked storage involves dividing the dataset into equal-sized “chunks” that are stored separately. Chunking has three important benefits.

  • It makes it possible to achieve good performance when accessing subsets of the datasets, even when the subset to be chosen is orthogonal to the normal storage order of the dataset.
  • It makes it possible to compress large datasets and still achieve good performance when accessing subsets of the dataset.
  • It makes it possible efficiently to extend the dimensions of a dataset in any direction.

(See Datasets in the HDF User’s Guide for further information. Also see Dataset Chunking Issues.)

Up to table of contents