NCSA

Creating a Dataset


What is a Dataset?

A dataset is a multidimensional array of data elements, together with supporting metadata. To create a dataset, the application program must specify the location at which to create the dataset, the dataset name, the datatype and dataspace of the data array, and the dataset creation property list.

Datatypes

A datatype is a collection of datatype properties, all of which can be stored on disk and which, taken as a whole, provide complete information for converting data to or from that datatype.

There are two categories of datatypes in HDF5: atomic and compound datatypes. An atomic datatype is a datatype which cannot be decomposed into smaller datatype units at the API level. These include the integer, float, date and time, string, bitfield, and opaque datatypes. A compound datatype is a collection of one or more atomic datatypes and/or small arrays of such datatypes.

Figure 5.1 shows the HDF5 datatypes. Some of the HDF5 predefined atomic datatypes are listed in Figures 5.2a and 5.2b. In this tutorial, we consider only HDF5 predefined integers. For further information on datatypes, see The Datatype Interface (H5T) in the HDF5 User's Guide.

Fig. 5.1   HDF5 datatypes


                                          +--  integer
                                          +--  floating point
                        +---- atomic  ----+--  date and time
                        |                 +--  character string
       HDF5 datatypes --|                 +--  bitfield
                        |                 +--  opaque
                        |
                        +---- compound

Fig. 5.2a   Examples of HDF5 predefined datatypes

   Datatype           Description
   H5T_STD_I32LE      Four-byte, little-endian, signed, two's complement integer
   H5T_STD_U16BE      Two-byte, big-endian, unsigned integer
   H5T_IEEE_F32BE     Four-byte, big-endian, IEEE floating point
   H5T_IEEE_F64LE     Eight-byte, little-endian, IEEE floating point
   H5T_C_S1           One-byte, null-terminated string of eight-bit characters

Fig. 5.2b   Examples of HDF5 predefined native datatypes

   Native Datatype        Corresponding C or FORTRAN Type
   C:
   H5T_NATIVE_INT         int
   H5T_NATIVE_FLOAT       float
   H5T_NATIVE_CHAR        char
   H5T_NATIVE_DOUBLE      double
   H5T_NATIVE_LDOUBLE     long double
   FORTRAN:
   H5T_NATIVE_INT         integer
   H5T_NATIVE_REAL        real
   H5T_NATIVE_DOUBLE      double precision
   H5T_NATIVE_CHAR        character
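
To make the byte-order wording in Fig. 5.2a concrete, the following is a minimal sketch in plain C (it does not call the HDF5 library) of how a signed 32-bit value would be laid out under H5T_STD_I32BE and H5T_STD_I32LE. The names encode_i32be and encode_i32le are hypothetical helpers introduced for illustration; they are not part of HDF5.

```c
#include <stdint.h>

/* Hypothetical helper: store a signed 32-bit value as four bytes,
   most significant byte first (the layout H5T_STD_I32BE describes). */
void encode_i32be(int32_t value, unsigned char out[4])
{
    uint32_t u = (uint32_t)value;          /* two's complement bit pattern */
    out[0] = (unsigned char)((u >> 24) & 0xFFu);
    out[1] = (unsigned char)((u >> 16) & 0xFFu);
    out[2] = (unsigned char)((u >> 8) & 0xFFu);
    out[3] = (unsigned char)(u & 0xFFu);
}

/* Hypothetical helper: same value, least significant byte first
   (the layout H5T_STD_I32LE describes). */
void encode_i32le(int32_t value, unsigned char out[4])
{
    uint32_t u = (uint32_t)value;
    out[0] = (unsigned char)(u & 0xFFu);
    out[1] = (unsigned char)((u >> 8) & 0xFFu);
    out[2] = (unsigned char)((u >> 16) & 0xFFu);
    out[3] = (unsigned char)((u >> 24) & 0xFFu);
}
```

For the value 1, the big-endian form places the nonzero byte last and the little-endian form places it first. A native datatype such as H5T_NATIVE_INT corresponds to whichever layout the host machine uses.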

Datasets and Dataspaces

A dataspace describes the dimensionality of the data array. A dataspace is either a regular N-dimensional array of data points, called a simple dataspace, or a more general collection of data points organized in another manner, called a complex dataspace. Figure 5.3 shows HDF5 dataspaces. In this tutorial, we only consider simple dataspaces.

Fig. 5.3   HDF5 dataspaces


                         +-- simple
       HDF5 dataspaces --|
                         +-- complex

The dimensions of a dataset can be fixed (unchanging), or they may be unlimited, which means that they are extensible. A dataspace can also describe a portion of a dataset, making it possible to do partial I/O operations on selections.

Dataset Creation Property Lists

When creating a dataset, HDF5 allows the user to specify how raw data is organized and/or compressed on disk. This information is stored in a dataset creation property list and passed to the dataset interface. The raw data on disk can be stored contiguously (in the same linear way that it is organized in memory), partitioned into chunks, stored externally, etc. In this tutorial, we use the default dataset creation property list; that is, contiguous storage layout and no compression are used. For more information about dataset creation property lists, see The Dataset Interface (H5D) in the HDF5 User's Guide.

In HDF5, datatypes and dataspaces are independent objects that are created separately from any dataset to which they might be attached. Creating a dataset therefore requires that its datatype and dataspace be defined first. In this tutorial, we use HDF5 predefined datatypes (integer), which do not need to be created, and consider only simple dataspaces. Hence, only the dataspace object must be created.

To create an empty dataset (no data written) the following steps need to be taken:

  1. Obtain the location identifier where the dataset is to be created.
  2. Define the dataset characteristics and the dataset creation property list.
  3. Create the dataset.
  4. Close the datatype, the dataspace, and the property list if necessary.
  5. Close the dataset.

To create a simple dataspace, the calling program must contain calls to create and close the dataspace. For example:

C:

   space_id = H5Screate_simple (rank, dims, maxdims);
   status = H5Sclose (space_id);

FORTRAN:

   CALL h5screate_simple_f (rank, dims, space_id, hdferr, maxdims=max_dims)
        or
   CALL h5screate_simple_f (rank, dims, space_id, hdferr)

   CALL h5sclose_f (space_id, hdferr)

To create a dataset, the calling program must contain calls to create and close the dataset. For example:

C:

   dset_id = H5Dcreate (loc_id, name, type_id, space_id, creation_prp);
   status = H5Dclose (dset_id);

FORTRAN:

   CALL h5dcreate_f (loc_id, name, type_id, space_id, dset_id, &
                     hdferr, creation_prp=creat_plist_id)
        or
   CALL h5dcreate_f (loc_id, name, type_id, space_id, dset_id, hdferr)

   CALL h5dclose_f (dset_id, hdferr)

If the predefined datatypes are used in FORTRAN, calls must be made to initialize and terminate access to them:

   CALL h5init_types_f (hdferr)
   CALL h5close_types_f (hdferr)

h5init_types_f must be called before any HDF5 library subroutine calls are made; h5close_types_f must be called after the final HDF5 library subroutine call. See the programming example below for an illustration of the use of these calls.

Programming Example

Description

The following example shows how to create an empty dataset. It creates a file called dset.h5 in the C version (dsetf.h5 in FORTRAN), defines the dataset dataspace, creates a dataset which is a 4x6 integer array, and then closes the dataspace, the dataset, and the file.
NOTE: To download a tar file of the examples, including a Makefile, please go to the References page of this tutorial.

Remarks

File Contents

The contents of the file dset.h5 (dsetf.h5 for FORTRAN) are shown in Figure 5.4 and Figures 5.5a and 5.5b.

Figure 5.4   Contents of dset.h5 (dsetf.h5)

Figure 5.5a   dset.h5 in DDL

HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE { H5T_STD_I32BE }
      DATASPACE { SIMPLE ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0
      }
   }
}
}

Figure 5.5b   dsetf.h5 in DDL

HDF5 "dsetf.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE { H5T_STD_I32BE }
      DATASPACE { SIMPLE ( 6, 4 ) / ( 6, 4 ) }
      DATA {
         0, 0, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0,
         0, 0, 0, 0
      }
   }
}
}

Note in Figures 5.5a and 5.5b that H5T_STD_I32BE, a 32-bit big-endian integer, is an HDF5 atomic datatype.
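
Figures 5.5a and 5.5b also list the dataset dimensions in opposite order, ( 4, 6 ) versus ( 6, 4 ). One way to read this, assuming the usual row-major layout for C arrays and column-major layout for FORTRAN arrays, is that a 4x6 FORTRAN array occupies memory in the same order as a 6x4 C array. The sketch below is plain C with hypothetical helper names (it does not call HDF5) and checks that element (i, j) of a column-major array has the same linear offset as element (j, i) of the transposed row-major array.

```c
#include <stddef.h>

/* Offset of element (i, j) in a row-major (C order) array with ncols columns. */
size_t row_major_offset(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}

/* Offset of element (i, j) in a column-major (FORTRAN order) array with
   nrows rows; indices are zero-based here to keep the comparison simple. */
size_t col_major_offset(size_t i, size_t j, size_t nrows)
{
    return j * nrows + i;
}

/* Returns 1 if every element (i, j) of an nrows x ncols column-major array
   sits at the same offset as element (j, i) of an ncols x nrows row-major
   array, and 0 otherwise. */
int same_layout(size_t nrows, size_t ncols)
{
    size_t i, j;
    for (i = 0; i < nrows; i++)
        for (j = 0; j < ncols; j++)
            if (col_major_offset(i, j, nrows) != row_major_offset(j, i, nrows))
                return 0;
    return 1;
}
```

Under these assumptions, same_layout(4, 6) returns 1, which is consistent with the FORTRAN file reporting its dataspace as ( 6, 4 ).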

Dataset Definition in DDL

The following is the simplified DDL dataset definition:

Fig. 5.6   HDF5 Dataset Definition

      <dataset> ::= DATASET "<dataset_name>" { <datatype>
                                               <dataspace>
                                               <data>
                                               <dataset_attribute>* }

      <datatype> ::= DATATYPE { <atomic_type> }

      <dataspace> ::= DATASPACE { SIMPLE <current_dims> / <max_dims> }

      <dataset_attribute> ::= <attribute>


The National Center for Supercomputing Applications

University of Illinois at Urbana-Champaign

hdfhelp@ncsa.uiuc.edu
Describes HDF5 Release 1.2.2, June 2000
Last Modified: April 5, 2000