From 5106d9643f65b90cb3437c7c0ea55f5b5f90227f Mon Sep 17 00:00:00 2001
From: Albert Cheng
Date: Wed, 15 Dec 2004 18:21:23 -0500
Subject: [svn-r9672] Purpose: New document committed. Description: Report of the Data Transform work.
---
 doc/html/TechNotes/DataTransformReport.htm | 877 +++++++++++++++++++++++++++++
 1 file changed, 877 insertions(+)
 create mode 100644 doc/html/TechNotes/DataTransformReport.htm

diff --git a/doc/html/TechNotes/DataTransformReport.htm b/doc/html/TechNotes/DataTransformReport.htm
new file mode 100644
index 0000000..5a1a158
--- /dev/null
+++ b/doc/html/TechNotes/DataTransformReport.htm
@@ -0,0 +1,877 @@
+ +

Arithmetic Data Transforms

+ +

Leon Arber, Albert Cheng, William Wendling[1]

+ +

December 10, 2004

+ +

Purpose

+ +

Data can be stored and represented in many different ways. In most fields of science, for example, the metric system is used for storing all data. However, many fields of engineering still use the English system. In such scenarios, there needs to be a way to easily perform arbitrary scaling of data. The data transforms provide just such functionality. They allow arbitrary arithmetic expressions to be applied to a dataset during read and write operations. This means that data can be stored in Celsius in a data file, but read in and automatically converted to Fahrenheit. Alternatively, data that is obtained in Fahrenheit can be written out to the data file in Celsius.

+ +

 

+ +

Although a user can always manually modify the data they read and write, having the data transform as a property means that the user doesn’t have to worry about forgetting to call the conversion function or even writing it in the first place.

+ +

 

+ +

Usage

+ +

The data transform functionality is implemented as a property that is set on a dataset transfer property list. There are two functions available: one for setting the transform and another for finding out what transform, if any, is currently set.

+ +

 

+ +

The function for setting the transform is:

+ +

herr_t H5Pset_data_transform(hid_t plist_id, const char* expression)

+ +

 

+ +

plist_id is the identifier of the dataset transfer property list on which the data transform property should be set.

+ +

expression is a pointer to a string of the form “(5/9.0)*(x-32)” which describes the transform.

+ +

 

+ +

The function for getting the transform is:

+ +

ssize_t H5Pget_data_transform(hid_t plist_id, char* expression, size_t size)

+ +

 

+ +

plist_id is the identifier of the dataset transfer property list which will be queried for its data transform property.

+ +

expression is either NULL or a pointer to memory where the data transform string, if present, will be copied.

+ +

size is the number of bytes to copy from the transform string into expression. H5Pget_data_transform will never copy more than the length of the transform expression.

+ +

 

+ +

Data Transform Expressions

+ +

Data transforms are set by passing a pointer to a string, which is the data transform expression. This string describes what sort of arithmetic transform should be done during a read or write data transfer. The string is a standard mathematical expression, as would be entered into something like MATLAB.

+ +

Expressions are defined by the following context-free grammar:

+ +

 

+ +

expr   := term | term + term | term - term
term   := factor | factor * factor | factor / factor
factor := number | symbol | - factor | + factor | ( expr )
symbol := [a-zA-Z][a-zA-Z0-9]*
number := INT | FLOAT

+ +

 

+ +

where INT is interpreted as a C long int and FLOAT is interpreted as a C double.

+ +

 

+ +

This grammar allows for order of operations (multiplication and division take precedence over addition and subtraction), floating and integer constants, and grouping of terms by way of parentheses. Although the grammar allows symbols to be arbitrary strings, this documentation will always use ‘x’ for symbols.

+ +

 

+ +

Within a transform expression, the symbol represents a variable which contains the data to be manipulated. For this reason, the terms symbol and variable will be used interchangeably. Furthermore, in the current implementation of data transforms, all symbols appearing in an expression are interpreted as referring to the same dataset. So, an expression such as “alpha + 5” is equivalent to “x+5” and an expression such as “alpha + 3*beta + 5” is equivalent to “alpha + 3*alpha + 5” which is equivalent to “4*x + 5”.
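The grammar and the symbol rule above can be sketched as a small recursive-descent evaluator. This is only an illustration, not the library's parser: the names value_t, apply, parse_factor, parse_term, parse_expr and eval_transform are hypothetical. The sketch assumes well-formed input with nonzero divisors, assumes chains such as “a + b + c” are accepted, and binds every symbol to a single double value rather than to a typed buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* A value is either INT (C long) or FLOAT (C double), as in the grammar. */
typedef struct { int is_float; long i; double d; } value_t;

static const char *cur;   /* current position in the expression string */
static double x_value;    /* the one value every symbol refers to */

static value_t parse_expr(void);

static void skip_ws(void) { while (isspace((unsigned char)*cur)) cur++; }
static double as_double(value_t v) { return v.is_float ? v.d : (double)v.i; }

/* Apply a binary operator; two INT operands stay in long arithmetic. */
static value_t apply(char op, value_t a, value_t b)
{
    value_t r = {0, 0, 0.0};
    double p = as_double(a), q = as_double(b);
    r.is_float = a.is_float || b.is_float;
    if (r.is_float)
        r.d = op == '+' ? p + q : op == '-' ? p - q : op == '*' ? p * q : p / q;
    else
        r.i = op == '+' ? a.i + b.i : op == '-' ? a.i - b.i
            : op == '*' ? a.i * b.i : a.i / b.i;   /* truncating long division */
    return r;
}

/* factor := number | symbol | - factor | + factor | ( expr ) */
static value_t parse_factor(void)
{
    value_t v = {0, 0, 0.0};
    char *end;
    skip_ws();
    if (*cur == '-' || *cur == '+') {
        char sign = *cur++;
        v = parse_factor();
        if (sign == '-') { v.i = -v.i; v.d = -v.d; }
    } else if (*cur == '(') {
        cur++;
        v = parse_expr();
        skip_ws();
        if (*cur == ')') cur++;
    } else if (isalpha((unsigned char)*cur)) {
        while (isalnum((unsigned char)*cur)) cur++;   /* any symbol name means x */
        v.is_float = 1;
        v.d = x_value;
    } else {
        const char *p = cur;
        while (isdigit((unsigned char)*p)) p++;
        if (*p == '.') { v.is_float = 1; v.d = strtod(cur, &end); }  /* FLOAT */
        else           { v.i = strtol(cur, &end, 10); }              /* INT */
        cur = end;
    }
    return v;
}

/* term := factor, then any number of "* factor" or "/ factor" */
static value_t parse_term(void)
{
    value_t v = parse_factor();
    for (;;) {
        char op;
        skip_ws();
        if (*cur != '*' && *cur != '/') return v;
        op = *cur++;
        v = apply(op, v, parse_factor());
    }
}

/* expr := term, then any number of "+ term" or "- term" */
static value_t parse_expr(void)
{
    value_t v = parse_term();
    for (;;) {
        char op;
        skip_ws();
        if (*cur != '+' && *cur != '-') return v;
        op = *cur++;
        v = apply(op, v, parse_term());
    }
}

/* Evaluate a transform expression for a single data value x. */
double eval_transform(const char *expression, double x)
{
    assert(expression != NULL);
    cur = expression;
    x_value = x;
    return as_double(parse_expr());
}
```

With this sketch, eval_transform("alpha + 3*beta + 5", 2.0) and eval_transform("4*x + 5", 2.0) give the same result, because both symbol names are bound to the same value.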

+ +

 

+ +

Data Transform Implementation

+ +

When the data transform property of a dataset transfer property list is set, a parse tree of the expression is immediately generated and its root is saved in the property list. The generation of the parse tree involves several steps.

+ +

 

+ +

First, the expression is reduced, so as to simplify the final parse and speed up the transform operations. Expressions such as “(5/9.0) * (x-32)” will be reduced to “.555555*(x-32).” While further simplification is algebraically possible, the data transform code will only reduce trivial arithmetic operations.

+ +

 

+ +

Then, this reduced expression is parsed into a set of tokens, from which the parse tree is generated. From the expression “(5/9.0)*(x-32),” for example, the following parse tree would be created:

              *
            /   \
      .555555     -
                /   \
               x     32

+ +

          

+ +

H5Dread with Data Transform Expressions

+ +

When a read is performed with a dataset transfer property list that has the data transform property set, the following sequence of events occurs:

+ +

 

+ +
    +
  1. A piece of the file is read into memory.
  2. The data transform is performed on this piece of memory.
  3. This piece of memory is then copied to the user.
  4. Steps 1-3 are repeated until the read is complete.
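The read sequence above can be sketched with plain arrays standing in for the file and the user's buffer. This is a minimal illustration under stated assumptions, not the library's I/O path: read_with_transform and the piece size PIECE are hypothetical, the data is assumed to be double, and the transform is hard-coded as “(5/9.0)*(x-32)”.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PIECE 4   /* hypothetical number of elements handled per pass */

/* file[] stands in for the stored dataset, user[] for the caller's buffer. */
void read_with_transform(const double *file, double *user, size_t n)
{
    double piece[PIECE];
    size_t done = 0;

    assert(file != NULL && user != NULL);
    while (done < n) {                                   /* step 4: repeat */
        size_t i, count = (n - done < PIECE) ? n - done : PIECE;

        /* step 1: a piece of the file is read into memory */
        memcpy(piece, file + done, count * sizeof piece[0]);

        /* step 2: the transform is applied to that piece only */
        for (i = 0; i < count; i++)
            piece[i] = (5 / 9.0) * (piece[i] - 32);

        /* step 3: the transformed piece is copied to the user */
        memcpy(user + done, piece, count * sizeof piece[0]);
        done += count;
    }
}
```

Because the transform only ever touches the scratch piece, the array standing in for the file is never written to.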
+ +

 

+ +

Step 2 works like this:

+ +

 

+ +
    +
  1. The function responsible for doing the transform is passed a buffer and is informed what type of data is inside this buffer and how many elements there are.
  2. This buffer is then treated as the variable in the data transform expression and the transform expression is applied.
  3. The transformed buffer is returned to the library.
+ +

 

+ +

If the transform expression is “(5/9.0)*(x-32),” with the parse tree shown above, and the buffer contains [-10 0 10 50 100], then the intermediate steps involved in the transform are:

+ +

 

+ +
    +
  1. First, the (x-32) subexpression is evaluated. Now the buffer would contain [-42 -32 -22 18 68].
  2. Then, the .555555 * part of the expression is evaluated. Now the buffer would contain [-23.3333 -17.7777 -12.2222 9.9999 37.7777].
  3. Now, the transform would be completed and the resulting buffer returned.
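The intermediate steps above amount to walking the parse tree from the leaves upward, applying each node to the whole buffer. A minimal sketch, assuming a buffer of double and using the hypothetical helpers buffer_sub, buffer_scale and fahrenheit_to_celsius:

```c
#include <assert.h>
#include <stddef.h>

/* The "x - 32" node: subtract a constant from every element in place. */
void buffer_sub(double *buf, size_t n, double c)
{
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] -= c;
}

/* The ".555555 *" node: multiply every element by a constant in place. */
void buffer_scale(double *buf, size_t n, double c)
{
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] *= c;
}

/* Apply "(5/9.0)*(x-32)" by evaluating the subtree first, then the root. */
void fahrenheit_to_celsius(double *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    buffer_sub(buf, n, 32.0);        /* [-10 0 10 50 100] -> [-42 -32 -22 18 68] */
    buffer_scale(buf, n, 5 / 9.0);   /* constant already reduced to one factor */
}
```

The sketch multiplies by the full-precision value of 5/9.0 rather than the six-digit constant shown in the text, so the fourth element comes out as 10 rather than 9.9999.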
+ +

 

+ +

Note that the original data in the file was not modified.

+ +

 

+ +

H5Dwrite with Data Transform Expressions

+ +

The process of a write works much the same way, but in the reverse order. When a file is written out with a dataset transfer property list that has the data transform property set:

+ +

 

+ +
    +
  1. The user passes a buffer to H5Dwrite, along with the type and number of elements.
  2. The data transform is performed on a copy of this piece of memory.
  3. This copy with the transformed data is then written out to the file.
+ +

 

+ +

Step 2 works exactly as in the read example. Note that the user’s data is not modified. Also, since the transform property is not saved with the dataset, a user who wants to recover the original data must know the inverse of the transform that was applied. In the case of “(5/9.0)*(x-32)” this inverse would be “(9/5.0)*x + 32”. Reading from a data file that had previously been written out with a transform string of “(5/9.0)*(x-32)” with a transform string of “(9/5.0)*x + 32” would effectively recover the original data the author of the file had been using.[2]
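The write and inverse-read round trip can be sketched with an array standing in for the file. The function names write_f_to_c and read_c_to_f are hypothetical and the data is assumed to be double. Recovery is only approximate, since both transforms are carried out in floating point.

```c
#include <assert.h>
#include <stddef.h>

/* Write path: the transformed values go to the "file"; user[] is not modified. */
void write_f_to_c(const double *user, double *file, size_t n)
{
    size_t i;
    assert(user != NULL && file != NULL);
    for (i = 0; i < n; i++)
        file[i] = (5 / 9.0) * (user[i] - 32);
}

/* Read path: the inverse transform "(9/5.0)*x + 32" recovers the original values. */
void read_c_to_f(const double *file, double *user, size_t n)
{
    size_t i;
    assert(user != NULL && file != NULL);
    for (i = 0; i < n; i++)
        user[i] = (9 / 5.0) * file[i] + 32;
}
```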

+ +

 

+ +

Mixed Mode and Truncation

+ +

Because the data transform sits between the file space and the memory space and modifies data as it passes through, various effects can occur as a result of the typecasting involved in these operations. In addition, because constants in the data transform expression can be either INT or FLOAT, the data transform itself can be a source of truncation.

+ +

 

+ +

In the example above, the reason that the transform expression is always written as “(5/9.0)*(x-32)” is because, if it were written without a floating point constant, it would always evaluate to 0. The expression “(5/9)*(x-32)” would, when set, get reduced to “0*(x-32)” because both 5 and 9 would get read as C long ints and, when divided, the result would get truncated to 0. This resulting expression, “0*(x-32),” would cause any data read or written to be saved as an array of all 0’s.
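The same truncation is easy to reproduce in plain C, where dividing two integer constants also discards the fraction. The two functions below are hypothetical illustrations. They use C int constants rather than the long ints of the transform parser, which does not change the result here.

```c
#include <assert.h>

/* Integer constants: 5 / 9 is evaluated first and truncates to 0. */
double f_to_c_int(double x)
{
    assert(5 / 9 == 0);   /* integer division truncates toward zero */
    return (5 / 9) * (x - 32);
}

/* One floating point constant forces the division into double arithmetic. */
double f_to_c_float(double x)
{
    return (5 / 9.0) * (x - 32);
}
```

For an input of 212, the first function returns 0 while the second returns a value very close to 100.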

+ +

 

+ +

Another source of unpredictability caused by truncation occurs when intermediate data is of a type that is more precise than the destination memory type. For example, if the transform expression “(1/2.0)*x” is applied to data read from a file that is being read into an integer memory buffer, the results can be unpredictable. If the source array is [1 2 3 4], then the resulting array could be either [0 1 1 2] or [0 0 1 1], depending on the floating point unit of the processors. Note that this result is independent of the source data type. It doesn’t matter if the source data is integer or floating point because the 2.0 in the data transform expression will cause everything to be evaluated in a floating-point context.

+ +

 

+ +

When setting transform expressions, care must be taken to ensure that the truncation does not adversely affect the data. A workaround for the possible effects of a transform such as “(1/2.0) * x” would be to use the transform expression “(1/2.0)*x + 0.5” instead of the original. With the added 0.5, truncation behaves like rounding to the nearest integer for non-negative data, so a half value such as 0.5 always becomes 1, with the possible exception of a boundary condition.
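The effect of the workaround can be sketched with an explicit cast standing in for the conversion into an integer memory buffer. The helper names halve_truncate and halve_round are hypothetical. The sketch assumes IEEE 754 doubles, where all intermediate values for the input [1 2 3 4] are exactly representable, so it shows only one of the two outcomes mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>

/* "(1/2.0)*x" followed by truncation into an integer buffer. */
void halve_truncate(const int *src, int *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (int)((1 / 2.0) * src[i]);         /* [1 2 3 4] -> [0 1 1 2] */
}

/* "(1/2.0)*x + 0.5" followed by the same truncation. */
void halve_round(const int *src, int *dst, size_t n)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = (int)((1 / 2.0) * src[i] + 0.5);   /* [1 2 3 4] -> [1 1 2 2] */
}
```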

+ +

 

+ +

Data Transform Example

+ +

The following code snippet shows an example using data transform, where the data transform property is set and a write is performed. Then, a read is performed with no data transform property set. It is assumed that dataset is a dataset that has been opened and windchillF and windchillC are both arrays that hold floating point data. The result of this snippet is to fill windchillC with the data in windchillF, converted to Celsius.

+ +

 

+ +

hid_t dxpl_id_f_to_c;
herr_t status;
const char* f_to_c = "(5/9.0)*(x-32)";

/* Create the dataset transfer property list */
    dxpl_id_f_to_c = H5Pcreate(H5P_DATASET_XFER);

/* Set the data transform to be used on the write */
    H5Pset_data_transform(dxpl_id_f_to_c, f_to_c);

/*
 * Write the data to the dataset using the f_to_c transform
 */
    status = H5Dwrite(dataset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, dxpl_id_f_to_c, windchillF);

/* Read the data back with no data transform set */
    status = H5Dread(dataset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, windchillC);

+ +

 

+ +

H5Pget_data_transform Details

+ +

Querying the data transform string of a dataset transfer property list requires the use of the H5Pget_data_transform function. This function provides the ability to both query the size of the string stored and retrieve part or all of it. Note that H5Pget_data_transform will return the expression that was set by H5Pset_data_transform. The reduced transform string, computed when H5Pset_data_transform is called, is not stored in string form and is not available to the user.

+ +

 

+ +

In order to ascertain the size of the string, a NULL expression should be passed to the function. This will make the function return the length of the transform string (not including the terminating ‘\0’ character).

+ +

 

+ +

To actually retrieve the string, a pointer to a valid memory location should be passed in for expression and the number of bytes from the string that should be copied to that memory location should be passed in as size.
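The two-call pattern described above (ask for the length first, then retrieve the string) can be sketched without the library. Here get_transform is a hypothetical stand-in that only mimics the described behavior and is not H5Pget_data_transform itself. In particular, the assumptions that it copies at most size - 1 characters and always null-terminates belong to this sketch. copy_transform is likewise a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the query call: returns the length of the stored string
 * (without the terminator) and, if expression is not NULL, copies as much
 * of the string as fits in size bytes. */
long get_transform(const char *stored, char *expression, size_t size)
{
    size_t len;

    assert(stored != NULL);
    len = strlen(stored);
    if (expression != NULL && size > 0) {
        size_t ncopy = (size - 1 < len) ? size - 1 : len;
        memcpy(expression, stored, ncopy);
        expression[ncopy] = '\0';
    }
    return (long)len;
}

/* Query the length with a NULL buffer, allocate, then retrieve the string.
 * The caller is responsible for freeing the result. */
char *copy_transform(const char *stored)
{
    long len = get_transform(stored, NULL, 0);
    char *buf = malloc((size_t)len + 1);

    if (buf != NULL)
        get_transform(stored, buf, (size_t)len + 1);
    return buf;
}
```

Sizing the buffer from the first call avoids guessing a maximum expression length up front.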

+ +

 

+ +

Further Work

+ +

Some additional functionality can still be added to the data transform. Currently the most important missing feature is support for additional operators and functions, such as exponentiation and the trigonometric functions. Although exponentiation can be carried out explicitly with a transform expression such as “x*x*x,” it may be easier to support expressions like “x^3.” Also lacking are the commonly used trigonometric functions, such as sin, cos, and tan.

+ +

 

+ +

Popular constants could also be added, such as π or e.

+ +

 

+ +

More advanced functionality, such as the ability to perform a transform on multiple datasets, is also a possibility, but such a feature is more a completely new addition than an extension to data transforms.

+ +
+ +

+ +
+ + + +
+ +

[1] Mr. Wendling, who was involved in the initial design and implemented the expression parser, has left NCSA.

+ +
+ +
+ +

[2] See the h5_dtransform.c example in the examples directory of the hdf5 library for just such an illustration.

+ +
+ +