Diffstat (limited to 'funtools/doc/pod/funtext.pod')
-rw-r--r-- funtools/doc/pod/funtext.pod | 718
1 files changed, 718 insertions, 0 deletions
diff --git a/funtools/doc/pod/funtext.pod b/funtools/doc/pod/funtext.pod
new file mode 100644
index 0000000..9e20d1f
--- /dev/null
+++ b/funtools/doc/pod/funtext.pod
@@ -0,0 +1,718 @@
+=pod
+
+=head1 NAME
+
+
+
+B<Funtext: Support for Column-based Text Files>
+
+
+
+=head1 SYNOPSIS
+
+
+
+
+
+This document contains a summary of the options for processing column-based
+text files.
+
+
+
+=head1 DESCRIPTION
+
+
+
+
+
+
+Funtools will automatically sense and process "standard"
+column-based text files as if they were FITS binary tables without any
+change in Funtools syntax. In particular, you can filter text files
+using the same syntax as FITS binary tables:
+
+ fundisp foo.txt'[cir 512 512 .1]'
+ fundisp -T foo.txt > foo.rdb
+ funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits
+
+
+
+The first example displays a filtered selection of a text file. The
+second example converts a text file to an RDB file. The third example
+converts a filtered selection of a text file to a FITS binary table.
+
+
+Text files can also be used in Funtools image programs. In this case,
+you must provide binning parameters (as with raw event files), using
+the bincols keyword specifier:
+
+
+ bincols=([xname[:tlmin[:tlmax[:binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]]])
+
+
+For example:
+
+ funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10"
+
+
+B<Standard Text Files>
+
+
+Standard text files have the following characteristics:
+
+
+
+=over 4
+
+
+
+
+=item *
+
+Optional comment lines start with #
+
+
+=item *
+
+Optional blank lines are considered comments
+
+
+=item *
+
+An optional table header consists of the following (in order):
+
+
+=over 4
+
+
+
+
+=item *
+
+a single line of alpha-numeric column names
+
+
+=item *
+
+an optional line of unit strings containing the same number of cols
+
+
+=item *
+
+an optional line of dashes containing the same number of cols
+
+
+=back
+
+
+
+
+=item *
+
+Data lines follow the optional header and (for the present) consist of
+ the same number of columns as the header.
+
+
+=item *
+
+Columns are separated by standard delimiters such as space, tab, comma,
+semi-colon, and bar.
+
+
+=back
+
+
+
+
+Examples:
+
+
+ # rdb file
+ foo1 foo2 foo3 foos
+ ---- ---- ---- ----
+ 1 2.2 3 xxxx
+ 10 20.2 30 yyyy
+
+ # multiple consecutive whitespace and dashes
+ foo1 foo2 foo3 foos
+ --- ---- ---- ----
+ 1 2.2 3 xxxx
+ 10 20.2 30 yyyy
+
+ # comma delims and blank lines
+ foo1,foo2,foo3,foos
+
+ 1,2.2,3,xxxx
+ 10,20.2,30,yyyy
+
+ # bar delims with null values
+ foo1|foo2|foo3|foos
+ 1||3|xxxx
+ 10|20.2||yyyy
+
+ # header-less data
+ 1 2.2 3 xxxx
+ 10 20.2 30 yyyy
+
+
+
+The default set of token delimiters consists of spaces, tabs, commas,
+semi-colons, and vertical bars. Several parsers are used
+simultaneously to analyze a line of text in different ways. One way
+of analyzing a line is to allow a combination of spaces, tabs, and
+commas to be squashed into a single delimiter (no null values between
+consecutive delimiters). Another way is to allow tab, semi-colon, and
+vertical bar delimiters to support null values, i.e. two consecutive
+delimiters imply a null value (e.g. RDB file). A successful parser
+is one which returns a consistent number of columns for all rows, with
+each column having a consistent data type. More than one parser can
+be successful. For now, it is assumed that successful parsers all
+return the same tokens for a given line. (Theoretically, there are
+pathological cases, which will be taken care of as needed). Bad parsers
+are discarded on the fly.
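+
+The idea of running several tokenizers and keeping the consistent ones
+can be sketched in Python. This is only an illustration, not the
+Funtools implementation: the names split_squash, split_nulls, PARSERS,
+and successful_parsers are hypothetical, only the column-count half of
+the test is shown (data type consistency is omitted), and a parser that
+never splits a line is treated as unsuccessful to keep the example short.
+

```python
import re

def split_squash(line):
    """Treat any run of spaces, tabs, and commas as a single delimiter."""
    return [tok for tok in re.split(r"[ \t,]+", line.strip()) if tok]

def split_nulls(line):
    """Treat every tab, semi-colon, or bar as a delimiter; empty fields are nulls."""
    return [tok.strip() for tok in re.split(r"[\t;|]", line.rstrip("\n"))]

# Hypothetical parser table; the names are for illustration only.
PARSERS = {"squash": split_squash, "nulls": split_nulls}

def successful_parsers(lines, parsers=PARSERS):
    """Return names of parsers that give one consistent column count.

    Assumption: a parser that never splits a line (one column everywhere)
    is treated as unsuccessful, to keep the example short.
    """
    good = []
    for name, parse in parsers.items():
        counts = {len(parse(line)) for line in lines}
        if len(counts) == 1 and counts != {1}:
            good.append(name)
    return good

bar_table = ["foo1|foo2|foo3|foos", "1||3|xxxx", "10|20.2||yyyy"]
print(successful_parsers(bar_table))  # ['nulls']
```

+
+In this sketch the bar-delimited example survives only under the
+null-preserving tokenizer, while a comma-delimited table would survive
+only under the squashing one.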
+
+
+If the header does not exist, then names "col1", "col2", etc. are
+assigned to the columns to allow filtering. Furthermore, data types
+for each column are determined by the data types found in the columns
+of the first data line, and can be one of the following: string, int,
+or double. Thus, all of the above examples return the following
+display:
+
+ fundisp foo'[foo1>5]'
+ FOO1 FOO2 FOO3 FOOS
+ ---------- --------------------- ---------- ------------
+ 10 20.20000000 30 yyyy
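+
+The default naming and first-row typing rules can be illustrated with a
+short Python sketch. The helper names infer_type and describe_columns
+are hypothetical, and Python's int() and float() accept a few spellings
+(such as "nan" or "1_000") that a real text parser may well reject, so
+treat this as an approximation of the rule rather than the actual code.
+

```python
def infer_type(token):
    """Classify one token as 'int', 'double', or 'string'."""
    for name, convert in (("int", int), ("double", float)):
        try:
            convert(token)
            return name
        except ValueError:
            pass
    return "string"

def describe_columns(first_row, header=None):
    """Pair each column name with the type inferred from the first data row.

    When no header is given, columns are named col1, col2, ... as described
    in the text.
    """
    names = header or [f"col{i}" for i in range(1, len(first_row) + 1)]
    return [(name, infer_type(tok)) for name, tok in zip(names, first_row)]

# Header-less example from the text: first data row "1 2.2 3 xxxx".
print(describe_columns(["1", "2.2", "3", "xxxx"]))
```

+
+For the header-less example, this yields col1 and col3 as int, col2 as
+double, and col4 as string.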
+
+
+B<Comments Convert to Header Params>
+
+
+Comments which precede data rows are converted into header parameters and
+will be written out as such using funimage or funhead. Two styles of comments
+are recognized:
+
+
+1. FITS-style comments have an equal sign "=" between the keyword and
+value and an optional slash "/" to signify a comment. The strict FITS
+rules on column positions are not enforced. In addition, strings only
+need to be quoted if they contain whitespace. For example, the following
+are valid FITS-style comments:
+
+
+ # fits0 = 100
+ # fits1 = /usr/local/bin
+ # fits2 = "/usr/local/bin /opt/local/bin"
+ # fits3c = /usr/local/bin /opt/local/bin /usr/bin
+ # fits4c = "/usr/local/bin /opt/local/bin" / path dir
+
+
+Note that the fits3c comment is not quoted and therefore its value is the
+single token "/usr/local/bin" and the comment is "opt/local/bin /usr/bin".
+This is different from the quoted comment in fits4c.
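+
+The quoting rule for FITS-style comments can be made concrete with a
+small Python sketch. The function parse_fits_comment is hypothetical and
+is not how Funtools itself is written; it assumes that values are kept
+as strings and that any trailing text not introduced by a slash is
+simply dropped.
+

```python
import re

def parse_fits_comment(line):
    """Split '# key = value [/ comment]' into (key, value, comment)."""
    match = re.match(r"\s*#\s*(\w+)\s*=\s*(.*)$", line)
    if match is None:
        return None
    key, rest = match.group(1), match.group(2).strip()
    if rest.startswith('"'):
        # Quoted value: everything up to the closing quote.
        end = rest.find('"', 1)
        if end == -1:
            value, rest = rest[1:], ""
        else:
            value, rest = rest[1:end], rest[end + 1:]
    else:
        # Unquoted value: the first whitespace-delimited token only.
        parts = rest.split(None, 1)
        value = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
    rest = rest.strip()
    # A slash after the value introduces the comment text.
    comment = rest[1:].strip() if rest.startswith("/") else None
    return key, value, comment

print(parse_fits_comment("# fits3c = /usr/local/bin /opt/local/bin /usr/bin"))
print(parse_fits_comment('# fits4c = "/usr/local/bin /opt/local/bin" / path dir'))
```

+
+With these rules, fits3c yields the value "/usr/local/bin" and the
+comment "opt/local/bin /usr/bin", while fits4c keeps both paths in the
+value and has the comment "path dir".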
+
+
+2. Free-form comments can have an optional colon separator between the
+keyword and value. In the absence of quotes, all tokens after the
+keyword are part of the value, i.e. no comment is allowed. If a string
+is quoted, then slash "/" after the string will signify a comment.
+For example:
+
+
+ # com1 /usr/local/bin
+ # com2 "/usr/local/bin /opt/local/bin"
+ # com3 /usr/local/bin /opt/local/bin /usr/bin
+ # com4c "/usr/local/bin /opt/local/bin" / path dir
+
+ # com11: /usr/local/bin
+ # com12: "/usr/local/bin /opt/local/bin"
+ # com13: /usr/local/bin /opt/local/bin /usr/bin
+ # com14c: "/usr/local/bin /opt/local/bin" / path dir
+
+
+
+Note that com3 and com13 are not quoted, so the whole string is part of
+the value, while com4c and com14c are quoted and have comments following
+the values.
+
+
+Some text files have column name and data type information in the header.
+You can specify the format of column information contained in the
+header using the "hcolfmt=" specification. See below for a detailed
+description.
+
+B<Multiple Tables in a Single File>
+
+
+Multiple tables are supported in a single file. If an RDB-style file
+is sensed, then a ^L (form feed) will signify end of
+table. Otherwise, an end of table is sensed when a new header (i.e.,
+all alphanumeric columns) is found. (Note that this heuristic does not
+work for single column tables where the column type is ASCII and the
+table that follows also has only one column.) You also can specify
+characters that signal an end of table condition using the B<eot=>
+keyword. See below for details.
+
+
+You can access the nth table (starting from 1) in a multi-table file
+by enclosing the table number in brackets, as with a FITS extension:
+
+
+ fundisp foo'[2]'
+
+The above example will display the second table in the file.
+(Index values start at 1 in order to maintain logical compatibility
+with FITS files, where extension numbers also start at 1).
+
+
+B<TEXT() Specifier>
+
+
+As with ARRAY() and EVENTS() specifiers for raw image arrays and raw
+event lists respectively, you can use TEXT() on text files to pass
+key=value options to the parsers. An empty set of keywords is
+equivalent to not having TEXT() at all, that is:
+
+
+ fundisp foo
+ fundisp foo'[TEXT()]'
+
+
+are equivalent. When indexing into a multi-table file, the table index
+number is placed before the TEXT() specifier, as the first token:
+
+ fundisp foo'[2,TEXT(...)]'
+
+
+The filter specification is placed after the TEXT() specifier, separated
+by a comma, or in an entirely separate bracket:
+
+
+ fundisp foo'[TEXT(...),circle 512 512 .1]'
+ fundisp foo'[2,TEXT(...)][circle 512 512 .1]'
+
+
+B<TEXT() Keyword Options>
+
+
+The following is a list of keywords that can be used within the TEXT()
+specifier (the first three are the most important):
+
+
+
+=over 4
+
+
+
+
+
+
+=item *
+
+delims="[delims]"
+
+
+Specify token delimiters for this file. Only a single parser having these
+delimiters will be used to process the file.
+
+ fundisp foo.fits'[TEXT(delims="!")]'
+ fundisp foo.fits'[TEXT(delims="\t%")]'
+
+
+
+
+
+=item *
+
+comchars="[comchars]"
+
+
+Specify comment characters. You must include "\n" to allow blank lines.
+These comment characters will be used for all standard parsers (unless delims
+are also specified).
+
+ fundisp foo.fits'[TEXT(comchars="!\n")]'
+
+
+
+
+
+=item *
+
+cols="[name1:type1 ...]"
+
+
+Specify names and data type of columns. This overrides header
+names and/or data types in the first data row or default names and
+data types for header-less tables.
+
+ fundisp foo.fits'[TEXT(cols="x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e")]'
+
+
+If the column specifier is the only keyword, then the cols= is not
+required (in analogy with EVENTS()):
+
+ fundisp foo.fits'[TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
+
+Of course, an index is allowed in this case:
+
+ fundisp foo.fits'[2,TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
+
+
+
+
+
+=item *
+
+eot="[eot delim]"
+
+
+Specify the end-of-table string for multi-table files. RDB
+files support ^L. The end of table specifier is a string and the whole
+string must be found alone on a line to signify EOT. For example:
+
+ fundisp foo.fits'[TEXT(eot="END")]'
+
+will end the table when a line containing "END" is found. Multiple lines
+are supported, so that:
+
+ fundisp foo.fits'[TEXT(eot="END\nGAME")]'
+
+will end the table when a line contains "END" followed by a line
+containing "GAME".
+
+In the absence of an EOT delimiter, a new table will be sensed when a new
+header (all alphanumeric columns) is found.
+
+
+
+
+=item *
+
+null1="[datatype]"
+
+
+Specify data type of a single null value in row 1.
+Since column data types are determined by the first row, a null value
+in that row will result in an error and a request to specify names and
+data types using cols=. If you have only one null in row 1, you don't
+need to specify all names and data types. Instead, use null1="type" to
+specify its data type.
+
+
+
+
+=item *
+
+alen=[n]
+
+
+Specify size in bytes for ASCII type columns.
+FITS binary tables only support fixed length ASCII columns, so a
+size value must be specified. The default is 16 bytes.
+
+
+
+
+=item *
+
+nullvalues=["true"|"false"]
+
+
+Specify whether to expect null values.
+Give the parsers a hint as to whether null values should be allowed. The
+default is to try to determine this from the data.
+
+
+
+
+=item *
+
+whitespace=["true"|"false"]
+
+
+Specify whether surrounding white space should be kept as part of
+string tokens. By default surrounding white space is removed from
+tokens.
+
+
+
+
+=item *
+
+header=["true"|"false"]
+
+
+Specify whether to require a header. This is needed by tables
+containing all string columns (and with no row containing dashes), in
+order to be able to tell whether the first row is a header or part of
+the data. The default is false, meaning that the first row will be
+data. If a row of dashes is present, the previous row is considered the
+column name row.
+
+
+
+
+=item *
+
+units=["true"|"false"]
+
+
+Specify whether to require a units line.
+Give the parsers a hint as to whether a row specifying units should be
+allowed. The default is to try to determine this from the data.
+
+
+
+
+=item *
+
+i2f=["true"|"false"]
+
+
+Specify whether to allow int to float conversions.
+If a column in row 1 contains an integer value, the data type for that
+column will be set to int. If a subsequent row contains a float in
+that same column, an error will be signaled. This flag specifies that,
+instead of an error, the float should be silently truncated to
+int. Usually, you will want an error to be signaled, so that you can
+specify the data type using cols= (or by changing the value of
+the column in row 1).
+
+
+
+
+=item *
+
+comeot=["true"|"false"|0|1|2]
+
+
+Specify whether comment signifies end of table.
+If comeot is 0 or false, then comments do not signify end of table and
+can be interspersed with data rows. If the value is true or 1 (the
+default for standard parsers), then non-blank comment lines (e.g. lines
+beginning with '#') signify end of table but blank lines are allowed
+between rows. If the value is 2, then all comments, including blank
+lines, signify end of table.
+
+
+
+
+=item *
+
+lazyeot=["true"|"false"]
+
+
+Specify whether "lazy" end of table should be permitted (default is
+true for standard formats, except rdb format where explicit ^L is required
+between tables). A lazy EOT can occur when a new table starts directly
+after an old one, with no special EOT delimiter. A check for this EOT
+condition is begun when a given row contains all string tokens. If, in
+addition, there is a mismatch between the number of tokens in the
+previous row and this row, or a mismatch between the number of string
+tokens in the prev row and this row, a new table is assumed to have
+been started. For example:
+
+ ival1 sval3
+ ----- -----
+ 1 two
+ 3 four
+
+ jval1 jval2 tval3
+ ----- ----- ------
+ 10 20 thirty
+ 40 50 sixty
+
+Here the line "jval1 ..." contains all string tokens. In addition,
+the number of tokens in this line (3) differs from the number of
+tokens in the previous line (2). Therefore a new table is assumed
+to have started. Similarly:
+
+ ival1 ival2 sval3
+ ----- ----- -----
+ 1 2 three
+ 4 5 six
+
+ jval1 jval2 tval3
+ ----- ----- ------
+ 10 20 thirty
+ 40 50 sixty
+
+Again, the line "jval1 ..." contains all string tokens. The number of
+string tokens in the previous row (1) differs from the number of string
+tokens in the current row (3). We therefore assume a new table has been
+started. This lazy EOT test is not performed if lazyeot is explicitly
+set to false.
+
+
+
+
+=item *
+
+hcolfmt=[header column format]
+
+
+Some text files have column name and data type information in the header.
+For example, VizieR catalogs have headers containing both column names
+and data types:
+
+ #Column e_Kmag (F6.3) ?(k_msigcom) K total magnitude uncertainty (4) [ucd=ERROR]
+ #Column Rflg (A3) (rd_flg) Source of JHK default mag (6) [ucd=REFER_CODE]
+ #Column Xflg (I1) [0,2] (gal_contam) Extended source contamination (10) [ucd=CODE_MISC]
+
+
+while Sextractor files have headers containing column names alone:
+
+
+ # 1 X_IMAGE Object position along x [pixel]
+ # 2 Y_IMAGE Object position along y [pixel]
+ # 3 ALPHA_J2000 Right ascension of barycenter (J2000) [deg]
+ # 4 DELTA_J2000 Declination of barycenter (J2000) [deg]
+
+The hcolfmt specification allows you to describe which header lines
+contain column name and data type information. It consists of a string
+defining the format of the column line, using "$col" (or "$name") to
+specify placement of the column name, "$fmt" to specify placement of the
+data format, and "$skip" to specify tokens to ignore. You also can
+specify tokens explicitly (or, for those users familiar with how
+sscanf works, you can specify scanf skip specifiers using "%*").
+For example, the VizieR hcolfmt above might be specified in several ways:
+
+ Column $col ($fmt) # explicit specification of "Column" string
+ $skip $col ($fmt) # skip one token
+ %*s $col ($fmt) # skip one string (using scanf format)
+
+while the Sextractor format might be specified using:
+
+ $skip $col # skip one token
+ %*d $col # skip one int (using scanf format)
+
+You must ensure that the hcolfmt statement only senses actual column
+definitions, with no false positives or negatives. For example, the
+first Sextractor specification, "$skip $col", will consider any header
+line containing two tokens to be a column name specifier, while the
+second one, "%*d $col", requires an integer to be the first token. In
+general, it is preferable to specify formats as explicitly as
+possible.
+
+
+Note that the VizieR-style header info is sensed automatically by the
+funtools standard VizieR-like parser, using the hcolfmt "Column $col
+($fmt)". There is no need for explicit use of hcolfmt in this case.
+
+
+
+
+=item *
+
+debug=["true"|"false"]
+
+
+Display debugging information during parsing.
+
+
+
+=back
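+
+The lazy end-of-table test described under lazyeot above can be
+summarized in a few lines of Python. This is a sketch of the rule as
+stated in the text, not the Funtools source: the names is_string and
+lazy_eot are hypothetical, rows are assumed to be already tokenized,
+and blank lines are assumed to have been skipped.
+

```python
def is_string(token):
    """True if the token cannot be read as a number."""
    try:
        float(token)
        return False
    except ValueError:
        return True

def lazy_eot(prev_tokens, cur_tokens):
    """True if the current row looks like the header of a new table."""
    # The check only starts when every token in the current row is a string.
    if not cur_tokens or not all(is_string(tok) for tok in cur_tokens):
        return False
    prev_strings = sum(is_string(tok) for tok in prev_tokens)
    cur_strings = len(cur_tokens)
    # A new table is assumed if the token counts differ or if the
    # number of string tokens differs between the two rows.
    return len(prev_tokens) != len(cur_tokens) or prev_strings != cur_strings

# First example: 2 tokens in the previous row, 3 in the current row.
print(lazy_eot(["3", "four"], ["jval1", "jval2", "tval3"]))
# Second example: same token count, but 1 string token versus 3.
print(lazy_eot(["4", "5", "six"], ["jval1", "jval2", "tval3"]))
```

+
+Both calls report a new table. A table made up entirely of string
+columns followed by a header with the same number of columns would not
+be detected by this test.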
+
+
+
+B<Environment Variables>
+
+
+Environment variables are defined to allow many of these TEXT() values to be
+set without having to include them in TEXT() every time a file is processed:
+
+
+ keyword environment variable
+ ------- --------------------
+ delims TEXT_DELIMS
+ comchars TEXT_COMCHARS
+ cols TEXT_COLUMNS
+ eot TEXT_EOT
+ null1 TEXT_NULL1
+ alen TEXT_ALEN
+ bincols TEXT_BINCOLS
+ hcolfmt TEXT_HCOLFMT
+
+
+B<Restrictions and Problems>
+
+
+As with raw event files, the '+' (copy extensions) specifier is not
+supported for programs such as funtable.
+
+
+String to int and int to string data conversions are allowed by the
+text parsers. This is done more by force of circumstance than by
+conviction: these transitions often happen with VizieR catalogs,
+which we want to support fully. One consequence of allowing these
+transitions is that the text parsers can get confused by columns which
+contain a valid integer in the first row and then switch to a
+string. Consider the following table:
+
+ xxx yyy zzz
+ ---- ---- ----
+ 111 aaa bbb
+ ccc 222 ddd
+
+The xxx column has an integer value in row one and a string in row two,
+while the yyy column has the reverse. The parser will erroneously
+treat the first column as having data type int:
+
+ fundisp foo.tab
+ XXX YYY ZZZ
+ ---------- ------------ ------------
+ 111 'aaa' 'bbb'
+ 1667457792 '222' 'ddd'
+
+while the second column is processed correctly. This situation can be avoided
+in any number of ways, all of which force the data type of the first column
+to be a string. For example, you can edit the file and explicitly quote the
+first row of the column:
+
+ xxx yyy zzz
+ ---- ---- ----
+ "111" aaa bbb
+ ccc 222 ddd
+
+ [sh] fundisp foo.tab
+ XXX YYY ZZZ
+ ------------ ------------ ------------
+ '111' 'aaa' 'bbb'
+ 'ccc' '222' 'ddd'
+
+You can edit the file and explicitly set the data type of the first column:
+
+ xxx:3A yyy zzz
+ ------ ---- ----
+ 111 aaa bbb
+ ccc 222 ddd
+
+ [sh] fundisp foo.tab
+ XXX YYY ZZZ
+ ------------ ------------ ------------
+ '111' 'aaa' 'bbb'
+ 'ccc' '222' 'ddd'
+
+You also can explicitly set the column names and data types of all columns,
+without editing the file:
+
+ [sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]'
+ XXX YYY ZZZ
+ ------------ ------------ ------------
+ '111' 'aaa' 'bbb'
+ 'ccc' '222' 'ddd'
+
+The issue of data type transitions (which to allow and which to disallow)
+is still under discussion.
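+
+As a side note on the erroneous display earlier in this section, the
+value 1667457792 is exactly what results from reading the ASCII bytes
+of "ccc", padded with a null byte, as a 32-bit big-endian integer. The
+Python check below confirms the arithmetic only. That the parser
+actually stores the string bytes in the integer field this way is an
+assumption, not something stated by the Funtools documentation.
+

```python
# ASCII "ccc" padded to 4 bytes with a trailing null: 0x63 0x63 0x63 0x00.
raw = b"ccc".ljust(4, b"\x00")

# Reinterpret the 4 bytes as a signed 32-bit big-endian integer.
value = int.from_bytes(raw, byteorder="big", signed=True)

print(hex(value))  # 0x63636300
print(value)       # 1667457792
```
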
+
+
+
+=head1 SEE ALSO
+
+
+
+See funtools(n) for a list of Funtools help pages.
+
+
+
+=cut